Import specific data from mongo into a pandas dataframe

Question

I have a large amount of data in a MongoDB collection that I need to analyze using pandas and pymongo in Jupyter. I am trying to import specific data into a dataframe.

Sample data:

{
    "stored": "2022-04-xx",
    ...
    ...
    "completedQueues": [
        "STATEMENT_FORWARDING_QUEUE",
        "STATEMENT_PERSON_QUEUE",
        "STATEMENT_QUERYBUILDERCACHE_QUEUE"
    ],
    "activities": [
        "https://example.com"
    ],
    "hash": "xxx",
    "agents": [
        "mailto:example@example.com"
    ],
    "statement": {                                  <=== I want to import the data from "statement"
        "authority": {
            "objectType": "Agent",
            "name": "xxx",
            "mbox": "mailto:example@example.com"
        },
        "stored": "2022-04-xxx",
        "context": {
            "platform": "Unknown",
            "extensions": {
                "http://example.com",
                "xxx.com": {
                    "user_agent": "xxx"
                },
                "http://example.com": ""
            }
        },
        "actor": {
            "objectType": "xxx",
            "name": "xxx",
            "mbox": "mailto:example@example.com"
        },
        "timestamp": "2022-04-xxx",
        "version": "1.0.0",
        "id": "xxx",
        "verb": {
            "id": "http://example.com",
            "display": {
                "en-US": "viewed"
            }
        },
        "object": {
            "objectType": "xxx",
            "id": "https://example.com",
            "definition": {
                "type": "http://example.com",
                "name": {
                    "en-US": ""
                },
                "description": {
                    "en-US": "Viewed"
                }
            }
        }
    },                                             <=== up to here
    "hasGeneratedId": true,
    ...
    ...
}

Notice that I am only interested in the data nested under "statement", not in any data that merely contains that string, i.e. the "STATEMENT_FORWARDING_QUEUE" above it.

What I am trying to accomplish is to import the data from "statement" (as indicated above) into a dataframe, arranged like this:

id	authority objectType	authority name	authority mbox	stored	context platform	context extensions	actor objectType	actor name	...
00	Agent	xxx	mailto	2022-	Unknown	http://1	xxx	xxx	...
01	Agent	yyy	mailto	2022-	Unknown	http://2	yyy	yyy	...

The idea is to be able to access any field, such as "authority name" or "actor objectType".

I have tried:

df = pd.DataFrame(list(collection.find(query)(filters)))
df = json_normalize(list(collection.find(query)(filters)))

with various queries, filters and slices, and also aggregate and map/reduce, but nothing produces the correct output.

I would also like to sort (newest to oldest) on the "stored" field (sort('$natural',-1)?), and maybe apply limit(xx) to the dataframe as well.

Any ideas?

Thanks in advance.

Answer 1

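Try this:

from pandas import json_normalize

df = json_normalize(list(
    collection.aggregate([
        {
            # keep only the documents matching the query
            "$match": query
        },
        {
            # promote the nested "statement" sub-document to the top level
            "$replaceRoot": {
                "newRoot": "$statement"
            }
        }
    ])
))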
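To illustrate what the resulting dataframe looks like, here is a minimal sketch that skips MongoDB entirely. The `statements` list is made-up sample data standing in for the output of the `$replaceRoot` stage, and the field values are placeholders. It assumes pandas 1.0 or later, where `json_normalize` is available as `pd.json_normalize`.

```python
import pandas as pd

# Hypothetical documents shaped like the output of the $replaceRoot stage:
# each one is just the former "statement" sub-document.
statements = [
    {
        "id": "00",
        "authority": {"objectType": "Agent", "name": "xxx"},
        "actor": {"objectType": "Agent", "name": "aaa"},
        "stored": "2022-04-02",
    },
    {
        "id": "01",
        "authority": {"objectType": "Agent", "name": "yyy"},
        "actor": {"objectType": "Group", "name": "bbb"},
        "stored": "2022-04-01",
    },
]

# json_normalize flattens nested dicts into dot-separated column names.
df = pd.json_normalize(statements)

# Nested fields are now ordinary columns, addressed by their dotted path.
authority_names = df["authority.name"].tolist()
actor_types = df["actor.objectType"].tolist()
print(sorted(df.columns))
```

So "authority name" from the question would be reached as `df["authority.name"]`, and "actor objectType" as `df["actor.objectType"]`.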

Answer 2

Thanks for the answer, @pavel. It is spot on and pretty much solves the problem.

I also added sorting and a limit, so if anyone is interested, the final code looks like this:

df = json_normalize(list(
  statements_coll.aggregate([
    {
        "$match": query
    },
    {
        "$replaceRoot": {
            "newRoot": "$statement"
        }
    },
    { 
        "$sort": { 
            "stored": -1 
        }
    },
    {
        "$limit": 10 
    }
  ]) 
))
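A possible variation, offered as a sketch rather than a tested recommendation: as far as I know, `$natural` orders by storage order rather than by the "stored" field, so an explicit `$sort` as above is the right tool. If the collection is large, it may also help to sort and limit before `$replaceRoot`, using the full path "statement.stored", since a `$sort` that follows a reshaping stage generally cannot use an index. The helper name `build_statement_pipeline` and the example query are hypothetical, and any speed-up assumes an index on "statement.stored" exists.

```python
def build_statement_pipeline(query, limit=10):
    """Return aggregation stages: filter, sort newest first, limit,
    then promote the nested "statement" sub-document to the top level."""
    return [
        {"$match": query},
        # Sort on the original (nested) field path, before reshaping.
        {"$sort": {"statement.stored": -1}},
        {"$limit": limit},
        {"$replaceRoot": {"newRoot": "$statement"}},
    ]


# Hypothetical query: only statements whose verb is displayed as "viewed".
pipeline = build_statement_pipeline(
    {"statement.verb.display.en-US": "viewed"}, limit=5
)
```

The result would be used the same way as before, e.g. `json_normalize(list(statements_coll.aggregate(pipeline)))`. Apart from the ordering of ties, it should return the same rows as the pipeline above.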


Answer 1
0 Accepted 2022-04-17 03:06:03

Answer 2
0 2022-04-17 10:55:02
