
How to import data from mongodb to pandas?

I have a large amount of data in a collection in mongodb which I need to analyze. How do I import that data to pandas?

I am new to pandas and numpy.

EDIT: The mongodb collection contains sensor values tagged with date and time. The sensor values are of float datatype.

Sample Data:

{
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}

pymongo might give you a hand; the following is some code I'm using:

import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)


    return conn[db]


def read_mongo(db, collection, query={}, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)

    # Expand the cursor and construct the DataFrame
    df =  pd.DataFrame(list(cursor))

    # Delete the _id
    if no_id:
        del df['_id']

    return df

You can load your mongodb data into a pandas DataFrame using this code. It works for me; hopefully it will for you too.

import pymongo
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.database_name
collection = db.collection_name
data = pd.DataFrame(list(collection.find()))

Monary does exactly that, and it's super fast.

See this cool post, which includes a quick tutorial and some timings.
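For illustration, a rough sketch of the Monary route (based on Monary's query API; the database, collection, and field names below are placeholders, and the fields are assumed to be flat numeric values):

from monary import Monary
import pandas as pd

monary = Monary('127.0.0.1')  # connect to the local mongod

# Pull two (assumed flat, numeric) fields directly into numpy arrays
arrays = monary.query(
    'my_db', 'my_collection',   # placeholder database / collection names
    {},                         # query spec: all documents
    ['a', 'b'],                 # fields to extract
    ['float64', 'float64'],     # numpy dtype for each field
)
monary.close()

df = pd.DataFrame(dict(zip(['a', 'b'], arrays)))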

As per PEP 20 (the Zen of Python), simple is better than complex:

import pandas as pd
df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())

You can include conditions as you would when working with a regular mongoDB database, or even use find_one() to get only one element from the database, etc.

and voila!
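As a small sketch of the conditions mentioned above (the database and collection names are placeholders; the filter uses the reportCount and sensorName fields from the sample document):

import pandas as pd
from pymongo import MongoClient

db = MongoClient()['my_database']  # placeholder database name

# Only documents with reportCount > 6; the projection drops _id up front
df = pd.DataFrame.from_records(
    db['my_collection'].find({'reportCount': {'$gt': 6}}, {'_id': 0})
)

# Or fetch just one document
doc = db['my_collection'].find_one({'sensorName': '56847890-0'})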

import pandas as pd
from odo import odo

data = odo('mongodb://localhost/db::collection', pd.DataFrame)

Another option I found very useful is:

from pandas.io.json import json_normalize

cursor = my_collection.find()
df = json_normalize(cursor)

(or json_normalize(list(cursor)), depending on your python/pandas versions).

This way you get the unfolding of nested mongodb documents for free.
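For the sample documents in the question, one way to unfold the nested Readings array explicitly is with record_path and meta (my_collection is the same placeholder collection as above):

from pandas.io.json import json_normalize

cursor = my_collection.find()
df = json_normalize(
    list(cursor),
    record_path='Readings',              # one row per element of the Readings array
    meta=['sensorName', 'reportCount'],  # parent-document fields repeated on each row
)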

For dealing with out-of-core (not fitting into RAM) data efficiently (i.e. with parallel execution), you can try the Python Blaze ecosystem: Blaze / Dask / Odo.

Blaze (and Odo) has out-of-the-box functions to deal with MongoDB.

A few useful articles to start off:

And an article which shows what amazing things are possible with the Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 GB of Reddit comments in seconds).

PS I'm not affiliated with any of these technologies.

Using

pandas.DataFrame(list(...))

will consume a lot of memory if the iterator/generator result is large.

It is better to generate small chunks and concatenate at the end:

import pandas as pd

def iterator2dataframes(iterator, chunk_size: int):
  """Turn an iterator into multiple small pandas.DataFrame

  This is a balance between memory and efficiency
  """
  records = []
  frames = []
  for i, record in enumerate(iterator):
    records.append(record)
    if i % chunk_size == chunk_size - 1:
      frames.append(pd.DataFrame(records))
      records = []
  if records:
    frames.append(pd.DataFrame(records))
  return pd.concat(frames)
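
Usage is then a one-liner over a pymongo cursor (here collection stands for your pymongo collection object and 10000 is an arbitrary chunk size):

df = iterator2dataframes(collection.find(), 10000)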

http://docs.mongodb.org/manual/reference/mongoexport

Export to CSV and use read_csv, or export to JSON and use DataFrame.from_records().
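A rough sketch of that route (database, collection, field, and file names are placeholders; the flags are standard mongoexport options):

# Export from the shell first, for example:
#   mongoexport --db=my_db --collection=SensorReport \
#       --type=csv --fields=sensorName,reportCount --out=sensors.csv
import pandas as pd

df = pd.read_csv('sensors.csv')

# Or export newline-delimited JSON (mongoexport's default) and load it back:
#   mongoexport --db=my_db --collection=SensorReport --out=sensors.json
import json

with open('sensors.json') as f:
    df = pd.DataFrame.from_records(json.loads(line) for line in f)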

Following this great answer by waitingkuo, I would like to add the possibility of doing this with a chunksize, in line with .read_sql() and .read_csv(). I extend the answer from Deu Leung by avoiding going one by one through each 'record' of the 'iterator' / 'cursor'. I will borrow the previous read_mongo function.

from pymongo import MongoClient
import pandas as pd


def read_mongo(db, collection, query={},
               host='localhost', port=27017,
               username=None, password=None,
               chunksize=100, no_id=True):
    """ Read from Mongo and Store into DataFrame, in chunks """

    # Connect to MongoDB
    # (username/password could be passed to _connect_mongo above if auth is needed)
    client = MongoClient(host=host, port=port)
    db_aux = client[db]

    # Chunk boundaries: every multiple of chunksize, plus the total count so the
    # last (possibly partial) chunk is covered too. count_documents replaces the
    # deprecated cursor.count().
    n_docs = db_aux[collection].count_documents(query)
    skips_variable = list(range(0, n_docs, int(chunksize)))
    skips_variable.append(n_docs)

    # Iteration to create the dataframe in chunks.
    for i in range(1, len(skips_variable)):

        # Expand the cursor slice and construct the DataFrame for this chunk
        df_aux = pd.DataFrame(list(
            db_aux[collection].find(query)[skips_variable[i - 1]:skips_variable[i]]
        ))

        if no_id:
            del df_aux['_id']

        # Concatenate the chunks into a unique df
        if 'df' not in locals():
            df = df_aux
        else:
            df = pd.concat([df, df_aux], ignore_index=True)

    return df

A similar approach to those of Rafael Valero, waitingkuo and Deu Leung, using pagination:

def read_mongo(db, collection, query=None,
               host='localhost', port=27017,
               username=None, password=None,
               chunksize=100, page_num=1, no_id=True):

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Calculate number of documents to skip
    skips = chunksize * (page_num - 1)

    # Sorry, this is in spanish
    # https://www.toptal.com/python/c%C3%B3digo-buggy-python-los-10-errores-m%C3%A1s-comunes-que-cometen-los-desarrolladores-python/es
    if not query:
        query = {}

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query).skip(skips).limit(chunksize)

    # Expand the cursor and construct the DataFrame
    df =  pd.DataFrame(list(cursor))

    # Delete the _id
    if no_id and '_id' in df:   # guard: a page past the last document is empty
        del df['_id']

    return df
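
One way to use this paginated version is to keep requesting pages until one comes back empty, then concatenate (the database and collection names below are placeholders, and 1000 is an arbitrary page size):

frames = []
page = 1
while True:
    chunk = read_mongo('my_db', 'my_collection', chunksize=1000, page_num=page)
    if chunk.empty:
        break   # a page past the last document returns an empty DataFrame
    frames.append(chunk)
    page += 1

df = pd.concat(frames, ignore_index=True)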

You can achieve what you want with pdmongo in three lines:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [], "mongodb://localhost:27017/mydb")

If your data is very large, you can run an aggregate query first, filtering out the data you do not want, and then mapping it to your desired columns.

Here is an example of mapping Readings.a to column a and filtering by the reportCount column:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [{'$match': {'reportCount': {'$gt': 6}}}, {'$unwind': '$Readings'}, {'$project': {'a': '$Readings.a'}}], "mongodb://localhost:27017/mydb")

read_mongo accepts the same arguments as pymongo's aggregate.

You can also use pymongoarrow, the official library provided by MongoDB for exporting mongodb data to pandas, NumPy, Parquet files, and so on.
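
A minimal sketch of that route, assuming pymongoarrow's patch_all helper (which adds find_pandas_all and friends to pymongo collections; the database and collection names are placeholders):

from pymongo import MongoClient
from pymongoarrow.monkey import patch_all

patch_all()  # adds find_pandas_all / find_numpy_all / find_arrow_all to Collection

client = MongoClient()
collection = client.my_db.my_collection

# All documents straight into a pandas DataFrame
df = collection.find_pandas_all({})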

You can use the "pandas.json_normalize" method:

import pandas as pd
display(pd.json_normalize( x ))
display(pd.json_normalize( x , record_path="Readings" ))

It should display two tables, where x is your cursor, or:

from bson import ObjectId
def ISODate(st):
    return st

x = {
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}
  1. Start mongo in the shell with: mongosh

  2. Scroll up in the shell until you see where mongo is connected to. It should look something like this: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4

  3. Copy and paste that into MongoClient.

  4. Here is the code:

from pymongo import MongoClient
import pandas as pd

client = MongoClient('mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4')

mydatabase = client.yourdatabasename
mycollection = mydatabase.yourcollectionname
cursor = mycollection.find()
listofDocuments = list(cursor)
df = pd.DataFrame(listofDocuments)
df
