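How do I import data from MongoDB into pandas?

Question

I have a large amount of data in a MongoDB collection that I need to analyze. How do I import that data into pandas?

I am new to pandas and numpy.

Edit: The MongoDB collection contains sensor values tagged with a date and time. The sensor values are of float type.

Sample data: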

{
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}

Answer 1

pymongo might help you. Here is some code I'm using:

from urllib.parse import quote_plus

import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        # Percent-escape the credentials so characters such as '@' or ':' do not break the URI
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (
            quote_plus(username), quote_plus(password), host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)

    return conn[db]


def read_mongo(db, collection, query=None, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Avoid a mutable default argument: build the empty query per call
    if query is None:
        query = {}

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id (the column is absent when the query matched nothing)
    if no_id and '_id' in df.columns:
        del df['_id']

    return df

Answer 2

You can load your MongoDB data into a pandas DataFrame with this code. It works for me. Hopefully it helps you too.

import pymongo
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.database_name
collection = db.collection_name
data = pd.DataFrame(list(collection.find()))

Answer 3

Monary does exactly that, and it is very fast. (another link)

See this cool post, which includes a quick tutorial and some timings.

Answer 4

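As PEP 20 puts it, simple is better than complex:

import pandas as pd
df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())

You can include conditions just as you would when working with a regular MongoDB database, or even use find_one() to get only one element from the database, and so on.

And voilà!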
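One detail about the find_one() case: it returns a single dict (or None when nothing matches) rather than a cursor, so one option is to wrap the result in a list before building the DataFrame. The sketch below uses a hypothetical dict in place of a real find_one() result, so it runs without a database.

```python
import pandas as pd

# Hypothetical stand-in for collection.find_one({...}):
# a single dict, or None if no document matched.
doc = {"sensorName": "56847890-0", "reportCount": 8}

# Wrap the single document in a list so it becomes one row.
df = pd.DataFrame.from_records([doc] if doc is not None else [])
```

Each key of the document becomes a column, and the frame has exactly one row.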

Answer 5

import pandas as pd
from odo import odo

data = odo('mongodb://localhost/db::collection', pd.DataFrame)

Answer 6

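Another option I found very useful is:

# On pandas 1.0 and later, use pd.json_normalize instead of this import
from pandas.io.json import json_normalize

cursor = my_collection.find()
df = json_normalize(cursor)

(or json_normalize(list(cursor)), depending on your Python/pandas version).

This way you get the unfolding of nested MongoDB documents for free.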
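To make the "unfolding" concrete, here is a minimal sketch that runs without a database. The documents and the `location` field are hypothetical stand-ins for what list(my_collection.find()) might return. It assumes pandas 1.0 or later, where json_normalize is available as pd.json_normalize.

```python
import pandas as pd

# Hypothetical documents standing in for list(my_collection.find()).
docs = [
    {"sensorName": "s-1", "location": {"site": "lab", "geo": {"lat": 1.5, "lon": 2.5}}},
    {"sensorName": "s-2", "location": {"site": "roof"}},
]

# Nested dicts become dotted column names; keys missing from a document become NaN.
flat = pd.json_normalize(docs)
```

Nested lists are not expanded this way. They stay as list objects in a single column unless you pass record_path.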

Answer 7

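To deal efficiently with out-of-core data (data that does not fit into RAM), i.e. with parallel execution, you can try the Python Blaze ecosystem: Blaze / Dask / Odo.

Blaze (and Odo) has out-of-the-box functions for dealing with MongoDB.

A few useful articles to start with:

Introducing Blaze Expressions (with a MongoDB query example)
ReproduceIt: Reddit word count
Difference between Dask Arrays and Blaze

And an article that shows what amazing things are possible with the Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 Gb of Reddit comments in seconds).

P.S. I'm not affiliated with any of these technologies.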

Answer 8

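Using

pandas.DataFrame(list(...))

will consume a lot of memory if the iterator/generator result is large.

It is better to generate small chunks and concatenate them at the end: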

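import pandas as pd


def iterator2dataframes(iterator, chunk_size: int):
  """Turn an iterator into multiple small pandas.DataFrame

  This is a balance between memory and efficiency
  """
  records = []
  frames = []
  for i, record in enumerate(iterator):
    records.append(record)
    # Flush a full chunk into its own DataFrame
    if i % chunk_size == chunk_size - 1:
      frames.append(pd.DataFrame(records))
      records = []
  # Flush the final, partially filled chunk
  if records:
    frames.append(pd.DataFrame(records))
  # pd.concat raises on an empty list, so handle an empty iterator explicitly
  return pd.concat(frames) if frames else pd.DataFrame()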
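If you only need an aggregate rather than the full table, a variation is to yield each chunk and reduce it immediately, so that only one chunk is held at a time. This is a minimal sketch, not part of the answer above: `iter_frames` is a hypothetical helper name, and a plain generator stands in for a pymongo cursor.

```python
from itertools import islice

import pandas as pd


def iter_frames(records, chunk_size):
    """Yield DataFrames of at most chunk_size rows from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield pd.DataFrame(chunk)


# A generator stands in for a pymongo cursor here (hypothetical data).
fake_cursor = ({"a": i, "b": i * 2.0} for i in range(7))

# Reduce each chunk as it arrives instead of concatenating everything.
sizes = []
total_b = 0.0
for frame in iter_frames(fake_cursor, chunk_size=3):
    sizes.append(len(frame))
    total_b += frame["b"].sum()
```

The trade-off is that you never get a single DataFrame back, so this only fits computations that can be combined across chunks, such as sums and counts.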

Answer 9

http://docs.mongodb.org/manual/reference/mongoexport

Export to CSV and use read_csv, or export to JSON and use DataFrame.from_records().
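For the JSON route, a rough sketch follows. By default mongoexport writes one Extended JSON document per line, with ObjectId values wrapped as {"$oid": ...}. The text below is a hypothetical export held in memory (the second _id is made up), so the example runs without a file or a database.

```python
import io
import json

import pandas as pd

# Hypothetical mongoexport output: one Extended JSON document per line.
exported = io.StringIO(
    '{"_id": {"$oid": "515a963b78f6a035d9fa531b"}, "sensorName": "56847890-0", "reportCount": 8}\n'
    '{"_id": {"$oid": "515a963b78f6a035d9fa531c"}, "sensorName": "56847890-1", "reportCount": 3}\n'
)

# Parse each non-empty line into a dict, then build the DataFrame.
records = [json.loads(line) for line in exported if line.strip()]
df = pd.DataFrame.from_records(records)

# _id arrives as {"$oid": "..."}; keep only the hex string.
df["_id"] = df["_id"].map(lambda v: v["$oid"])
```

With a real export you would open the file instead of using io.StringIO. Dates are wrapped in a similar way as {"$date": ...} and need the same kind of unwrapping.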

Answer 10

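Following this great answer by waitkuo, I would like to add the possibility of doing that with a chunksize, in line with .read_sql() and .read_csv(). I extend Deu Leung's answer by avoiding going one by one through each "record" of the "iterator" / "cursor". I will borrow the previous read_mongo function.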

import pandas as pd
from pymongo import MongoClient


def read_mongo(db,
               collection, query=None,
               host='localhost', port=27017,
               username=None, password=None,
               chunksize=100, no_id=True):
    """ Read from Mongo in chunks and Store into DataFrame """
    if query is None:
        query = {}

    # Connect to MongoDB
    # db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
    client = MongoClient(host=host, port=port)
    # Select the specific DB and Collection
    coll = client[db][collection]

    # count_documents() replaces Cursor.count(), which PyMongo 4 removed
    total = coll.count_documents(query)
    chunksize = int(chunksize)

    # Fetch one chunk per iteration; the last chunk may be shorter
    frames = []
    for skip in range(0, total, chunksize):
        cursor = coll.find(query).skip(skip).limit(chunksize)
        df_aux = pd.DataFrame(list(cursor))

        if no_id and '_id' in df_aux.columns:
            del df_aux['_id']

        frames.append(df_aux)

    # Concatenate the chunks into a single df
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

Answer 11

A similar approach using pagination, like Rafael Valero, waitkuo and Deu Leung:

def read_mongo(
        db,
        collection, query=None,
        host='localhost', port=27017, username=None, password=None,
        chunksize=100, page_num=1, no_id=True):

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Calculate number of documents to skip
    skips = chunksize * (page_num - 1)

    # Use None instead of a mutable {} default. Sorry, this is in spanish
    # https://www.toptal.com/python/c%C3%B3digo-buggy-python-los-10-errores-m%C3%A1s-comunes-que-cometen-los-desarrolladores-python/es
    if not query:
        query = {}

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query).skip(skips).limit(chunksize)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id (the column is absent when the page is empty)
    if no_id and '_id' in df.columns:
        del df['_id']

    return df

Answer 12

You can achieve what you want with pdmongo in three lines:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [], "mongodb://localhost:27017/mydb")

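If your data is very large, you can run an aggregation query first to filter out the data you do not want, then map what remains to the columns you need.

Here is an example that maps Readings.a to column a and filters by the reportCount column: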

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [{'$match': {'reportCount': {'$gt': 6}}}, {'$unwind': '$Readings'}, {'$project': {'a': '$Readings.a'}}], "mongodb://localhost:27017/mydb")

read_mongo accepts the same arguments as pymongo's aggregate.

Answer 13

You can also use pymongoarrow. It is an official library offered by MongoDB for exporting MongoDB data to pandas, NumPy, Parquet files, etc.

Answer 14

You can use the "pandas.json_normalize" method:

import pandas as pd
display(pd.json_normalize( x ))
display(pd.json_normalize( x , record_path="Readings" ))

It should display two tables, where x is your cursor or:

from bson import ObjectId
def ISODate(st):
    return st

x = {
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}
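Building on the record_path call above, the following sketch also carries sensorName onto every reading through the meta argument and turns the timestamps into a datetime index, which suits the time-stamped sensor data in the question. The document is a trimmed copy of the sample with dates kept as plain ISO strings. That is an assumption for illustration: pymongo itself returns datetime objects.

```python
import pandas as pd

# Trimmed copy of the question's document, with dates as plain ISO strings.
doc = {
    "sensorName": "56847890-0",
    "reportCount": 8,
    "Readings": [
        {"a": 0.958069536790466, "b": 6.296118156595,
         "ReadingUpdatedDate": "2013-04-02T08:26:35.297Z"},
        {"a": 0.95574014778624, "b": 6.29651468650064,
         "ReadingUpdatedDate": "2013-04-02T08:27:09.963Z"},
    ],
}

# One row per reading, with the parent's sensorName repeated on each row.
readings = pd.json_normalize(doc, record_path="Readings", meta=["sensorName"])

# Parse the timestamps and use them as a sorted index.
readings["ReadingUpdatedDate"] = pd.to_datetime(readings["ReadingUpdatedDate"])
readings = readings.set_index("ReadingUpdatedDate").sort_index()
```

From here, time-based operations such as readings["a"].resample("1min").mean() become available.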

Answer 15

Start mongo in the shell with: mongosh
Scroll up in the shell until you see where mongo connects to. It should look something like this: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4
Copy and paste that into MongoClient.
Here is the code:

from pymongo import MongoClient
import pandas as pd

client = MongoClient('mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4')

mydatabase = client.yourdatabasename
mycollection = mydatabase.yourcollectionname
cursor = mycollection.find()
listofDocuments = list(cursor)
df = pd.DataFrame(listofDocuments)
df

15 answers

Answer 1: 150 (accepted), 2013-04-27 18:45:56
Answer 2: 48, 2014-12-23 09:15:23
Answer 3: 24, 2013-12-19 22:33:39
Answer 4: 21, 2016-10-23 11:43:19
Answer 5: 14, 2016-10-20 23:33:06
Answer 6: 13, 2018-03-29 08:57:17
Answer 7: 9, 2016-09-27 00:16:38
Answer 8: 5, 2016-09-12 08:19:15
Answer 9: 3, 2013-04-27 11:32:35
Answer 10: 1, 2018-03-06 10:43:53
Answer 11: 1, 2018-03-20 01:19:26
Answer 12: 1, 2020-08-05 20:19:10
Answer 13: 0, 2021-05-11 16:57:03
Answer 14: 0, 2021-11-07 03:02:47
Answer 15: 0, 2022-08-24 19:48:52
