
Export data to CSV from MongoDB using Python

I am having problems exporting to CSV with a Python script. Some array data needs to be exported from MongoDB to CSV, but the following script does not export it properly because the three subfield values are dumped into a single column. I want to split the three fields (order, text, answerId) under the answers field into three separate columns in the CSV.

A sample MongoDB document:

"answers": [
        {
            "order": 0,
            "text": {
                "en": "Yes"
            },
            "answerId": "527d65de7563dd0fb98fa28c"
        },
        {
            "order": 1,
            "text": {
                "en": "No"
            },
            "answerId": "527d65de7563dd0fb98fa28b"
        }
    ]

The Python script:

import csv

cursor = db.questions.find({}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})
cursor = list(cursor)
with open('answer_2.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerid']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for x in cursor:
        for y, v in x.iteritems():  # Python 2; use x.items() on Python 3
            if y == 'answers':
                print(y, v)
                write.writerow(v)
                write.writerow(x)

So... the problem is that the csv writer doesn't understand the concept of "subdictionaries" as Mongo returns them.

If I understood correctly, when you query Mongo, you get a dictionary like this:

{
   "_id": "a hex ID that correspond with the record that contains several answers",
   "answers": [ ... a list with a bunch of dicts in it... ]
}

So when csv.DictWriter tries to write that, it only writes one dictionary (the topmost one). It doesn't know (or care) that answers is a list containing dictionaries whose values need to be written into columns as well. (Accessing fields in subdocuments with dot notation, such as answers.order, is only understood by Mongo, not by the csv writer.)

What I understand you should do is "walk" the list of answers and create one dictionary out of each record (each dictionary) in that list. Once you have a list of "flattened" dictionaries, you can pass those along and write them to your csv file:

import csv

from pymongo import MongoClient

client = MongoClient()  # assumes a MongoDB instance running locally

cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

# Step 1: Create the list of dictionaries (one dictionary per entry in the `answers` list)
flattened_records = []
for answers_record in cursor:
    answers_record_id = answers_record['_id']
    for answer_record in answers_record['answers']:
        flattened_record = {
            '_id': answers_record_id,
            'answers.order': answer_record['order'],
            'answers.text': answer_record['text'],
            'answers.answerId': answer_record['answerId']
        }
        flattened_records.append(flattened_record)

# Step 2: Iterate through the list of flattened records and write them to the csv file
with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for flattened_record in flattened_records:
        write.writerow(flattened_record)

Watch out for the use of plurals: answers_record is different from answer_record.

That creates a file like this:

$ cat ./stack_039.csv
_id,answers.order,answers.text,answers.answerId
580f9aa82de54705a2520833,0,{u'en': u'Yes'},527d65de7563dd0fb98fa28c
580f9aa82de54705a2520833,1,{u'en': u'No'},527d65de7563dd0fb98fa28b
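Note that the answers.text column ends up holding the dict repr ({u'en': u'Yes'}) rather than the answer text itself. If you'd rather have the plain string in that column, you can pull out the 'en' value while flattening. A minimal, self-contained sketch (using an in-memory sample shaped like the cursor results above, and assuming every text subdocument has an 'en' key):

```python
import csv
import io

# Sample records shaped like the documents the cursor returns
records = [
    {'_id': '580f9aa82de54705a2520833',
     'answers': [
         {'order': 0, 'text': {'en': 'Yes'}, 'answerId': '527d65de7563dd0fb98fa28c'},
         {'order': 1, 'text': {'en': 'No'}, 'answerId': '527d65de7563dd0fb98fa28b'},
     ]},
]

fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
out = io.StringIO()  # stands in for the real output file
writer = csv.DictWriter(out, fieldnames=fields)
writer.writeheader()
for rec in records:
    for answer in rec['answers']:
        writer.writerow({
            '_id': rec['_id'],
            'answers.order': answer['order'],
            # .get('en', '') guards against documents missing the 'en' key
            'answers.text': answer['text'].get('en', ''),
            'answers.answerId': answer['answerId'],
        })
print(out.getvalue())
```

With this change the text column contains Yes / No instead of the dict repr.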

EDIT:

Your query (the one that makes cursor = db.questions.find({}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})) will return all the entries in the questions collection. If this collection is very large, you might want to use the cursor as an iterator.

As you might have already realized, the first for loop in my code above puts all the records into a list (the flattened_records list). You can do lazy loading by iterating through the cursor instead: rather than loading all the items into memory, fetch one, do something with it, fetch the next, and so on.

It's slightly slower, but more memory efficient.

import csv

from pymongo import MongoClient

client = MongoClient()  # assumes a MongoDB instance running locally

cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for answers_record in cursor:  # Here we are using 'cursor' as an iterator
        answers_record_id = answers_record['_id']
        for answer_record in answers_record['answers']:
            flattened_record = {
                '_id': answers_record_id,
                'answers.order': answer_record['order'],
                'answers.text': answer_record['text'],
                'answers.answerId': answer_record['answerId']
            }
            write.writerow(flattened_record)

It will produce the same .csv file as shown above.
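As a side note, the flattening can also be pushed to the server with MongoDB's aggregation framework: $unwind emits one document per element of the answers array, and $project hoists the subfields to the top level. A sketch against the same hypothetical collection (the aggregate call is shown but not executed here; note the projected field names are flat, so the CSV fieldnames would change accordingly):

```python
# Aggregation pipeline: flatten the 'answers' array on the server side.
pipeline = [
    # One output document per element of the 'answers' array
    {'$unwind': '$answers'},
    # Hoist the subfields to the top level so each document is already flat
    {'$project': {
        'order': '$answers.order',
        'text': '$answers.text.en',
        'answerId': '$answers.answerId',
    }},
]
# cursor = client.stack_overflow.stack_039.aggregate(pipeline)
# Each document in 'cursor' would then map directly to one CSV row, e.g.
# {'_id': ObjectId(...), 'order': 0, 'text': 'Yes', 'answerId': '527d...'}
```

This trades a bit of server work for a much simpler Python loop: csv.DictWriter can write each aggregated document as-is.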
