
This takes a long time… how do I speed this dictionary up? (Python)

    meta_map = {}
    results = db.meta.find({'corpus_id':id, 'method':method}) #this Mongo query only takes 3ms
    print results.explain()
    #result is mongo queryset of 2000 documents

    count = 0
    for r in results:
        count += 1
        print count
        word = r.get('word')
        data = r.get('data',{})
        if not meta_map.has_key(word):
            meta_map[word] = data
    return meta_map

This is super, super slow for some reason.

There are a total of 2000 results. Below is an example of a result document (from Mongo). All other results are similar in length.

{ "word" : "articl", "data" : { "help" : 0.42454812322341984, "show" : 0.24099054286865948, "lack" : 0.2368313038407821, "steve" : 0.20491936823259457, "gb" : 0.18757527934987422, "feedback" : 0.2855335862138559, "categori" : 0.28210549642632016, "itun" : 0.23615623082085788, "articl" : 0.21378509220044106, "black" : 0.22720575131038662, "hidden" : 0.26172127252557625, "holiday" : 0.27662433827306804, "applic" : 0.1802411089325281, "digit" : 0.20491936823259457, "sourc" : 0.21909218369809863, "march" : 0.2632736571995878, "ceo" : 0.2153108869289692, "donat" : 1, "volum" : 0.2572042432755638, "octob" : 0.2802470156773559, "toolbox" : 0.2153108869289692, "discuss" : 0.26973295489368615, "list" : 0.3698592948408095, "upload" : 0.1802411089325281, "random" : 1, "default" : 0.33044754314072383, "februari" : 0.2899936154686609, "januari" : 0.25228424754983525, "septemb" : 0.1802411089325281, "page" : 0.24675067183234803, "view" : 0.20019523259334138, "pleas" : 0.2839965947961194, "mdi" : 0.2731217555354, "unsourc" : 0.2709524603813144, "direct" : 0.18757527934987422, "dead" : 0.22720575131038662, "smartphon" : 0.2839965947961194, "jump" : 0.3004203939398161, "see" : 0.33044754314072383, "design" : 0.2839965947961194, "download" : 0.19574598998663462, "home" : 0.3004203939398161, "event" : 0.651573574681647, "wikipedia" : 0.21909218369809863, "content" : 0.2471475889083912, "version" : 0.42454812322341984, "gener" : 0.3004203939398161, "refer" : 0.2188507485718582, "navig" : 0.27662433827306804, "june" : 0.2153108869289692, "screen" : 0.27662433827306804, "free" : 0.22720575131038662, "job" : 0.19574598998663462, "key" : 0.3004203939398161, "addit" : 0.22484486630589545, "search" : 0.2878804276884952, "current" : 0.5071530767683105, "worldwid" : 0.20491936823259457, "iphon" : 0.2230524329516571, "action" : 0.24099054286865948, "chang" : 0.18757527934987422, "summari" : 0.33044754314072383, "origin" : 0.2572042432755638, "softwar" : 0.651573574681647, "point" : 0.27662433827306804, "extern" : 0.22190187748860113, "mobil" : 0.2514880028687207, "cloud" : 0.18757527934987422, "use" : 0.2731217555354, "log" : 0.27662433827306804, "commun" : 0.33044754314072383, "interact" : 0.5071530767683105, "devic" : 0.3004203939398161, "long" : 0.2839965947961194, "avail" : 0.19574598998663462, "appl" : 0.24099054286865948, "disambigu" : 0.3195885490528538, "statement" : 0.2737499468972353, "namespac" : 0.3004203939398161, "season" : 0.3004203939398161, "juli" : 0.27243508666247285, "relat" : 0.19574598998663462, "phone" : 0.26973295489368615, "link" : 0.2178125232318433, "line" : 0.42454812322341984, "pilot" : 0.27243508666247285, "account" : 0.2572042432755638, "main" : 0.34870313981256423, "provid" : 0.2153108869289692, "histori" : 0.2714135089366041, "vagu" : 0.24875213214603717, "featur" : 0.24099054286865948, "creat" : 0.26645207330844684, "ipod" : 0.2230524329516571, "player" : 0.20491936823259457, "io" : 0.2447908314834019, "need" : 0.2580912994161046, "develop" : 0.27662433827306804, "began" : 0.24099054286865948, "client" : 0.19574598998663462, "also" : 0.42454812322341984, "cleanup" : 0.24875213214603717, "split" : 0.26973295489368615, "tool" : 0.2878804276884952, "product" : 0.42454812322341984, "may" : 0.2676701118192027, "assist" : 0.1802411089325281, "variant" : 0.2514880028687207, "portal" : 0.3004203939398161, "user" : 0.20491936823259457, "consid" : 0.27662433827306804, "date" : 0.2731217555354, "recent" : 0.24099054286865948, "read" : 0.2572042432755638, "reliabl" : 0.2388872270166464, "sale" : 
0.22720575131038662, "ambigu" : 0.23482106920048526, "person" : 0.260801274024785, "contact" : 0.24099054286865948, "encyclopedia" : 0.2153108869289692, "time" : 0.2368313038407821, "model" : 0.24099054286865948, "audio" : 0.19574598998663462 }}

The whole process takes about 15 seconds... what the hell? How can I speed it up? :)

Edit: I realize that when I print the count in the console, it goes from 0 to 101 very fast, then freezes for 10 seconds, and then continues from 102 to 2000.

Could this be a MongoDB problem?

Edit 2: I printed the Mongo explain() of the query below:

{u'allPlans': [{u'cursor': u'BtreeCursor corpus_id_1_method_1_word_1',
                u'indexBounds': {u'corpus_id': [[u'iphone', u'iphone']],
                                 u'method': [[u'advanced', u'advanced']],
                                 u'word': [[{u'$minElement': 1},
                                            {u'$maxElement': 1}]]}}],
 u'cursor': u'BtreeCursor corpus_id_1_method_1_word_1',
 u'indexBounds': {u'corpus_id': [[u'iphone', u'iphone']],
                  u'method': [[u'advanced', u'advanced']],
                  u'word': [[{u'$minElement': 1}, {u'$maxElement': 1}]]},
 u'indexOnly': False,
 u'isMultiKey': False,
 u'millis': 3,
 u'n': 2443,
 u'nChunkSkips': 0,
 u'nYields': 0,
 u'nscanned': 2443,
 u'nscannedObjects': 2443,
 u'oldPlan': {u'cursor': u'BtreeCursor corpus_id_1_method_1_word_1',
              u'indexBounds': {u'corpus_id': [[u'iphone', u'iphone']],
                               u'method': [[u'advanced', u'advanced']],
                               u'word': [[{u'$minElement': 1},
                                          {u'$maxElement': 1}]]}}}

These are the stats for the mongo collection:

> db.meta.stats();
{
    "ns" : "inception.meta",
    "count" : 2450,
    "size" : 3001068,
    "avgObjSize" : 1224.9257142857143,
    "storageSize" : 18520320,
    "numExtents" : 6,
    "nindexes" : 2,
    "lastExtentSize" : 13893632,
    "paddingFactor" : 1.009999999999931,
    "flags" : 1,
    "totalIndexSize" : 368640,
    "indexSizes" : {
        "_id_" : 114688,
        "corpus_id_1_method_1_word_1" : 253952
    },
    "ok" : 1
}


> db.meta.getIndexes();
[
    {
        "name" : "_id_",
        "ns" : "inception.meta",
        "key" : {
            "_id" : 1
        },
        "v" : 0
    },
    {
        "ns" : "inception.meta",
        "name" : "corpus_id_1_method_1_word_1",
        "key" : {
            "corpus_id" : 1,
            "method" : 1,
            "word" : 1
        },
        "v" : 0
    }
]

Instead of

    if not meta_map.has_key(word):

you should use

    if word not in meta_map:

There is no point in doing data = r.get('data', {}) if you are not going to use it.

It's not obvious why you are doing word = r.get('word') ... if 'word' always exists in r, you should just use word = r['word']; otherwise you should test whether word is None after the get.

Likewise for the data get.
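If some documents could be missing those fields, a defensive version of the loop might look like the sketch below (only needed if the fields are not guaranteed; otherwise the simpler form under "Try this" is preferable):

    meta_map = {}
    for r in results:
        word = r.get('word')
        if word is None:
            continue  # skip documents that have no 'word' field
        if word not in meta_map:
            meta_map[word] = r.get('data', {})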

Try this:

    for r in results:
        word = r['word']
        if word not in meta_map:
            meta_map[word] = r['data']

In any case the time you quoted is enormous... there must be something else going on there. I would be very interested to see your code for doing the timing and counting the number of entries in results.
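For what it's worth, here is a minimal sketch of how the fetch and the dictionary build could be timed separately (the names db, id and method are taken from the question; forcing the cursor through list() is purely for measurement):

    import time

    t0 = time.time()
    docs = list(db.meta.find({'corpus_id': id, 'method': method}))  # pulls every matching document from Mongo
    t1 = time.time()

    meta_map = {}
    for r in docs:
        word = r['word']
        if word not in meta_map:
            meta_map[word] = r['data']
    t2 = time.time()

    print('fetch: %.3fs  dict build: %.3fs' % (t1 - t0, t2 - t1))

If most of the 15 seconds lands in the fetch, the dictionary code is not the culprit.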

Your query is returning almost all the documents in your collection (which may or may not be correct in this case; good database advice is always to transmit as few documents/rows as possible from the server to your application), and your collection is about 3 megabytes in size. It's possible that the delay you are seeing is simply due to the network transmission time.
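If only some fields were needed, a projection would reduce what travels over the wire. Below is a sketch assuming PyMongo accepts a field-selection document as the second argument to find(); in this particular case word and data make up almost the whole document, so the saving would be modest:

    results = db.meta.find(
        {'corpus_id': id, 'method': method},
        {'word': 1, 'data': 1, '_id': 0})  # ship only the fields the loop actually uses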

If your problem really is the dictionary, maybe using setdefault() instead of first looking the key up and then setting it can help.
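A minimal sketch of that setdefault() variant (same names as above); setdefault() only stores the value when the key is not already present, so the explicit membership test disappears:

    meta_map = {}
    for r in results:
        # first value wins, just like the original 'if word not in meta_map' check
        meta_map.setdefault(r['word'], r['data'])

Note that r['data'] is still looked up on every iteration, so the gain over the explicit test is likely to be small.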
