
Optimization: Dumping JSON from a Streaming API to Mongo

Background: I have a python module set up to grab JSON objects from a streaming API and store them (bulk insert of 25 at a time) in MongoDB using pymongo. For comparison, I also have a bash command to curl from the same streaming API and pipe it to mongoimport. Both these approaches store data in separate collections.

Periodically, I monitor the count() of the collections to check how they fare.
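A minimal sketch of that kind of comparison, assuming the old pymongo Connection API used in the module below; raw_tweets_mongoimport is a hypothetical name for the collection that curl | mongoimport writes to, and "localhost"/"test_db" stand in for DB_HOST/DB_NAME:

import pymongo

# Compare document counts of the two collections to see how far behind
# the pymongo-based writer is relative to mongoimport.
db = pymongo.Connection("localhost")["test_db"]
py_count = db["raw_tweets_gnip"].count()
mi_count = db["raw_tweets_mongoimport"].count()
print "pymongo: %d  mongoimport: %d  lag: %d" % (py_count, mi_count, mi_count - py_count)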

So far, I see the python module lagging about 1000 JSON objects behind the curl | mongoimport approach.

Problem: How can I optimize my python module to stay roughly in sync with the curl | mongoimport approach?

I cannot use tweetstream since I am not using the Twitter API but a 3rd party streaming service.

Could someone please help me out here?

Python module:


import json
import cStringIO
import pycurl
import pymongo

# DB_HOST, DB_NAME, STREAM_URL and AUTH are assumed to be defined elsewhere
# (connection settings and credentials for the streaming service).

class StreamReader:
    def __init__(self):
        try:
            self.buff = ""
            self.tweet = ""
            self.chunk_count = 0
            self.tweet_list = []
            self.string_buffer = cStringIO.StringIO()
            self.mongo = pymongo.Connection(DB_HOST)
            self.db = self.mongo[DB_NAME]
            self.raw_tweets = self.db["raw_tweets_gnip"]
            self.conn = pycurl.Curl()
            self.conn.setopt(pycurl.ENCODING, 'gzip')
            self.conn.setopt(pycurl.URL, STREAM_URL)
            self.conn.setopt(pycurl.USERPWD, AUTH)
            self.conn.setopt(pycurl.WRITEFUNCTION, self.handle_data)
            self.conn.perform()
        except Exception as ex:
            print "error ocurred : %s" % str(ex)

    def handle_data(self, data):
        try:
            self.string_buffer = cStringIO.StringIO(data)
            for line in self.string_buffer:
                try:
                    self.tweet = json.loads(line)
                except Exception as json_ex:
                    print "JSON Exception occurred: %s" % str(json_ex)
                    continue

                if self.tweet:
                    try:
                        self.tweet_list.append(self.tweet)
                        self.chunk_count += 1
                        if self.chunk_count % 1000 == 0:
                            self.raw_tweets.insert(self.tweet_list)
                            self.chunk_count = 0
                            self.tweet_list = []

                    except Exception as insert_ex:
                        print "Error inserting tweet: %s" % str(insert_ex)
                        continue
        except Exception as ex:
            print "Exception occurred: %s" % str(ex)
            print repr(self.buff)

    def __del__(self):
        self.string_buffer.close()

Thanks for reading.

Originally there was a bug in your code.

                if self.chunk_count % 50 == 0:
                    self.raw_tweets.insert(self.tweet_list)
                    self.chunk_count = 0

You reset chunk_count but you don't reset tweet_list. So the second time through, you try to insert 100 items (50 new ones plus the 50 that were already sent to the DB the time before). You've fixed this, but you still see a difference in performance.
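For reference, a minimal sketch of the corrected batching logic, resetting the buffered list along with the counter after each insert:

                if self.chunk_count % 50 == 0:
                    self.raw_tweets.insert(self.tweet_list)
                    self.chunk_count = 0
                    self.tweet_list = []  # reset the batch, not just the counter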

The whole batch size thing turns out to be a red herring. I tried loading a large file of JSON via Python vs. loading it via mongoimport, and Python was always faster (even in safe mode - see below).

Taking a closer look at your code, I realized the problem is that the streaming API is actually handing you data in chunks. You are expected to just take those chunks and put them into the database (that's what mongoimport is doing). The extra work your Python is doing to split up the stream, add it to a list, and then periodically send batches to Mongo is probably the difference between what I see and what you see.

Try this snippet for your handle_data()

def handle_data(self, data):
    try:
        # assumes: from StringIO import StringIO
        string_buffer = StringIO(data)
        tweets = json.load(string_buffer)
    except Exception as ex:
        print "Exception occurred: %s" % str(ex)
    try:
        self.raw_tweets.insert(tweets)
    except Exception as ex:
        print "Exception occurred: %s" % str(ex)

One thing to note is that your python inserts are not running in "safe mode" - you should change that by adding the argument safe=True to your insert statement. You will then get an exception on any insert that fails, and your try/except will print the error, exposing the problem.
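As a sketch, with the old pymongo API used elsewhere in this answer (where insert() accepts safe=True), the insert in the snippet above would become:

    try:
        # safe=True makes the driver wait for the server's acknowledgement,
        # so a failed insert raises an exception instead of failing silently
        self.raw_tweets.insert(tweets, safe=True)
    except Exception as ex:
        print "Exception occurred: %s" % str(ex)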

It doesn't cost much in performance either - I'm currently running a test, and after about five minutes the sizes of the two collections are 14120 and 14113.

Got rid of the StringIO library. Since the WRITEFUNCTION callback handle_data is, in this case, invoked for every line, I just load the JSON directly. Sometimes, however, there can be two JSON objects contained in data. I am sorry, I can't post the curl command that I use as it contains our credentials. But, as I said, this is a general issue applicable to any streaming API.


def handle_data(self, buf):
    try:
        # common case: the chunk is exactly one JSON object
        self.tweet = json.loads(buf)
    except Exception as json_ex:
        # occasionally a chunk carries more than one object separated by \r\n;
        # parse each non-empty piece individually
        self.data_list = buf.split('\r\n')
        for data in self.data_list:
            if data.strip():
                self.tweet_list.append(json.loads(data))
