
Slow django database operations on large (ish) dataset

I set up a system to filter the Twitter real-time stream sample. Obviously, the database writes are too slow to keep up with anything more complex than a couple of low-volume keywords. I implemented django-rq as a simple queuing system to push the tweets into a Redis-based queue as they come in, and that works great. My issue is on the other side. The context to this question is that I have a system running right now with 1.5m tweets stored for analysis and another 375,000 queued in Redis. At current performance, it will take me ~3 days to catch up if I turn off the streams, which I don't want to do. If I keep the streams running, it'll take about a month, on my latest estimate.

The database now has a couple of million rows across two main tables, and the writes are very slow. The optimal number of rq workers seems to be four, and that averages out at 1.6 queue tasks per second (the code being enqueued is below). I thought the issue might be that a new DB connection was opened for every queue task, so I set CONN_MAX_AGE to 60, but that hasn't improved anything.
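For reference, CONN_MAX_AGE just sits in the DATABASES setting; roughly this, where the engine and database name are placeholders rather than my actual config:

# settings.py (sketch) - keep DB connections open instead of reconnecting per task
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',  # placeholder engine
        'NAME': 'harvester',                                  # placeholder name
        'CONN_MAX_AGE': 60,  # reuse connections for up to 60 seconds
    }
}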

Having just tested this on localhost, I got in excess of 13 writes/second on a 2011 MacBook with Chrome etc. running, but that database only has a few thousand rows, which leads me to believe the slowness is size-related. There are a couple of get_or_create calls I'm using (see below) which could be slowing things down, but I can't see any other way of doing it: I need to check whether the user exists, and I need to check whether the tweet already exists. (I suspect I could move the latter to a try/except, on the basis that tweets coming in from the live stream shouldn't already exist, for obvious reasons.) Would I get much of a performance gain out of that? As this is still running, I'm keen to optimise the code a bit and get some faster/more efficient workers in there so I can catch up! Would running a pre-vetting worker to batch things up help (i.e. so I can batch-create users that don't exist, or something similar)?
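To illustrate the pre-vetting idea, something like this is what I have in mind - a rough sketch only, where the batching itself and the trimmed field list are assumptions (the real User model has many more fields):

# Sketch of a pre-vetting batch step: create every missing user for a batch of
# decoded tweets up front, instead of one get_or_create round-trip per tweet.
from harvester.models import User

def precreate_users(batch):
    ids = set(t["user"]["id_str"] for t in batch)
    existing = set(User.objects.filter(id_str__in=ids).values_list("id_str", flat=True))
    missing = {}
    for t in batch:
        u = t["user"]
        if u["id_str"] not in existing:
            missing.setdefault(u["id_str"], u)  # de-dupe within the batch
    # one bulk INSERT rather than N individual saves
    User.objects.bulk_create([
        User(id_str=u["id_str"], screen_name=u["screen_name"], name=u["name"])
        for u in missing.values()
    ])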

I"m running a 4 Core/8Gb Ram droplet on digital ocean, so feel this is some pretty terrible performance, and presumably code related. Where am I going wrong here? 我正在数字海洋上运行4核/ 8Gb Ram液滴,因此觉得这是一个非常糟糕的性能,大概与代码有关。我在哪里出错了?
(I've posted this here rather than code-review, as I think this is relevant to the Q&A format for SO, as I'm trying to solve a specific code problem, rather than 'how can I do this generally better?') (我将其发布在这里而不是进行代码审查,因为我认为这与SO的问与答格式有关,因为我正在尝试解决特定的代码问题,而不是“我通常如何做得更好?” )

Note: I'm working in Django 1.6, as this is code that I've had floating around for a while and wasn't confident about upgrading at the time. It's not public-facing, so unless there's a compelling reason right now (like this performance issue), I wasn't going to upgrade it (for this project).

Stream Listener:

import json

import tweepy
import django_rq
from django.db import DataError

# read_both is defined below; import it from wherever it lives in your project.

class StdOutListener(tweepy.StreamListener):
    def on_data(self, data):
        # Twitter returns data in JSON format - we need to decode it first
        decoded = json.loads(data)
        #print type(decoded), decoded
        # Also, we convert UTF-8 to ASCII ignoring all bad characters sent by users
        try:
            if decoded['lang'] == 'en':
                django_rq.enqueue(read_both, decoded)
            else:
                pass
        except KeyError, e:
            print "Error on Key", e
        except DataError, e:
            print "DataError", e
        return True

    def on_error(self, status):
        print status

Read User/Tweet/Both

def read_user(tweet):
    from harvester.models import User
    from django.core.exceptions import ObjectDoesNotExist, MultipleObjectsReturned
    #We might get weird results where the user has changed their details, so first we check the UID.
    #print "MULTIPLE USER DEBUG", tweet["user"]["id_str"]
    try:
        current_user = User.objects.get(id_str=tweet["user"]["id_str"])
        created=False
        return current_user, created
    except ObjectDoesNotExist:
        pass
    except MultipleObjectsReturned:
        current_user = User.objects.filter(id_str=tweet["user"]["id_str"])[0]
        return current_user, False
    if not tweet["user"]["follow_request_sent"]:
        tweet["user"]["follow_request_sent"] = False
    if not tweet["user"]["following"]:
        tweet["user"]["following"] = False
    if not tweet["user"]["description"]:
        tweet["user"]["description"] = " "
    if not tweet["user"]["notifications"]:
        tweet["user"]["notifications"] = False

    #If that doesn't work, then we'll use get_or_create (as a fallback rather than save())
    from dateutil.parser import parse
    if not tweet["user"]["contributors_enabled"]:
        current_user, created = User.objects.get_or_create(
            follow_request_sent=tweet["user"]["follow_request_sent"],
            _json = {},
            verified = tweet["user"]["verified"],
            followers_count = tweet["user"]["followers_count"],
            profile_image_url_https = tweet["user"]["profile_image_url_https"],
            id_str = tweet["user"]["id_str"],
            listed_count = tweet["user"]["listed_count"],
            utc_offset = tweet["user"]["utc_offset"],
            statuses_count = tweet["user"]["statuses_count"],
            description = tweet["user"]["description"],
            friends_count = tweet["user"]["friends_count"],
            location = tweet["user"]["location"],
            profile_image_url= tweet["user"]["profile_image_url"],
            following = tweet["user"]["following"],
            geo_enabled = tweet["user"]["geo_enabled"],
            profile_background_image_url =tweet["user"]["profile_background_image_url"],
            screen_name = tweet["user"]["screen_name"],
            lang =  tweet["user"]["lang"],
            profile_background_tile = tweet["user"]["profile_background_tile"],
            favourites_count = tweet["user"]["favourites_count"],
            name = tweet["user"]["name"],
            notifications = tweet["user"]["notifications"],
            url = tweet["user"]["url"],
            created_at = parse(tweet["user"]["created_at"]),
            contributors_enabled = False,
            time_zone = tweet["user"]["time_zone"],
            protected = tweet["user"]["protected"],
            default_profile = tweet["user"]["default_profile"],
            is_translator = tweet["user"]["is_translator"]
        )
    else:
        current_user, created = User.objects.get_or_create(
            follow_request_sent=tweet["user"]["follow_request_sent"],
            _json = {},
            verified = tweet["user"]["verified"],
            followers_count = tweet["user"]["followers_count"],
            profile_image_url_https = tweet["user"]["profile_image_url_https"],
            id_str = tweet["user"]["id_str"],
            listed_count = tweet["user"]["listed_count"],
            utc_offset = tweet["user"]["utc_offset"],
            statuses_count = tweet["user"]["statuses_count"],
            description = tweet["user"]["description"],
            friends_count = tweet["user"]["friends_count"],
            location = tweet["user"]["location"],
            profile_image_url= tweet["user"]["profile_image_url"],
            following = tweet["user"]["following"],
            geo_enabled = tweet["user"]["geo_enabled"],
            profile_background_image_url =tweet["user"]["profile_background_image_url"],
            screen_name = tweet["user"]["screen_name"],
            lang =  tweet["user"]["lang"],
            profile_background_tile = tweet["user"]["profile_background_tile"],
            favourites_count = tweet["user"]["favourites_count"],
            name = tweet["user"]["name"],
            notifications = tweet["user"]["notifications"],
            url = tweet["user"]["url"],
            created_at = parse(tweet["user"]["created_at"]),
            contributors_enabled = tweet["user"]["contributors_enabled"],
            time_zone = tweet["user"]["time_zone"],
            protected = tweet["user"]["protected"],
            default_profile = tweet["user"]["default_profile"],
            is_translator = tweet["user"]["is_translator"]
        )
    #print "CURRENT USER:""], type(current_user)"], current_user
    #current_user"], created = User.objects.get_or_create(current_user)
    return current_user, created

def read_tweet(tweet, current_user):
    import logging
    logger = logging.getLogger('django')
    from datetime import date, datetime
    #print "Inside read_Tweet"
    from harvester.models import Tweet
    from django.core.exceptions import ObjectDoesNotExist, MultipleObjectsReturned
    from django.db import DataError
    #We might get weird results where the user has changed their details, so first we check the UID.
    #print tweet_data["created_at"]
    from dateutil.parser import parse
    tweet["created_at"] = parse(tweet["created_at"])
    try:
        #print "trying tweet_data["id"
        current_tweet = Tweet.objects.get(id_str=tweet["id_str"])
        created = False
        return current_tweet, created
    except ObjectDoesNotExist:
        pass
    except MultipleObjectsReturned:
        current_tweet = Tweet.objects.filter(id_str=tweet["id_str"])[0]
        return current_tweet, False
    try:
        current_tweet, created = Tweet.objects.get_or_create(
        truncated=tweet["truncated"],
        text=tweet["text"],
        favorite_count=tweet["favorite_count"],
        author = current_user,
        _json = {},
        source=tweet["source"],
        retweeted=tweet["retweeted"],
        coordinates = tweet["coordinates"],
        entities = tweet["entities"],
        in_reply_to_screen_name = tweet["in_reply_to_screen_name"],
        id_str = tweet["id_str"],
        retweet_count = tweet["retweet_count"],
        favorited = tweet["favorited"],
        user = tweet["user"],
        geo = tweet["geo"],
        in_reply_to_user_id_str = tweet["in_reply_to_user_id_str"],
        lang = tweet["lang"],
        created_at = tweet["created_at"],
        place = tweet["place"])
        print "DEBUG", current_user, current_tweet
        return current_tweet, created
    except DataError, e:
        #Catchall to pick up non-parsed tweets
        print "DEBUG ERROR", e, tweet
        return None, False

def read_both(tweet):
    current_user, created = read_user(tweet)
    current_tweet, created = read_tweet(tweet, current_user)

I eventually managed to cobble together an answer from some redditors and a couple of other things.

Fundamentally, I was doing a double lookup on the id_str field, which wasn't indexed. I added db_index=True to that field on the models used by read_tweet and read_user, and moved read_tweet to a try/except Tweet.objects.create approach, falling back to get_or_create if there's a problem. That gave a 50-60x speed improvement, and the workers are now scalable: if I add 10 workers, I get 10x the speed.
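In code terms the change boils down to roughly this; it's a sketch rather than the exact diff, the max_length is illustrative, and fields stands in for the keyword arguments already assembled in read_tweet:

# On both the Tweet and User models, the lookup field now carries an index:
#     id_str = models.CharField(max_length=20, db_index=True)

# read_tweet (sketch): insert first, only fall back to get_or_create on failure.
from django.db import IntegrityError, DataError
from harvester.models import Tweet

def save_tweet(fields, current_user):
    try:
        return Tweet.objects.create(author=current_user, **fields), True
    except (IntegrityError, DataError):
        return Tweet.objects.get_or_create(author=current_user, **fields)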

I currently have one worker happily processing 6 or so tweets a second. Next up, I'll add a monitoring daemon to check the queue size and add extra workers if it's still growing.
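Roughly what I have in mind for the monitor - the queue name, threshold and polling interval are all made up here:

# Sketch of the monitoring daemon: poll the queue depth and start another
# rqworker process whenever the backlog is large and still growing.
import subprocess
import time

import django_rq

def watch_queue(threshold=5000, poll_seconds=60):
    previous = None
    while True:
        depth = django_rq.get_queue('default').count
        if previous is not None and depth > previous and depth > threshold:
            # django-rq ships a management command for running extra workers
            subprocess.Popen(['python', 'manage.py', 'rqworker', 'default'])
        previous = depth
        time.sleep(poll_seconds)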

tl;dr - REMEMBER INDEXING!
