
Searching big files using a list in Python - How can I improve the speed?

I have a folder with 300+ .txt files with a total size of 15GB+. These files contain tweets, one tweet per line. I have a list of keywords I'd like to search the tweets for. I have created a script that searches each line of every file for every item on my list. If the tweet contains the keyword, it writes the line into another file. This is my code:

import os
import re

# keywords (the list of search terms), db (the open output file), file_path and
# filename are defined earlier in the script

# Search each file for every item in keywords
print("Searching the files of " + filename + " for the appropriate keywords...")
for file in os.listdir(file_path):
    f = open(file_path + file, 'r')
    for line in f:
        for key in keywords:
            if re.search(key, line, re.IGNORECASE):
                db.write(line)

This is the format of each line:

{"created_at":"Wed Feb 03 06:53:42 +0000 2016","id":694775753754316801,"id_str":"694775753754316801","text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF","source":"\u003ca href=\"http:\/\/www.facebook.com\/twitter\" rel=\"nofollow\"\u003eFacebook\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":5981342,"id_str":"5981342","name":"Lava Kafle","screen_name":"lkafle","location":"Kathmandu, Nepal","url":"http:\/\/about.me\/lavakafle","description":"@deerwalkinc 24000+ tweeps bigdata  #Team #Genomics  http:\/\/deerwalk.com #Genetic #Testing #population #health #management #BigData #Analytics #java #hadoop","protected":false,"verified":false,"followers_count":24742,"friends_count":23169,"listed_count":1481,"favourites_count":147252,"statuses_count":171880,"created_at":"Sat May 12 04:49:14 +0000 2007","utc_offset":20700,"time_zone":"Kathmandu","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_link_color":"088253","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/5981342\/1416802075","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/MqmDwbCDAF","expanded_url":"http:\/\/fb.me\/Yj1JW9bJ","display_url":"fb.me\/Yj1JW9bJ","indices":[45,68]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1454482422661"}

The script works, but it takes a lot of time. For ~40 keywords it needs more than 2 hours. Obviously my code is not optimized. What can I do to improve the speed?

P.S. I have read some relevant questions regarding searching and speed, but I suspect that the problem in my script lies in the fact that I'm using a list for the keywords. I've tried some of the suggested solutions, but to no avail.

1) External library

If you're willing to lean on external libraries (and the time to execute is more important than the one-off cost of installing them), you might be able to gain some speed by loading each file into a simple Pandas DataFrame and performing the keyword search as a vectorised operation. To get the matching tweets, you would do something like:

import pandas as pd

# Each line is a JSON object, so read the file as JSON Lines rather than CSV
dataframe_from_text = pd.read_json("/path/to/file.txt", lines=True)
# Vectorised, case-insensitive search over the tweet text column
matched_tweets_index = dataframe_from_text["text"].str.contains("keyword_a|keyword_b", case=False)
matched_tweets = dataframe_from_text[matched_tweets_index]  # Uses the boolean mask above to filter the full dataframe
# You'd then have a mini dataframe of matching tweets in `matched_tweets`.
# You could loop through these to save them out to a file using the `.to_dict(orient="records")` format.

DataFrame operations within Pandas can be really quick, so this might be worth investigating.

2) Group your regex

It looks like you're not logging which keyword you matched against. If that's true, you could group your keywords into a single regex query, like so:

keywords_combined = "|".join(keywords)  # Build the combined pattern once, outside the loop
for line in f:
    if re.search(keywords_combined, line, re.IGNORECASE):
        db.write(line)

I've not tested this, but by reducing the number of loops per line it could trim some time off.

Why it's slow

You are regex-searching through a JSON dump, which is not always a good idea. For example, if your keywords include words like user, time, profile and image, every line will result in a match, because the JSON format for tweets has all of these terms as dictionary keys.

Besides, the raw JSON is huge: each tweet is more than 1 KB in size (this one is 2.1 KB), but the only part that's relevant in your sample is:

"text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF",

And this is less than 100 bytes; a typical tweet is still less than 140 characters despite recent changes to the API.
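
To make that concrete, here is a minimal sketch (assuming the same keywords, f and db as in the question) that parses each line as JSON and matches only against the tweet text instead of the whole dump:

import json
import re

# Combine and pre-compile the keywords once; re.escape guards against regex metacharacters
pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

for line in f:
    try:
        tweet = json.loads(line)
    except ValueError:
        continue  # skip malformed lines
    # Search only the ~100-byte text field instead of the ~2 KB JSON dump
    if pattern.search(tweet.get("text", "")):
        db.write(line)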

Things to try:

Pre-compile the regex, as suggested by Padraic Cunningham.

Option 1. Load this data into a PostgreSQL JSONB field. JSONB fields are indexable and can be searched very quickly.

Option 2. Load this into any old database, with the content of the text field in its own column so that this column can be searched easily.

Option 3. Last but not least, extract just the text field into its own file. You can have a CSV file where the first column is the screen name and the second is the text of the tweet. Your 15GB will be shrunk to about 1GB. A sketch of this one-off extraction is shown below.
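
As an illustration of option 3, a minimal one-off extraction might look like the following sketch, assuming the input files sit in file_path as in the question and writing to a hypothetical tweets_text.csv:

import csv
import json
import os

# One-off pass: pull out only screen_name and text from every tweet into a much smaller CSV
with open("tweets_text.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for name in os.listdir(file_path):
        with open(os.path.join(file_path, name), "r", encoding="utf-8") as f:
            for line in f:
                try:
                    tweet = json.loads(line)
                except ValueError:
                    continue  # skip malformed lines
                writer.writerow([tweet["user"]["screen_name"], tweet["text"]])

Subsequent keyword searches can then run against the ~1GB CSV instead of the full 15GB of raw JSON.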

In short, what you are doing now is searching the whole farm for the needle, when you only need to search the haystack.
