How do I optimize regex search in Python?
################################################################
#
# Reddit sentiment script for comments in /r/wallstreetbets
# Loads comments from db, scans comments using word list,
# creates sentiment score and writes results to database
#
################################################################
import sqlite3
import re
from datetime import datetime
# Load key search words from text file db
def read_words_from_file(ftext_file):
    with open(ftext_file, 'r') as f:
        # for item in f:
        #     load_words.append(item.lower().strip())
        load_words = tuple([x.lower().strip() for x in f])
    return load_words
# Search key words against comments
# Gives +/- for positive or negative sentiment words
def word_search(db_list, fpos_flist, fneg_flist):
    pos_count = 0
    neg_count = 0
    fdb_results_list = []
    total_lines_scanned = 0
    print("Starting word search...")
    # 1st for loop is comment data
    # 2nd for loop is key words
    for comment in db_list:
        total_lines_scanned = total_lines_scanned + 1
        for pos_item in fpos_flist:
            matches = re.findall(r"\b" + pos_item + r"\b", comment[0])
            pos_count = pos_count + len(matches)
        for neg_item in fneg_flist:
            matches = re.findall(r"\b" + neg_item + r"\b", comment[0])
            neg_count = neg_count + len(matches)
        # Determine pos/neg sentiment score based on frequency in comment
        if pos_count > neg_count:
            pos_count = pos_count / (pos_count + neg_count)
            neg_count = 0
        elif pos_count < neg_count:
            neg_count = neg_count / (pos_count + neg_count)
            pos_count = 0
        else:
            pos_count = 0
            neg_count = 0
        if pos_count > 0 or neg_count > 0:
            fdb_results_list.append([pos_count, neg_count, comment[1]])
        if total_lines_scanned % 100000 == 0:
            print("Lines counted so far:", total_lines_scanned)
        # Reset counters for the next comment
        pos_count = 0
        neg_count = 0
    print("Word search complete.")
    return fdb_results_list
# Write results to new DB. Deletes old table first.
# pos = item[0], neg = item[1], timestamp = item[2]
def write_db_results(write_db_list):
    print("Writing results to database...")
    conn = sqlite3.connect('testdb.sqlite', timeout=30)
    cur = conn.cursor()
    cur.executescript('DROP TABLE IF EXISTS redditresultstable')
    cur.executescript('''
        CREATE TABLE redditresultstable (
            id INTEGER NOT NULL PRIMARY KEY UNIQUE,
            pos_count INTEGER,
            neg_count INTEGER,
            timestamp TEXT
        );
    ''')
    for item in write_db_list:
        cur.execute('''INSERT INTO redditresultstable (pos_count, neg_count, timestamp)
                       VALUES (?, ?, ?)''', (item[0], item[1], item[2]))
    conn.commit()
    conn.close()
    print("Writing results to database complete.")
# Load comments (item[2]) and timestamps (item[4]) from db
def load_db_comments():
    print("Loading database...")
    conn = sqlite3.connect('redditusertrack.sqlite')
    cur = conn.cursor()
    cur.execute('SELECT * FROM redditcomments')
    row_db = cur.fetchall()
    conn.close()
    print("Loading complete.")
    db_list = tuple([(item[2].lower(), item[4]) for item in row_db])
    return db_list
# Main Program Starts Here
print(datetime.now())
db_list = load_db_comments()
pos_word_list = read_words_from_file("simple_positive_words.txt")
neg_word_list = read_words_from_file("simple_negative_words.txt")
db_results_list = word_search(db_list, pos_word_list, neg_word_list)
db_results_list = tuple(db_results_list)
write_db_results(db_results_list)
print(datetime.now())
This script loads 1.3 million comments into memory from SQLite and then scans 147 keywords against each comment to calculate a sentiment score, roughly 191 million iterations in total. Execution takes 5 minutes and 32 seconds.

I changed most of the variables from lists to tuples and used list comprehensions instead of for loops for appending. This improved execution by about 5% compared to the version that only used lists and for-loop appends. The 5% could be within the margin of error, since my method of measuring may not be accurate.

Stack Overflow and other resources seemed to suggest that tuples are faster for this type of iteration, even though some posters provided evidence that in some situations lists are faster.

Is this code optimized correctly for using tuples and list comprehensions?
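Since the 5% figure may be within measurement noise, one way to compare list and tuple iteration in isolation is a timeit micro-benchmark. The data below is made up for illustration; the difference between the two containers is typically tiny either way:

```python
import timeit

# Made-up data standing in for the loaded comment rows
data_list = list(range(1000))
data_tuple = tuple(data_list)

# timeit averages over many repetitions, which is less error-prone than
# differencing two datetime.now() calls around a single run
t_list = timeit.timeit(lambda: sum(x for x in data_list), number=2000)
t_tuple = timeit.timeit(lambda: sum(x for x in data_tuple), number=2000)

print(f"iterate list:  {t_list:.3f}s")
print(f"iterate tuple: {t_tuple:.3f}s")
```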
edit: Thank you all for the suggestions/comments; lots of work to do. I implemented @YuriyP's suggestion and the runtime went from 5+ minutes to 26 seconds. The issue was with the regex for-loop search function. Updated code is in the attached image; I removed the red crossed-out code and replaced it with the green code.
Use a regular expression to get the total positive and negative word count from each comment: instead of making O(N+M) regex calls per comment (one for each positive and negative word), you make a single findall call with a combined pattern.

pos_word_list = read_words_from_file("simple_positive_words.txt")
db_list = load_db_comments()
posWordsEx = "|".join(pos_word_list)
pos_words = 0
for comment in db_list:
    words = re.findall(posWordsEx, comment[0])
    pos_words += len(words)
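A sketch of hardening the combined pattern: escaping each word, keeping the original \b word boundaries, and compiling the pattern once before the loop. The word list and comments below are placeholders, not the real files:

```python
import re

# Hypothetical word list and comment rows standing in for the script's data;
# each row is (comment_text, timestamp) as produced by load_db_comments()
pos_word_list = ("moon", "rocket", "gain")
comments = (("to the moon, rocket soon", 1612137600),
            ("no gain today", 1612224000))

# Escape each word, join with |, wrap in \b boundaries, compile once
pos_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in pos_word_list) + r")\b"
)

pos_words = 0
for comment in comments:
    pos_words += len(pos_pattern.findall(comment[0]))

print(pos_words)  # "moon", "rocket", "gain" -> 3
```

re.escape matters if any keyword contains regex metacharacters, and the \b boundaries prevent "gain" from matching inside "against".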
You should probably use cProfile to profile your code for better measurements. And yes, tuples are a tiny bit faster than lists, because a list needs two blocks of memory while a tuple needs one. List comprehensions are also a tiny bit faster, but only if you have very simple expressions in your loops.
for total_lines_scanned, comment in enumerate(db_list, start=1):
tuple((item[2].lower(), item[4]) for item in row_db) -> you don't need a list comprehension here.
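For illustration, here is the difference between the two forms on some made-up rows shaped like the script's SELECT * results:

```python
# Hypothetical rows shaped like (id, user, comment_text, score, timestamp)
row_db = [
    (1, "u1", "BUY THE DIP", 10, 1612137600),
    (2, "u2", "Paper Hands", -2, 1612224000),
]

# List comprehension: builds an intermediate list, then copies it into a tuple
via_list = tuple([(item[2].lower(), item[4]) for item in row_db])

# Generator expression: feeds tuple() directly, no intermediate list
via_gen = tuple((item[2].lower(), item[4]) for item in row_db)

print(via_gen)  # (('buy the dip', 1612137600), ('paper hands', 1612224000))
```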
There are a couple of things you can do in your code, like using enumerate instead of a bare for loop, and replacing your comprehensions with simple generators. That said, I think profiling this code and trying to make such simple logic faster might not offer much advantage. Also, instead of using datetime, use something like a profile decorator that latches onto your function and gives you a report of your calls.
import cProfile, pstats, io

def profile(fnc):
    """A decorator that uses cProfile to profile a function"""
    def inner(*args, **kwargs):
        pr = cProfile.Profile()
        pr.enable()
        retval = fnc(*args, **kwargs)
        pr.disable()
        s = io.StringIO()
        sortby = 'cumulative'
        ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
        ps.print_stats()
        print(s.getvalue())
        return retval
    return inner
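A self-contained usage sketch of the decorator above; heavy_work here is just a stand-in for something like word_search:

```python
import cProfile
import io
import pstats

def profile(fnc):
    """Decorator that profiles fnc with cProfile and prints a stats report."""
    def inner(*args, **kwargs):
        pr = cProfile.Profile()
        pr.enable()
        retval = fnc(*args, **kwargs)
        pr.disable()
        s = io.StringIO()
        ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
        ps.print_stats()
        print(s.getvalue())  # per-call timing report, sorted by cumulative time
        return retval
    return inner

# Hypothetical workload standing in for the real search function
@profile
def heavy_work(n):
    return sum(i * i for i in range(n))

result = heavy_work(100000)
```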
Good luck!