
How do I optimize regex search in Python?

################################################################
#
# Reddit sentiment script for comments in /r/wallstreetbets
# Loads comments from db, scans comments using word list,
# creates sentiment score and writes results to database
#
################################################################

import sqlite3
import re
from datetime import datetime


# Load key search words from text file db
def read_words_from_file(ftext_file):
    # The context manager closes the file once the words are read
    with open(ftext_file, 'r') as f:
        # for item in f:
        #    load_words.append(item.lower().strip())
        load_words = tuple([x.lower().strip() for x in f])
    return load_words


# Search key words against comments
# Gives +/- for positive or negative sentiment words
def word_search(db_list, fpos_flist, fneg_flist):
    pos_count = 0
    neg_count = 0
    fdb_results_list = []
    total_lines_scanned = 0
    print("Starting word search...")

    # 1st for loop is comment data
    # 2nd for loop is key words
    for comment in db_list:
        total_lines_scanned = total_lines_scanned + 1
        for pos_item in fpos_flist:
            word_search = re.findall(r"\b" + pos_item + r"\b", comment[0])
            pos_count = pos_count + len(word_search)

        for neg_item in fneg_flist:
            word_search = re.findall(r"\b" + neg_item + r"\b", comment[0])
            neg_count = neg_count + len(word_search)

        # Determine pos/neg sentiment score based on frequency in comment
        if pos_count > neg_count:
            pos_count = pos_count / (pos_count+neg_count)
            neg_count = 0
        elif pos_count < neg_count:
            neg_count = neg_count / (pos_count+neg_count)
            pos_count = 0
        elif pos_count == neg_count:
            pos_count = 0
            neg_count = 0

        if pos_count > 0 or neg_count > 0:
            fdb_results_list.append([pos_count, neg_count, comment[1]])
        if total_lines_scanned % 100000 == 0:
            print("Lines counted so far:", total_lines_scanned)
        pos_count = 0
        neg_count = 0

    print("Word search complete.")
    return(fdb_results_list)


# Write results to new DB. Drops the old results table first.
# pos = item[0], neg = item[1], timestamp = item[2]
def write_db_results(write_db_list):
    print("Writing results to database...")
    conn = sqlite3.connect('testdb.sqlite', timeout=30)
    cur = conn.cursor()

    cur.executescript('''DROP TABLE IF EXISTS redditresultstable
    ''')

    cur.executescript('''
    CREATE TABLE redditresultstable (
        id INTEGER NOT NULL PRIMARY KEY UNIQUE,
        pos_count INTEGER,
        neg_count INTEGER,
        timestamp TEXT
    );
    ''')
    for item in write_db_list:
        cur.execute('''INSERT INTO redditresultstable (pos_count, neg_count, timestamp)
                    VALUES (?, ?, ?)''', (item[0], item[1], item[2]))

    conn.commit()
    conn.close()
    print("Writing results to database complete.")


# Load comments item[2] and timestamp item[4] from db
def load_db_comments():
    print("Loading database...")
    conn = sqlite3.connect('redditusertrack.sqlite')
    cur = conn.cursor()
    cur.execute('SELECT * FROM redditcomments')
    row_db = cur.fetchall()
    conn.close()
    print("Loading complete.")
    db_list = ()

    db_list = tuple([(item[2].lower(), item[4]) for item in row_db])
    return db_list


# Main Program Starts Here
print(datetime.now())

db_list = load_db_comments()

pos_word_list = read_words_from_file("simple_positive_words.txt")
neg_word_list = read_words_from_file("simple_negative_words.txt")

db_results_list = word_search(db_list, pos_word_list, neg_word_list)
db_results_list = tuple(db_results_list)

write_db_results(db_results_list)

print(datetime.now())

This script loads 1.3 million comments into memory from SQLite and then scans 147 keywords against each comment to calculate a sentiment score. That is roughly 191 million iterations.

Execution takes 5 minutes and 32 seconds.

I changed most of the variables to tuples (from lists) and used list comprehensions instead of for loops (for appending). This improved execution time by about 5% compared to the version that used only lists and for loops to append. The 5% could be within the margin of error, since my method of measuring may not be accurate.

Stack Overflow and other resources seemed to suggest that tuples are faster for this type of iteration, even though some posters provided evidence that lists are faster in some situations.

Is this code optimized correctly for using tuples and list comprehensions?

Edit: thank you all for the suggestions and comments. Lots of work to do. I implemented @YuriyP's suggestion and the runtime went from 5+ minutes to 26 seconds. The issue was with the regex for-loop search function.

The updated code is in the attached image. I removed the red crossed-out code and replaced it with the green code.

[Image: modified code]

Use one regular expression to get the total positive and negative word counts from each comment. Instead of making O(N+M) regex calls per comment (one for each positive and negative word), you make O(1) calls.

Example:


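    # Build one alternation from all positive words and compile it once.
    # re.escape and the \b anchors keep the question's whole-word matching.
    pos_word_list = read_words_from_file("simple_positive_words.txt")
    db_list = load_db_comments()
    posWordsEx = re.compile(r"\b(?:" + "|".join(map(re.escape, pos_word_list)) + r")\b")
    pos_words = 0
    for comment in db_list:
        # Each item is a (text, timestamp) tuple, so search the text at index 0
        words = posWordsEx.findall(comment[0])
        pos_words += len(words)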
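
Extending the same idea to both word lists and to the question's scoring rule might look like the sketch below. The helper names `build_pattern` and `score_comment` and the sample words and comments are hypothetical, not part of the original script. The `re.escape` call and the `\b` anchors are assumptions made to preserve the whole-word matching of the original loop.

```python
import re

def build_pattern(words):
    """Compile one whole-word alternation for a collection of keywords."""
    # Skip empty strings: a blank line in the word file would otherwise match everywhere.
    escaped = [re.escape(w) for w in words if w]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b")

def score_comment(text, pos_re, neg_re):
    """Return (pos_score, neg_score) using the question's scoring rule."""
    pos = len(pos_re.findall(text))
    neg = len(neg_re.findall(text))
    if pos > neg:
        return pos / (pos + neg), 0
    if neg > pos:
        return 0, neg / (pos + neg)
    return 0, 0

# Hypothetical sample data standing in for the word files and the database rows.
pos_re = build_pattern(("moon", "buy", "calls"))
neg_re = build_pattern(("crash", "sell", "puts"))

comments = (("buy calls now, no crash coming", "2021-01-01"),
            ("sell everything", "2021-01-02"))
results = [score_comment(text, pos_re, neg_re) + (ts,) for text, ts in comments]
print(results)
```

Each comment now costs two `findall` calls regardless of how many keywords are in the lists, which is where the speedup would be expected to come from.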

You should probably use cProfile to profile your code for better measurements. And yes, tuples are a tiny bit faster than lists, because a list needs two blocks of memory and a tuple needs one. List comprehensions are also a tiny bit faster, but only if you have very simple expressions in your loops.
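
One way to check the tuple-versus-list claim on your own machine is a small `timeit` comparison, sketched below. The sample data and the `iterate` helper are hypothetical; the numbers will vary by Python version and hardware, so treat the output as a measurement to interpret, not a fixed result.

```python
import sys
import timeit

# Hypothetical sample data; the real sequences would be the keyword lists.
words_list = ["word%d" % i for i in range(147)]
words_tuple = tuple(words_list)

def iterate(seq):
    """Walk the sequence once, as the inner keyword loop does."""
    total = 0
    for item in seq:
        total += len(item)
    return total

# timeit runs each callable many times, which smooths out one-off noise.
list_time = timeit.timeit(lambda: iterate(words_list), number=2000)
tuple_time = timeit.timeit(lambda: iterate(words_tuple), number=2000)

print("list :", list_time, "s, container size", sys.getsizeof(words_list), "bytes")
print("tuple:", tuple_time, "s, container size", sys.getsizeof(words_tuple), "bytes")
```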

for total_lines_scanned, comment in enumerate(db_list, start=1):
    ...  # loop body as before, minus the manual counter

tuple((item[2].lower(), item[4]) for item in row_db)  # you don't need a list comprehension here

There are a couple of things you can do in your code, such as using enumerate instead of a bare for loop with a manual counter, and replacing your comprehensions with simple generator expressions.

I think profiling your code and trying to make it faster might not offer much advantage for such simple logic. Also, instead of using datetime, use something like a profile decorator that hooks into your function and gives you a report of its calls.
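
A runnable version of both suggestions is sketched below. The rows are hypothetical and only mimic the shape used in the question, with the comment text at index 2 and the timestamp at index 4.

```python
# Hypothetical rows shaped like the question's table:
# comment text at index 2, timestamp at index 4.
row_db = [
    (1, "user_a", "Buy CALLS", 10, "2021-01-01"),
    (2, "user_b", "Sell PUTS", 7, "2021-01-02"),
]

# A generator expression feeds tuple() directly, so no intermediate list is built.
db_list = tuple((item[2].lower(), item[4]) for item in row_db)

# enumerate() supplies the running count that the manual counter kept.
for total_lines_scanned, comment in enumerate(db_list, start=1):
    if total_lines_scanned % 100000 == 0:
        print("Lines counted so far:", total_lines_scanned)
```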



import cProfile, pstats, io


def profile(fnc):
    """A decorator that uses cProfile to profile a function"""

    def inner(*args, **kwargs):
        pr = cProfile.Profile()
        pr.enable()
        retval = fnc(*args, **kwargs)
        pr.disable()
        s = io.StringIO()
        sortby = 'cumulative'
        ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
        ps.print_stats()
        print(s.getvalue())
        return retval

    return inner
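
Applying the decorator could look like the sketch below. It repeats a condensed copy of the decorator so the snippet runs on its own. `count_hits` and its sample data are hypothetical stand-ins for `word_search`, and `functools.wraps` and the `print_stats(10)` row limit are additions, not part of the answer's code.

```python
import cProfile
import functools
import io
import pstats
import re

def profile(fnc):
    """Condensed copy of the cProfile decorator so this snippet runs standalone."""
    @functools.wraps(fnc)  # keeps the wrapped function's name and docstring
    def inner(*args, **kwargs):
        pr = cProfile.Profile()
        pr.enable()
        retval = fnc(*args, **kwargs)
        pr.disable()
        s = io.StringIO()
        # Sort by cumulative time and print only the first 10 rows of the report.
        pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(10)
        print(s.getvalue())
        return retval
    return inner

@profile
def count_hits(comments, pattern):
    """Hypothetical stand-in for word_search: count keyword hits across comments."""
    return sum(len(pattern.findall(text)) for text in comments)

hits = count_hits(["buy calls", "sell puts", "buy buy"],
                  re.compile(r"\b(?:buy|calls)\b"))
```

The printed report lists each function called inside `count_hits` with its call count and cumulative time, which shows where the time goes rather than only how long the whole run took.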

Good luck!
