
How do I optimize regex search in Python?

################################################################
#
# Reddit sentiment script for comments in /r/wallstreetbets
# Loads comments from db, scans comments using word list,
# creates sentiment score and writes results to database
#
################################################################

import sqlite3
import re
from datetime import datetime


# Load key search words from text file db
def read_words_from_file(ftext_file):
    with open(ftext_file, 'r') as f:
        # for item in f:
        #    load_words.append(item.lower().strip())
        load_words = tuple([x.lower().strip() for x in f])
    return load_words


# Search key words against comments
# Gives +/- for positive or negative sentiment words
def word_search(db_list, fpos_flist, fneg_flist):
    pos_count = 0
    neg_count = 0
    fdb_results_list = []
    total_lines_scanned = 0
    print("Starting word search...")

    # 1st for loop is comment data
    # 2nd for loop is key words
    for comment in db_list:
        total_lines_scanned = total_lines_scanned + 1
        for pos_item in fpos_flist:
            word_search = re.findall(r"\b" + pos_item + r"\b", comment[0])
            pos_count = pos_count + len(word_search)

        for neg_item in fneg_flist:
            word_search = re.findall(r"\b" + neg_item + r"\b", comment[0])
            neg_count = neg_count + len(word_search)

        # Determine pos/neg sentiment score based on frequency in comment
        if pos_count > neg_count:
            pos_count = pos_count / (pos_count+neg_count)
            neg_count = 0
        elif pos_count < neg_count:
            neg_count = neg_count / (pos_count+neg_count)
            pos_count = 0
        elif pos_count == neg_count:
            pos_count = 0
            neg_count = 0

        if pos_count > 0 or neg_count > 0:
            fdb_results_list.append([pos_count, neg_count, comment[1]])
        if total_lines_scanned % 100000 == 0:
            print("Lines counted so far:", total_lines_scanned)
        pos_count = 0
        neg_count = 0

    print("Word search complete.")
    return fdb_results_list


# Write results to new DB table. Drops the old table first.
# pos = item[0], neg = item[1], timestamp = item[2]
def write_db_results(write_db_list):
    print("Writing results to database...")
    conn = sqlite3.connect('testdb.sqlite', timeout=30)
    cur = conn.cursor()

    cur.executescript('''DROP TABLE IF EXISTS redditresultstable
    ''')

    cur.executescript('''
    CREATE TABLE redditresultstable (
        id INTEGER NOT NULL PRIMARY KEY UNIQUE,
        pos_count INTEGER,
        neg_count INTEGER,
        timestamp TEXT
    );
    ''')
    for item in write_db_list:
        cur.execute('''INSERT INTO redditresultstable (pos_count, neg_count, timestamp)
                    VALUES (?, ?, ?)''', (item[0], item[1], item[2]))

    conn.commit()
    conn.close()
    print("Writing results to database complete.")


# Load comments item[2] and timestamp item[4] from db
def load_db_comments():
    print("Loading database...")
    conn = sqlite3.connect('redditusertrack.sqlite')
    cur = conn.cursor()
    cur.execute('SELECT * FROM redditcomments')
    row_db = cur.fetchall()
    conn.close()
    print("Loading complete.")
    db_list = tuple([(item[2].lower(), item[4]) for item in row_db])
    return db_list


# Main Program Starts Here
print(datetime.now())

db_list = load_db_comments()

pos_word_list = read_words_from_file("simple_positive_words.txt")
neg_word_list = read_words_from_file("simple_negative_words.txt")

db_results_list = word_search(db_list, pos_word_list, neg_word_list)
db_results_list = tuple(db_results_list)

write_db_results(db_results_list)

print(datetime.now())

This script loads 1.3 million comments into memory from SQLite and then scans 147 keywords against each comment to calculate a sentiment score: roughly 191 million iterations.

Execution takes 5 minutes and 32 seconds.

I changed most of the variables from lists to tuples and used list comprehensions instead of for loops for appending. This improved execution by about 5% compared with the version that only used lists and for-loop appends, though 5% could be within the margin of error since my method of measuring may not be accurate.
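For reference, the standard library's timeit module gives a steadier comparison of the two append styles than wall-clock timestamps; here is a minimal sketch using made-up sample data rather than the real word list:

import timeit

words = ["buy", "sell", "moon", "puts", "calls"] * 1000   # made-up sample data

def with_loop():
    out = []
    for w in words:
        out.append(w.lower().strip())
    return out

def with_comprehension():
    return [w.lower().strip() for w in words]

print("loop append:       ", timeit.timeit(with_loop, number=200))
print("list comprehension:", timeit.timeit(with_comprehension, number=200))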

Stack Overflow and other resources seemed to suggest that tuples are faster for this kind of iteration, even though some posters provided evidence that lists are faster in some situations.

Is this code optimized correctly for using tuples and list comprehension?

Edit: thank you all for the suggestions and comments; lots of work to do. I implemented @YuriyP's suggestion and the runtime went from 5+ minutes to 26 seconds. The issue was the regex for-loop search function.

The updated code is in the image attached. I removed the red crossed-out code and replaced it with the green.

[Image: updated code]

Use a single regular expression to get the total positive and negative word counts from each comment, instead of running a separate search for every positive and negative word. That takes you from O(N+M) regex scans per comment (N positive words, M negative words) down to O(1).

Example:


    pos_word_list = read_words_from_file("simple_positive_words.txt")
    db_list = load_db_comments()
    posWordsEx = "|".join(pos_word_list)   # one alternation pattern for all positive words
    pos_words = 0
    for comment in db_list:
        words = re.findall(posWordsEx, comment[0])  # comment[0] holds the comment text
        pos_words += len(words)
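
For context, here is a minimal sketch of how the original word_search could look with this change applied. This is my own adaptation rather than the poster's code, and it adds re.escape and \b word boundaries so the single pattern keeps the whole-word matching of the original loops:

import re

def word_search(db_list, fpos_flist, fneg_flist):
    # Build one alternation per word list; re.escape and \b preserve whole-word matching
    pos_re = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in fpos_flist) + r")\b")
    neg_re = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in fneg_flist) + r")\b")
    fdb_results_list = []

    for comment in db_list:
        pos_count = len(pos_re.findall(comment[0]))
        neg_count = len(neg_re.findall(comment[0]))

        # Same scoring rule as the original: keep only the dominant side as a fraction
        if pos_count > neg_count:
            pos_count, neg_count = pos_count / (pos_count + neg_count), 0
        elif neg_count > pos_count:
            pos_count, neg_count = 0, neg_count / (pos_count + neg_count)
        else:
            pos_count = neg_count = 0

        if pos_count > 0 or neg_count > 0:
            fdb_results_list.append([pos_count, neg_count, comment[1]])

    return fdb_results_list

Compiling the two patterns once outside the loop means each comment is scanned twice instead of 147 times, which is where the speed-up comes from.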

You should probably use cProfile to profile your code for better measurements. And yes, tuples are a tiny bit faster than lists because a list needs two blocks of memory while a tuple needs one. List comprehensions are also a tiny bit faster, but only if the expressions inside your loops are very simple.
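If you want to see that size difference for yourself, sys.getsizeof on a list versus the equivalent tuple illustrates it (a quick sketch, not part of the original script):

import sys

items = list(range(1000))            # sample data for illustration
print(sys.getsizeof(items))          # list: header plus a separately allocated, over-allocated pointer array
print(sys.getsizeof(tuple(items)))   # tuple: a single fixed-size block, slightly smaller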

There are a couple of other things you can do in your code, like using enumerate instead of the manually incremented counter in the bare for loop, and replacing your list comprehensions with plain generator expressions where you don't need the intermediate list:

    for total_lines_scanned, comment in enumerate(db_list, start=1):

    tuple((item[2].lower(), item[4]) for item in row_db)  # you don't need a list comprehension here

That said, profiling and micro-optimizing such simple logic might not buy you much. Also, instead of timing with datetime, use something like a profiling decorator that latches onto your function and gives you a report of its calls.



import cProfile, pstats, io


def profile(fnc):
    """A decorator that uses cProfile to profile a function"""

    def inner(*args, **kwargs):
        pr = cProfile.Profile()
        pr.enable()
        retval = fnc(*args, **kwargs)
        pr.disable()
        s = io.StringIO()
        sortby = 'cumulative'
        ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
        ps.print_stats()
        print(s.getvalue())
        return retval

    return inner
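
Hypothetical usage, applying it to the function from the script above:

# Decorate the slow function and call it as usual; a cProfile report
# is printed to stdout when the call returns.
@profile
def word_search(db_list, fpos_flist, fneg_flist):
    ...  # existing body unchanged

db_results_list = word_search(db_list, pos_word_list, neg_word_list)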

Good luck!
