
Is there a faster way to preprocess a huge amount of text data in Python?

I'm building a sentiment analysis algorithm to predict the score of IMDb reviews. I wanted to do it from scratch, so I scraped half a million reviews and created my own data set.

I'm sending small review packages (each consisting of 50 reviews) to review_cleaner with a multiprocessing pool. That reduced the run time for 1,000 reviews from 40 minutes to 11 minutes. But I have half a million reviews, so I need a faster way to process them. I was wondering whether it's possible to run it on my GPU (GTX 1060 6GB). I installed CUDA, but I couldn't find out how to run a specific function (review_cleaner) on the GPU cores.

Basically, what I need is a way to run the preprocessing faster. I searched and tried many different things but couldn't get it to work. Is there any way to run it faster?

import re
import time
import sqlite3
from multiprocessing import Pool

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Assumed stand-in: REPLACE_NO_SPACE is used below but its pattern was not shown.
REPLACE_NO_SPACE = re.compile(r"[.;:!\'?,\"()\[\]]")


def create_connection(db_file):
    # Assumed stand-in: a thin wrapper that was not shown.
    return sqlite3.connect(db_file)


def filling_the_database(review_data):
    # Insert one batch of (review, score) rows using the module-level cursor/connection.
    try:
        c.executemany("""INSERT INTO preprocessed_reviews(review, review_score) VALUES (?, ?)""", review_data)
        conn.commit()
    except sqlite3.Error as e:
        print(e)


def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)


def review_cleaner(review):
    lemmatizer = WordNetLemmatizer()
    # Build the stopword set once per batch instead of calling
    # stopwords.words('english') for every single token.
    stop_words = set(stopwords.words('english'))
    bulk_data = []
    for each in review:
        review_temp = ''.join([i for i in each[0] if not i.isdigit()])
        review_temp = REPLACE_NO_SPACE.sub(" ", review_temp.lower())
        review_temp = nltk.word_tokenize(review_temp)
        review_temp = (lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in review_temp)
        review_temp = ' '.join([word for word in review_temp if word not in stop_words])
        bulk_data.append((review_temp, each[1]))
    filling_the_database(bulk_data)


if __name__ == "__main__":
    start_time = time.time()
    bulk_data = []
    amount_of_reviews = 0
    previous_amount = 0
    conn = create_connection('2020-04-11')
    c = conn.cursor()
    # INTEGER PRIMARY KEY lets SQLite assign the row IDs automatically.
    c.execute("""CREATE TABLE IF NOT EXISTS preprocessed_reviews(review TEXT, review_score INTEGER, ID INTEGER PRIMARY KEY)""")
    conn.commit()
    total_number_of_reviews = c.execute("""SELECT COUNT(*) FROM movie_reviews""").fetchone()[0]
    while amount_of_reviews < total_number_of_reviews:
        review_package = []
        amount_of_reviews += 50
        # Note: if the IDs in movie_reviews start at 1, this window skips the
        # very last ID; shift both bounds up by one in that case.
        data = c.execute("""SELECT * FROM movie_reviews WHERE ID BETWEEN (?) AND (?)""", (previous_amount, amount_of_reviews - 1))
        previous_amount = amount_of_reviews
        for each in data:
            review_package.append((each[0], each[1]))
        bulk_data.append(review_package)
        print(amount_of_reviews)
    p = Pool(4)
    p.map(review_cleaner, bulk_data)
    p.close()
    p.join()
    print('---- %s seconds ----' % (time.time() - start_time))
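
One hot spot in review_cleaner above is worth calling out: get_wordnet_pos runs nltk.pos_tag on one word at a time, so the tagger is invoked once per token. Tagging the whole tokenized review in a single call is usually much faster and also gives context-aware tags. Below is a minimal sketch of that idea only; clean_one_review, tag_map and stop_words are illustrative names, not part of the code above, and the rest of the pipeline (digit stripping, REPLACE_NO_SPACE, the database insert) would stay as it is.

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
# Same POS mapping as get_wordnet_pos, keyed by the first letter of the Penn tag.
tag_map = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}


def clean_one_review(text):
    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)  # one tagger call per review instead of one per word
    lemmas = (lemmatizer.lemmatize(tok, tag_map.get(tag[0], wordnet.NOUN))
              for tok, tag in tagged)
    return ' '.join(w for w in lemmas if w not in stop_words)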

I'm storing around half a million (400k) reviews in an SQLite database: one column for the review and one column for the review's score. In another table, I'm inserting the preprocessed reviews the same way, one column for the review and one column for the score. I have 16 GB of RAM, an Intel i7 6700HQ, an SSD, and a GTX 1060 6GB.

A few thoughts crossed my mind.

  1. Reading and writing from SQLite may have quite a lot of overhead, when in reality you could fit 500k reviews in your 16 GB of RAM. You could do this by dumping your data to a tabulated CSV file and then reading it in with pandas to do the preprocessing. You could also use pandarallel to parallelise the work instead of using a pool, to make your life easier (see the sketch after this list).

  2. If SQLite is not the bottleneck, then it's likely a computational bottleneck, in which case I would look at running the process overnight, or hiring a cloud-compute instance with good CPU resources. A 16-core machine wouldn't be too expensive to rent on AWS for a short amount of time, and that would give you a theoretical 4x speedup.
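
A minimal sketch of the pandas route from point 1, assuming the movie_reviews columns are named review and review_score (adjust to your schema); clean_text is a placeholder for the per-review cleaning, and pandarallel is a separate package (pip install pandarallel). Reading straight from SQLite with read_sql_query works just as well as going through a CSV:

import sqlite3
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()  # one worker per available core by default


def clean_text(text):
    # Placeholder: swap in the real per-review cleaning
    # (digit stripping, punctuation removal, tokenise, lemmatise, drop stopwords).
    return text.lower()


conn = sqlite3.connect('2020-04-11')
# Load everything into RAM at once; 400-500k short texts fit comfortably in 16 GB.
df = pd.read_sql_query("SELECT review, review_score FROM movie_reviews", conn)

# pandarallel splits the Series across worker processes, so there is no need
# to build 50-review packages by hand.
df['review'] = df['review'].parallel_apply(clean_text)

df.to_sql('preprocessed_reviews', conn, if_exists='append', index=False)

If the real clean_text is heavy (POS tagging, lemmatisation), the work is still CPU-bound, so the speedup scales roughly with the number of cores.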
