
How to optimize preprocessing of all text documents without using a for loop to preprocess a single text document in each iteration?

I want to optimize the code below so that it can efficiently preprocess 3000 text documents; the preprocessed data will then be fed to a TFIDF Vectorizer and linkage() for clustering.

So far, I have read the Excel file using pandas and saved the dataframe into a list variable. Then I iterated over the list, splitting each text element into tokens and filtering the stopwords out of it. The filtered element is stored in another variable, and that variable is appended to a list. So at the end, I have a list of processed text elements.

I think optimization is possible where the list is created, where the stopwords are filtered out, and where the data is saved into two different variables: documents_no_stopwords and processed_words.

It would be great if someone could help me with this or suggest a direction to follow.

import pandas
from nltk.tokenize import RegexpTokenizer  # assumed; the question does not show which tokenizer is used
from nltk.corpus import stopwords

tokenizer = RegexpTokenizer(r'\w+')      # assumed tokenizer
stop_words = stopwords.words('english')  # assumed; kept as a plain list, which the answer below improves on
documents_no_stopwords = []


def preprocessing(word):

    tokens = tokenizer.tokenize(word)

    processed_words = []
    for w in tokens:
        if w in stop_words:
            continue
        else:
            ## keep only the tokens that are not stopwords
            processed_words.append(w)
    ## append the filtered document, rejoined into a string, to the global list
    documents_no_stopwords.append(' '.join(processed_words))
    processed_words = []


temp = 0
df = pandas.read_excel('File.xlsx')

for text in df['text'].tolist():
    temp = temp + 1
    preprocessing(text)
    print(temp)

You need to first make a set of stop words and use a list comprehension to filter the tokens.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def preprocessing(txt):
    tokens = word_tokenize(txt)
    # build the stopword set; hoisting this outside the function would avoid rebuilding it on every call
    stop_words = set(stopwords.words("english"))
    tokens = [i for i in tokens if i not in stop_words]

    return " ".join(tokens)

string = "Hey this is Sam. How are you?"
print(preprocessing(string))

Output:

Hey Sam . How ?

And rather than using a for loop, use df.apply as below:

df['text'] = df['text'].apply(preprocessing)
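Since the question mentions feeding the result to a TFIDF Vectorizer and linkage() for clustering, the downstream step can follow the apply() call directly. Here is a minimal sketch, assuming scikit-learn's TfidfVectorizer and scipy's linkage; parameter choices such as method='ward' are illustrative, not from the question:

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage

# vectorize the already-preprocessed text column
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(df['text'])

# linkage() expects a dense observation matrix; ~3000 documents is small enough to densify
Z = linkage(tfidf.toarray(), method='ward')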

Why sets are preferred over lists

There are duplicate entries in stopwords.words(). If you compare len(stopwords.words()) and len(set(stopwords.words())), the length of the set is smaller by a few hundred. More importantly, a membership test with in is O(1) on average for a set but O(n) for a list, which is what the timings below show. That's why a set is preferred here.
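A quick way to verify the duplicate claim yourself (assuming the NLTK stopwords corpus has been downloaded):

from nltk.corpus import stopwords

# with no argument, words() concatenates the stopword lists of every bundled language
all_words = stopwords.words()
print(len(all_words))       # raw count, including duplicates
print(len(set(all_words)))  # unique count; smaller once duplicates collapse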

Here's the difference in performance between using a list and a set:

x = stopwords.words('english')       # plain list
y = set(stopwords.words('english'))  # set

%timeit filtered = [i for i in tokens if i not in x]
# 10000 loops, best of 3: 120 µs per loop

%timeit filtered = [i for i in tokens if i not in y]
# 1000000 loops, best of 3: 1.16 µs per loop

And furthermore, a list comprehension is faster than a plain for loop, as sketched below.
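A minimal sketch of that comparison using the timeit module; the token list is made up for demonstration, and absolute timings will vary by machine:

import timeit
from nltk.corpus import stopwords

tokens = "hey this is sam how are you today".split() * 100
stop_words = set(stopwords.words("english"))

def with_for_loop():
    filtered = []
    for t in tokens:
        if t not in stop_words:
            filtered.append(t)
    return filtered

def with_comprehension():
    return [t for t in tokens if t not in stop_words]

print(timeit.timeit(with_for_loop, number=1000))
print(timeit.timeit(with_comprehension, number=1000))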
