
Create a Corpus Containing the Vocabulary of Words

I am computing the inverse document frequency for all the words in my dictionary of documents, and I have to show the top 5 documents ranked by score for each query. But I am stuck in slow loops while creating the corpus containing the vocabulary of words in the documents. Please help me improve my code. This block of code reads my files and removes punctuation and stop words:

from string import punctuation
# english_stopwords is assumed to be defined elsewhere, e.g. NLTK's English stop-word list

def wordList(doc):
    """
    1. Remove punctuation
    2. Remove stop words
    3. Return a list of words
    """
    file = open("C:\\Users\\Zed\\PycharmProjects\\ACL txt\\" + doc, 'r', encoding="utf8", errors='ignore')
    text = file.read().strip()
    file.close()
    nopunc = [char for char in text if char not in punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in english_stopwords]
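As a sanity check, the same punctuation and stop-word filtering can be run on an in-memory string. The function name `word_list_from_text` and the tiny stop-word set below are illustrative stand-ins for the file-reading version and NLTK's `english_stopwords`:

```python
from string import punctuation

english_stopwords = {"the", "is", "a", "of"}  # tiny stand-in for NLTK's stop-word list

def word_list_from_text(text):
    # Drop punctuation characters, then keep only non-stop-words.
    nopunc = ''.join(ch for ch in text if ch not in punctuation)
    return [w for w in nopunc.split() if w.lower() not in english_stopwords]

print(word_list_from_text("The corpus is a list of words."))
# → ['corpus', 'list', 'words']
```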

This block of code stores all the file names in my folder:

from pathlib import Path

file_names = []
for file in Path("ACL txt").rglob("*.txt"):
    file_names.append(file.name)

This block of code creates the dictionary of documents I am working with:

documents = {}
for i in file_names:
    documents[i] = wordList(i)

The code above works well and runs fast, but the following block takes a very long time to create the corpus. How can I improve it?

# create a corpus containing the vocabulary of words in the documents
corpus = []  # a list that will store the words of the vocabulary
for doc in documents.values():  # iterate through the documents
    for word in doc:  # go through each word in the current doc
        if word not in corpus:
            corpus.append(word)  # add the word to the corpus if not already present

This code creates a dictionary that stores the document frequency of each word in the corpus:

df_corpus = {}  # document frequency for every word in the corpus
for word in corpus:
    k = 0  # initial document frequency set to 0
    for doc in documents.values():  # iterate through the documents
        if word in doc:  # doc is already a list of words, so no split() is needed
            k += 1
    df_corpus[word] = k
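The nested loop above scans every document once per corpus word. A much faster sketch counts each document's distinct words in a single pass with `collections.Counter`, assuming (as here) that each document is already a list of words; the toy `documents` dictionary below is illustrative:

```python
from collections import Counter

# Toy stand-in for the real documents dictionary (filename -> list of words).
documents = {
    "a.txt": ["corpus", "words", "vocabulary", "words"],
    "b.txt": ["words", "frequency"],
}

# Deduplicate each document with set() so a word is counted at most
# once per document, then accumulate counts across all documents.
df_corpus = Counter()
for words in documents.values():
    df_corpus.update(set(words))

print(df_corpus["words"])   # → 2 (appears in both documents)
print(df_corpus["corpus"])  # → 1
```

This makes the document-frequency pass O(total words) instead of O(|corpus| × total words).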

It has been creating the corpus for two hours and is still running. Please help me improve my code. This is the dataset I am working with: https://drive.google.com/open?id=1D1GjN_JTGNBv9rPNcWJMeLB_viy9pCfJ

How about making corpus a set instead of a list? You won't need the extra membership check either.

corpus = set()  # a set that will store the words of the vocabulary
for doc in documents.values():  # iterate through the documents
    corpus.update(doc)  # the set deduplicates words automatically
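To tie this back to the original goal of ranking the top documents per query, here is a minimal end-to-end sketch of IDF-based scoring over toy documents; the file names, query, and `score` function are all illustrative assumptions, not part of the original code:

```python
import math

# Toy stand-in for the real documents dictionary (filename -> list of words).
documents = {
    "a.txt": ["neural", "networks", "learn"],
    "b.txt": ["networks", "graphs"],
    "c.txt": ["graphs", "learn", "fast"],
}
N = len(documents)

# Build the vocabulary as a set, as suggested above.
corpus = set()
for words in documents.values():
    corpus.update(words)

# Document frequency: count each document at most once per word.
df = {word: sum(word in set(words) for words in documents.values())
      for word in corpus}

# Inverse document frequency.
idf = {word: math.log(N / df[word]) for word in df}

def score(query_words, doc_words):
    # Sum the IDF of every query word that appears in the document.
    doc_set = set(doc_words)
    return sum(idf[w] for w in query_words if w in doc_set)

query = ["graphs", "learn"]
ranked = sorted(documents, key=lambda d: score(query, documents[d]), reverse=True)
print(ranked[:5])  # top 5 documents for the query; "c.txt" ranks first here
```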
