
Custom word removal from a list in Python

I am writing a function to do custom word removal, stemming (getting the root form of the word) and then tf-idf.

My input data to the function is a list. If I do the custom word removal on an individual list, it works, but when I combine it in the function, I get an attribute error:

AttributeError: 'list' object has no attribute 'lower'

Here is my code:

def tfidf_kw(K):    
    # Select docs in cluster K
    docs = np.array(mydata2)[km_r3.labels_==K]

    ps= PorterStemmer()
    stem_docs = []
    for doc in docs:
        keep_tokens = []
        
        for token in doc.split(' '):
            #custom stopword removal
            my_list = ['model', 'models', 'modeling', 'modelling', 'python', 
           'train','training', 'trains', 'trained','test','testing', 'tests','tested']
            
            token  = [sub_token for sub_token in list(doc) if sub_token not in my_list]

            stem_token=ps.stem(token)
            keep_tokens.append(stem_token)

        keep_tokens =' '.join(keep_tokens)
        stem_docs.append(keep_tokens)

        return(keep_tokens)

Further code is for tf-idf, which works. This is the line where I need help understanding what I am doing wrong:

token  = [sub_token for sub_token in list(doc) if sub_token not in my_list]

Here is the complete error:

AttributeError  Traceback (most recent call last)
<ipython-input-154-528a540678b0> in <module>
     49     #return(sorted_df)
     50 
---> 51 tfidf_kw(0)

<ipython-input-154-528a540678b0> in tfidf_kw(K)
     20 
     21 
---> 22             stem_token=ps.stem(token)
     23             keep_tokens.append(stem_token)
     24 

~/opt/anaconda3/lib/python3.8/site-packages/nltk/stem/porter.py in stem(self, word)
    650 
    651     def stem(self, word):
--> 652         stem = word.lower()
    653 
    654         if self.mode == self.NLTK_EXTENSIONS and word in self.pool:

AttributeError: 'list' object has no attribute 'lower'

Line 51, where it says tfidf_kw(0), is where I am checking the function for K=0.

Apparently the ps.stem method expects a single word (a string) as its argument, but you are passing it a list of strings.

Since you are already inside a for token in doc.split(' ') loop, it does not make sense to me to additionally use a list comprehension [... for sub_token in list(doc) ...]. That comprehension rebinds token to a list, which is what then gets passed to ps.stem.
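To see why the comprehension produces a list, here is a small sketch that needs no NLTK. The sample doc string and the shortened my_list are made up for illustration. Note also that list(doc) splits a string into single characters rather than words, so the filter would not remove whole words even if the result were stemmed one item at a time.

```python
# Hypothetical sample data, only for illustration
doc = "training a python model"
my_list = ['model', 'python', 'training']

# list() on a string yields its individual characters, not its words
chars = list(doc)
print(chars[:8])  # ['t', 'r', 'a', 'i', 'n', 'i', 'n', 'g']

# The comprehension from the question therefore builds a list of characters
token = [sub_token for sub_token in list(doc) if sub_token not in my_list]
print(type(token).__name__)  # list

# A list has no .lower() method, which is what the stemmer calls on its input
print(hasattr(token, "lower"))  # False
```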

If your goal is to skip those tokens that are in my_list, presumably you want to write the for token in doc.split(' ') loop like this:

# Define the custom stopword list once, before the loop
my_list = ['model', 'models', 'modeling', 'modelling', 'python',
           'train', 'training', 'trains', 'trained',
           'test', 'testing', 'tests', 'tested']

for token in doc.split(' '):
    if token in my_list:
        continue  # skip custom stopwords

    stem_token = ps.stem(token)  # ps.stem receives a single string
    keep_tokens.append(stem_token)

Here, if token is one of the words in my_list, the continue statement skips the rest of the current iteration and the loop continues with the next token.
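Putting the pieces together, the whole cleaning step might look roughly like the sketch below. The function name remove_and_stem, its parameters, and the sample documents are hypothetical. The stemmer is passed in as a callable so the sketch runs without NLTK; with NLTK you would pass PorterStemmer().stem instead of the str.lower stand-in. It also assumes you want the list of all cleaned documents back; in the question's code the return appears to sit inside the for doc in docs loop, which would return after the first document only.

```python
def remove_and_stem(docs, stopwords, stem):
    """Drop stopwords from each document, stem the remaining tokens, rejoin."""
    stopwords = set(stopwords)  # set membership tests are fast
    stem_docs = []
    for doc in docs:
        keep_tokens = []
        for token in doc.split():  # split on any run of whitespace
            if token in stopwords:  # skip custom stopwords
                continue
            keep_tokens.append(stem(token))  # stem one string at a time
        stem_docs.append(' '.join(keep_tokens))
    return stem_docs  # returned only after every document is processed


my_list = ['model', 'models', 'modeling', 'modelling', 'python',
           'train', 'training', 'trains', 'trained',
           'test', 'testing', 'tests', 'tested']

# Hypothetical sample documents; str.lower is only a stand-in for a stemmer
docs = ["training a python model in spark", "we tested the models twice"]
print(remove_and_stem(docs, my_list, str.lower))
# ['a in spark', 'we the twice']
```

The membership test is case-sensitive, so a token such as "Python" would not be removed unless tokens are lowercased before the comparison.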
