Why does this list comprehension only work in df.apply?

I'm trying to remove stopwords from my data, so that it goes from this

data['text'].head(5)
Out[25]: 
0    go until jurong point, crazy.. available only ...
1                        ok lar... joking wif u oni...
2    free entry in 2 a wkly comp to win fa cup fina...
3    u dun say so early hor... u c already then say...
4    nah i don't think he goes to usf, he lives aro...
Name: text, dtype: object

to this

data['newt'].head(5)
Out[26]: 
0    [go, jurong, point,, crazy.., available, bugis...
1                 [ok, lar..., joking, wif, u, oni...]
2    [free, entry, 2, wkly, comp, win, fa, cup, fin...
3    [u, dun, say, early, hor..., u, c, already, sa...
4      [nah, think, goes, usf,, lives, around, though]
Name: newt, dtype: object

I have two options for doing this. I'm trying both separately so that nothing gets overwritten. First, I apply a function to the data column. This works and achieves what I wanted.

def process(data):
    data = data.lower()
    data = data.split()
    data = [row for row in data if row not in stopwords]
    return data

data['newt'] = data['text'].apply(process)

The second option does not use apply. It is exactly like the function, so why does it raise TypeError: unhashable type: 'list'? I checked that the if row not in stopwords part of the line is what causes this: when I delete it, the code runs, but it doesn't remove the stopwords.

data['newt'] = data['text'].str.lower()
data['newt'] = data['newt'].str.split()
data['newt'] = [row for row in data['newt'] if row not in stopwords]

Your list comprehension fails because it iterates over the column, so row is the entire list of tokens for one dataframe row, and the test checks whether that whole list is in stopwords. If stopwords is a set, the lookup has to hash the list, which raises TypeError: unhashable type: 'list'. If stopwords were a plain list, the test would never be true, and [row for row in data['newt'] if row not in stopwords] would simply produce the list of values in the original data['newt'] column.
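The difference between the two cases can be reproduced without pandas. The sketch below is only an illustration: rows and both stopword containers are made-up stand-ins, since the question does not show how stopwords was built.

```python
# Hypothetical stand-ins: each "row" is already a list of tokens,
# as it would be after .str.split().
rows = [["go", "until", "jurong"], ["ok", "lar"]]

stopwords_as_list = ["until", "ok"]        # assumed contents, for illustration only
stopwords_as_set = set(stopwords_as_list)

# With a list of stopwords, each whole row is compared to each stopword.
# A list never equals a string, so no row is filtered out.
kept = [row for row in rows if row not in stopwords_as_list]

# With a set, Python must hash the row to look it up, and lists are unhashable.
try:
    [row for row in rows if row not in stopwords_as_set]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(kept)
print(error_message)
```

In both cases the membership test is applied to the wrong object: it has to run once per word, not once per row.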

Following your logic, your last lines for stopword removal could read

data['newt'] = data['text'].str.lower()
data['newt'] = data['newt'].str.split()
data['newt'] = [[word for word in row if word not in stopwords] for row in data['newt']]

If you are OK with using apply, the last line can be replaced with

data['newt'] = data['newt'].apply(lambda row: [word for word in row if word not in stopwords])

Finally, you could also call

data['newt'].apply(lambda row: " ".join(row))

to get strings back at the end of the process.
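As a rough check of the whole sequence (lowercase, split, filter word by word, join), here is the same logic on a plain Python list instead of a dataframe column. The texts and the stopwords set are assumptions made up for this sketch, not the data from the question.

```python
# Assumed stopword set, for illustration only; the real one is not shown in the question.
stopwords = {"until", "only", "a", "to", "in"}

# Made-up sample texts standing in for data['text'].
texts = [
    "Go until jurong point",
    "Free entry in 2 a wkly comp",
]

# Same steps as on the column: lowercase, split, then filter word by word.
tokens = [
    [word for word in text.lower().split() if word not in stopwords]
    for text in texts
]

# Join the surviving tokens back into strings.
cleaned = [" ".join(row) for row in tokens]

print(tokens)
print(cleaned)
```

The inner comprehension is the part the original one-liner was missing: the membership test sees one word at a time, which is hashable.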

Note that str.split may not be the best way to tokenize. You may prefer a dedicated library such as spacy, combining the approaches described in "removing stop words using spacy" and "Add/remove custom stop words with spacy".

To convince yourself of this, try the following code:

import spacy

sent = "She said: 'beware, your sentences may contain a lot of funny chars!'"

# spacy tokenization
spacy.cli.download("en_core_web_sm")
nlp = spacy.load('en_core_web_sm')
doc = nlp(sent)
print([token.text for token in doc])

# simple split
print(sent.split())

and compare the two outputs.
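The plain-split half of that comparison can also be inspected with the standard library alone. In this sketch the stopwords set is hypothetical, chosen only to show what happens when punctuation stays attached to a word.

```python
sent = "She said: 'beware, your sentences may contain a lot of funny chars!'"

# str.split only cuts on whitespace, so punctuation stays glued to the words.
tokens = sent.split()
print(tokens)

# Hypothetical stopwords, for illustration only.
stopwords = {"beware", "a"}

# "a" is removed, but "'beware," keeps its quote and comma and so never matches.
filtered = [token for token in tokens if token.lower() not in stopwords]
print(filtered)
```

A tokenizer that separates punctuation from words avoids this kind of missed match.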
