Why doesn't if not in (x, y) work at all in Python?

Question

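For every row of my column, I want to keep a word only if it is neither a stop word nor a punctuation character.

Below is my data after tokenizing and removing stop words. I also want to remove punctuation in the same step: see row 2, where a comma follows usf. I tried if word not in (stopwords,string.punctuation), expecting it to behave like not in stopwords and not in string.punctuation (I saw this suggested elsewhere), but with it neither stop words nor punctuation get removed. How can I fix this?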

data['text'].head(5)
Out[38]: 
0    ['ve, searching, right, words, thank, breather...
1    [free, entry, 2, wkly, comp, win, fa, cup, fin...
2    [nah, n't, think, goes, usf, ,, lives, around,...
3    [even, brother, like, speak, ., treat, like, a...
4                                 [date, sunday, !, !]
Name: text, dtype: object

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

data = pd.read_csv(r"D:/python projects/read_files/SMSSpamCollection.tsv",
                    sep='\t', header=None)

data.columns = ['label','text']

stopwords = set(stopwords.words('english'))

def process(df):
    data = word_tokenize(df.lower())
    data = [word for word in data if word not in (stopwords,string.punctuation)]
    return data

data['text'] = data['text'].apply(process)

Answer 1

You need to change

data = [word for word in data if word not in (stopwords,string.punctuation)]

to

data = [word for word in data if word not in stopwords and word not in string.punctuation]
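As a minimal sketch of why the original condition filters nothing: the tuple (stopwords, string.punctuation) has exactly two elements, a set and a string, and in compares each token against those two objects rather than against their contents. The names stop_words and tokens below are illustrative stand-ins, not taken from the question, and the small hand-written set replaces NLTK's English list.

```python
import string

# Stand-in stop word set (an assumption; the question uses NLTK's English list).
stop_words = {"the", "a", "is"}
tokens = ["the", "cat", ",", "is", "here", "!"]

# The tuple holds two objects: one set and one string. A token is never
# equal to either object, so `not in` is always True and nothing is dropped.
wrong = [t for t in tokens if t not in (stop_words, string.punctuation)]

# Testing each collection separately gives the intended filter.
right = [t for t in tokens
         if t not in stop_words and t not in string.punctuation]

print(wrong)
print(right)
```

Here wrong comes back identical to tokens, while right keeps only the ordinary words.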

Answer 2

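If you still want to do this in a single if condition, you can convert string.punctuation to a set and combine it with stopwords using the set union operator (|). It looks like this: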

data = [word for word in data if word not in (stopwords|set(string.punctuation))]
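A possible variation on this, sketched under the assumption that the combined set can be built once up front instead of inside the comprehension condition. The names stop_words, excluded, and filter_tokens are hypothetical, and the small set stands in for NLTK's list.

```python
import string

# Hypothetical stand-in for the NLTK stop word set used in the question.
stop_words = {"the", "a", "is"}

# Build the union a single time; each character of string.punctuation
# becomes its own set element.
excluded = stop_words | set(string.punctuation)

def filter_tokens(tokens):
    """Keep tokens that are neither stop words nor single punctuation marks."""
    return [t for t in tokens if t not in excluded]

print(filter_tokens(["nah", "the", "usf", ",", "lives", "!"]))

# Membership in a string is a substring test, membership in a set is an
# exact match, so the two approaches can disagree on multi-character tokens.
print("<=>" in string.punctuation)  # substring of the punctuation string
print("<=>" in excluded)            # not an element of the set
```

The set version matches whole tokens only, so a multi-character token made of punctuation is removed by the string check when it happens to be a substring of string.punctuation, but is kept by the set check.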

Answer 3

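In the process function, you can convert both collections (the stopwords set and the string.punctuation string) to pandas.core.series.Series and combine them with concat.

The function would be:

def process(df):
    data = word_tokenize(df.lower())
    # `in` on a Series tests its index, not its values, so collect the
    # concatenated values into a set before checking membership
    excluded = set(pd.concat([pd.Series(list(stopwords)),
                              pd.Series(list(string.punctuation))]))
    data = [word for word in data if word not in excluded]
    return data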

Answer 1
1 2020-05-21 16:06:13

Answer 2
1 Accepted 2020-05-21 16:09:29

Answer 3
1 2020-05-21 16:27:38
