Why does if not in (x, y) not work at all in Python?

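I want to keep a word only when the word in each row of my column is neither a stopword nor a punctuation character.

This is my data after tokenizing and removing stopwords. I also want to remove punctuation at the same time as the stopwords; notice the comma after usf in row 2. I thought of if word not in (stopwords,string.punctuation), expecting it to mean not in stopwords and not in string.punctuation (I saw this form suggested elsewhere), but it removes neither stopwords nor punctuation. How can I fix this?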

data['text'].head(5)
Out[38]: 
0    ['ve, searching, right, words, thank, breather...
1    [free, entry, 2, wkly, comp, win, fa, cup, fin...
2    [nah, n't, think, goes, usf, ,, lives, around,...
3    [even, brother, like, speak, ., treat, like, a...
4                                 [date, sunday, !, !]
Name: text, dtype: object

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

data = pd.read_csv(r"D:/python projects/read_files/SMSSpamCollection.tsv",
                    sep='\t', header=None)

data.columns = ['label','text']

stopwords = set(stopwords.words('english'))

def process(df):
    data = word_tokenize(df.lower())
    data = [word for word in data if word not in (stopwords,string.punctuation)]
    return data

data['text'] = data['text'].apply(process)

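You need to change

data = [word for word in data if word not in (stopwords,string.punctuation)]

to

data = [word for word in data if word not in stopwords and word not in string.punctuation]

(stopwords, string.punctuation) is a tuple with exactly two elements: the set itself and the string itself. word not in (...) therefore only checks whether word is equal to one of those two objects, which is never the case for a token, so the condition is always true and nothing is removed.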
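The difference between the two conditions can be seen in a minimal sketch. The small stopwords set and the tokens list below are hypothetical stand-ins for NLTK's English stopword list and the tokenized text, chosen so the example runs without NLTK.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english'))
stopwords = {"the", "a", "is"}
# Hypothetical tokens, similar to one row of data['text']
tokens = ["the", "usf", ",", "lives"]

# The tuple holds two objects: a set and a str.
# No token is equal to either object, so every token passes the test.
tuple_filtered = [w for w in tokens if w not in (stopwords, string.punctuation)]

# Testing each collection separately filters as intended.
fixed = [w for w in tokens if w not in stopwords and w not in string.punctuation]

print(tuple_filtered)  # ['the', 'usf', ',', 'lives']
print(fixed)           # ['usf', 'lives']
```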

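If you still want to do this with a single membership test, you can convert string.punctuation to a set and combine it with stopwords using the set union operator (|). It would look like this: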

data = [word for word in data if word not in (stopwords|set(string.punctuation))]
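As a sketch of the union approach, the example below builds the combined set once, outside the comprehension, and compares the result with the two-test form. The stopwords set and tokens list are again hypothetical stand-ins, and the name excluded is introduced only for illustration.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english'))
stopwords = {"the", "a", "is"}
# Hypothetical tokens, similar to one row of data['text']
tokens = ["the", "date", "sunday", "!", ","]

# Build the union once so it is not recomputed for every token.
excluded = stopwords | set(string.punctuation)

with_union = [w for w in tokens if w not in excluded]
with_and = [w for w in tokens if w not in stopwords and w not in string.punctuation]

print(with_union)  # ['date', 'sunday']
print(with_and)    # ['date', 'sunday']
```

The two forms agree for single-character punctuation tokens. They can differ for multi-character tokens: word not in string.punctuation is a substring test on a str, so a token such as ./ is treated as punctuation there, while the set form only matches single punctuation characters exactly.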

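Alternatively, inside the process function you can convert both collections to pandas.core.series.Series and combine them with concat. Two details matter here: stopwords is a set, so it has to be wrapped in a Series before concat accepts it, and in applied to a Series tests the index rather than the values, so the combined Series is converted back to a set before the membership test.

The function would be:

def process(df):
    data = word_tokenize(df.lower())
    # Combine both collections with concat, then take the values as a set
    excluded = set(pd.concat([pd.Series(list(stopwords)),
                              pd.Series(list(string.punctuation))]))
    data = [word for word in data if word not in excluded]
    return data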
