
Why are some English words getting removed after using stop words or the NLTK corpus?

I am working with PySpark DataFrames and need to perform data cleaning on one of the columns, as shown below:

df.select('words').show(10, truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                               words|
+----------------------------------------------------------------------------------------------------+
|[you, are, hereby, ordered, to, cease, and, desist, all, furthe, r, emails, to, this, address, im...|
|[content, type, text, plain, charset, utf, 8, content, transfer, encoding, quoted, printable, x, ...|
|[you, are, hereby, ordered, to, cease, and, desist, all, furthe, r, emails, to, this, address, im...|
|[, original, message, return, path, bounce, 19853e, 6fb54, visyak, 3djuno, com, cysticacneonchin,...|
|[, forwarded, message, return, pat, h, bounce, 19853e, 6fb54, visyak, 3djuno, com, cysticacneonch...|
|[, original, message, from, 248, 623, 1653, mailto, lisa, lahlahsales, com, 20, sent, tuesday, fe...|
|[2018, horse, trailer, closeouts, free, delivery, cash, back, click, here, to, view, it, online, ...|
|[, original, message, from, paypal, us, mailto, scottkahndmd, nc, rr, com, sent, 27, february, 20...|
|[2col, 1, 2, 09, client, specific, styles, outlook, a, padding, 0, force, outlook, to, provide, a...|
|[you, are, hereby, ordered, to, cease, and, desist, all, furthe, r, emails, to, this, address, im...|
+----------------------------------------------------------------------------------------------------+
only showing top 10 rows

I perform the following steps for data cleaning:

from pyspark.ml.feature import StopWordsRemover
import pyspark.sql.functions as F
import nltk
from nltk.stem import WordNetLemmatizer

# 1. Remove stop words
remover = StopWordsRemover(inputCol='words', outputCol='words_clean')
df = remover.transform(df)

# 2. Drop words shorter than 3 characters, then drop rows left with no words
df = df.withColumn("words_filtered", F.expr("filter(words_clean, x -> not(length(x) < 3))")) \
       .where(F.size(F.col("words_filtered")) > 0)

# 3. Keep only words whose lemma appears in the NLTK words corpus
wnl = WordNetLemmatizer()

@F.udf('array<string>')
def remove_words(words):
    return [word for word in words if wnl.lemmatize(word) in nltk.corpus.words.words()]

df = df.withColumn('words_final', remove_words('words_filtered'))
     

I get the output as shown below:

df.select('words_final').show(10, truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                         words_final|
+----------------------------------------------------------------------------------------------------+
|[hereby, ordered, cease, desist, address, immediately, authorities, provider, continued, failure,...|
|[content, type, text, plain, content, transfer, printable, apparently, yahoo, tue, return, path, ...|
|[hereby, ordered, cease, desist, address, immediately, authorities, provider, continued, failure,...|
|[original, message, return, path, bounce, received, sender, bounce, tue, pst, results, received, ...|
|[message, return, pat, bounce, received, sender, bounce, tue, pst, results, received, ass, receiv...|
|                                                       [original, message, sent, ball, subject, get]|
|[horse, trailer, free, delivery, cash, back, click, view, horse, magazine, index, option, archive...|
|[original, message, sent, subject, notification, payment, number, hello, payment, amount, payment...|
|[client, specific, styles, outlook, padding, force, outlook, provide, view, browser, button, body...|
|[hereby, ordered, cease, desist, address, immediately, authorities, provider, continued, failure,...|
+----------------------------------------------------------------------------------------------------+
only showing top 10 rows

I see that stop words (are, the, in, etc.) and many junk words such as scottkahndmd, or incomplete words such as furthe, are removed. But a few words like emails, tuesday, february, encoding, quoted, and online are also removed. There could be more such English words getting ignored.

Any reason for this?

In your case, it looks like filtering happens in several places:

  1. the StopWordsRemover removes common words like he, she, myself, etc. Usually these words are not very useful in text models, but that depends on the task you're trying to solve
  2. another layer of filtering is your WordNetLemmatizer plus the lookup in nltk.corpus.words - it is the likely culprit behind the removal of emails, encoding, etc. Note that lemmatize() defaults to the noun part of speech, and the words corpus is case-sensitive and somewhat dated, so lowercase tuesday or a newer word like emails can fail the membership test. Try to tune this step to be less aggressive in removing words; see the sketch after this list
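
A quick way to see which words fail that membership test is to replay it outside Spark. This is a minimal diagnostic sketch, assuming the wordnet and words corpora are downloaded; which words come back False depends on your NLTK data:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('words')

wnl = WordNetLemmatizer()

# Build the vocabulary once as a set: nltk.corpus.words.words() returns a long
# list, and scanning it for every word (as the UDF above does) is also very slow.
vocab = set(nltk.corpus.words.words())

for w in ['emails', 'tuesday', 'february', 'encoding', 'quoted', 'online']:
    print(w, '->', wnl.lemmatize(w), wnl.lemmatize(w) in vocab)

# A less aggressive variant: compare case-insensitively and accept either the
# noun (default) or the verb lemma, so e.g. 'quoted' -> 'quote' survives.
vocab_lower = {w.lower() for w in vocab}

def keep(word):
    return (wnl.lemmatize(word.lower()) in vocab_lower
            or wnl.lemmatize(word.lower(), pos='v') in vocab_lower)

With the lowercased set, calendar words like tuesday and february should pass again (the corpus lists them capitalized); a word like emails will still be dropped if email is simply missing from the underlying word list.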

PS: if you're doing NLP on Spark, I would recommend looking at the Spark NLP package instead. It can be more performant, has more functionality, etc.
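
For reference, here is a rough sketch of such a pipeline in Spark NLP; the annotator names follow its documented API, but the raw input column 'text' and the intermediate column names are assumptions, so adapt them to your data:

import sparknlp
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, StopWordsCleaner, LemmatizerModel
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol('text').setOutputCol('document')
tokenizer = Tokenizer().setInputCols(['document']).setOutputCol('token')
cleaner = StopWordsCleaner().setInputCols(['token']).setOutputCol('clean')
lemmas = LemmatizerModel.pretrained().setInputCols(['clean']).setOutputCol('lemma')
finisher = Finisher().setInputCols(['lemma'])  # convert annotations back to a plain array<string> column

pipeline = Pipeline(stages=[document, tokenizer, cleaner, lemmas, finisher])
result = pipeline.fit(df).transform(df)  # df needs a raw string column 'text'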
