简体   繁体   中英

How to add custom stop word list to StopWordsRemover

I am using pyspark.ml.feature.StopWordsRemover class on my pyspark dataframe. It has ID and Text column. In addition to default stop word list provided, I would like to add my own custom list to remove all numeric values from string.

I can see there is a method provided to add setStopWords for this class. I think I'm struggling with the proper syntax to use this method.

from pyspark.sql.functions import *
from pyspark.ml.feature import * 

a = StopWordsRemover(inputCol="words", outputCol="filtered")
b = a.transform(df)

The above code gives me expected results in the filtered column but it only removes / stops standard words. I'm looking for a method to add my own custom list which would have more words and numeric values that I wish to filter.

You can specify it with this :

stopwordList = ["word1","word2","word3"]

StopWordsRemover(inputCol="words", outputCol="filtered" ,stopWords=stopwordList)

A small Note:

The above solution replaces the original list of stop words with the list we supplied.
If you want to add your own stopwords in addition to the existing/predefined stopwords, then we need to append the list with the original list before passing into StopWordsRemover() as a parameter. We transform to set to remove any duplicate.

stopwordList = ["word1","word2","word3"] stopwordList.extend(StopWordsRemover().getStopWords())
stopwordList = list(set(stopwordList))#optionnal
StopWordsRemover(inputCol="words", outputCol="filtered" ,stopWords=stopwordList)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM