
Removing stopwords from a text corpus using the Linux command line

I have a text file of about 200 MB (rawtext.txt) and a list of stop words in another text file (stopwords.txt):

I
a
about
an
are
as
at
be
by
com
for

...

I want to remove the stopwords from the text corpus. But how? What is the fastest and easiest way? I'd prefer a command-line tool like sed or tr. I don't want to use Python or NLTK.

Can somebody help? I am using Mac OS X (not Linux).

Convert your input to word-per-line format, and you can filter it with grep:

tr -s '[:blank:]' '\n' < rawtext.txt | fgrep -vwf stopwords.txt 

This way you don't have to build an arbitrarily large regexp, which could be a problem if your stopwords list is large.
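As a quick sanity check, here is a minimal run on a tiny sample file (the name sample.txt and its contents are just illustrative):

printf 'I went to a talk about grep\n' > sample.txt
tr -s '[:blank:]' '\n' < sample.txt | fgrep -vwf stopwords.txt

This splits the sentence into one word per line and drops every token that appears in stopwords.txt (here "I", "a", and "about" from the list shown above). Note that fgrep -w matches whole words case-sensitively, so an uppercase "The" would not be removed by a lowercase "the" in stopwords.txt.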

A working solution (also works on Mac OS):

cat rawtext.txt | grep -o -E '[a-zA-Z]{3,}' | tr '[:upper:]' '[:lower:]' | sort | uniq | grep -vwFf stopwords.txt

This extracts only words of three or more letters (no numbers), converts them to lowercase, sorts and deduplicates them, and then filters out the stop words.

Make sure stopwords.txt has been treated the same way (e.g. lowercased).
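If the stop word list still needs that normalization, a one-liner like this works (stopwords_raw.txt is an assumed name for the unnormalized list):

tr '[:upper:]' '[:lower:]' < stopwords_raw.txt | sort -u > stopwords.txt

The tr step does the lowercasing, and sort -u removes any duplicate entries.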
