
Removing stopwords from a text corpus using the Linux command line

I have a text file of about 200 MB (rawtext.txt) and a list of stop words in another text file (stopwords.txt):

I
a
about
an
are
as
at
be
by
com
for

...

I want to remove the stopwords from the text corpus. But how? What is the fastest and easiest way? I'd prefer a command-line tool like sed or tr. I don't want to use Python or NLTK.

Can somebody help? I am using Mac OS X (not Linux).

Convert your input to word-per-line format, and you can filter it with grep:

tr -s '[:blank:]' '\n' < rawtext.txt | fgrep -vwf stopwords.txt 

This way you don't have to build an arbitrarily large regexp, which could be a problem if your stopwords list is large.
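As a quick sanity check, here is a minimal run on a tiny sample file (the name sample.txt and its contents are just illustrative):

printf 'I went to a talk about grep\n' > sample.txt
tr -s '[:blank:]' '\n' < sample.txt | fgrep -vwf stopwords.txt

This splits the sentence into one word per line and drops every token that appears in stopwords.txt (here "I", "a", and "about" from the list shown above). Note that fgrep -w matches whole words case-sensitively, so an uppercase "The" would not be removed by a lowercase "the" in stopwords.txt.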

A working solution (also works on Mac OS):

cat rawtext.txt | grep -o -E '[a-zA-Z]{3,}' | tr '[:upper:]' '[:lower:]' | sort | uniq | grep -vwFf stopwords.txt

This extracts only words of three or more letters (no numbers), converts them to lowercase, sorts and deduplicates them, and then filters out the stop words.

Make sure stopwords.txt has been treated the same way (e.g. lowercased).
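If the stop word list still needs that normalization, a one-liner like this works (stopwords_raw.txt is an assumed name for the unnormalized list):

tr '[:upper:]' '[:lower:]' < stopwords_raw.txt | sort -u > stopwords.txt

The tr step does the lowercasing, and sort -u removes any duplicate entries.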
