How to replace a column of a pandas dataframe with only words that exist in the dictionary or a text file?

Question

Hi, I have a pandas DataFrame and a text file that look roughly like this:

df:
+----------------------------------+
|           Description            |
+----------------------------------+
| hello this is a great test $5435 |
| this is an432 entry              |
| ...                              |
| entry number 43535               |
+----------------------------------+

txt:
word1
word2
word3
...
wordn

The descriptions are not important.

I want to go through each row in the df, split it on ' ', and keep each word only if it appears in the text file; otherwise the word should be deleted.

Example:

Suppose my text file looks like this

hello
this
is
a
test

and a description looks like this

"hello this is a great test $5435"

then the output would be "hello this is a test", because "great" and "$5435" are not in the text file.

I can write something like this:

def clean_string(rows):
    cleaned_rows = []
    for row in rows:
        # Keep only the words that appear in the word list `text`
        cleansed_string = []
        for word in row.split():
            if word in text:
                cleansed_string.append(word)
        cleaned_rows.append(' '.join(cleansed_string))
    return cleaned_rows

But is there a better way to achieve this?

Answer 1

Use:

# Read the word list, one word per line (splitlines() avoids a trailing empty entry)
with open('file.txt', encoding="utf8") as f:
    L = f.read().splitlines()

print(L)
['hello', 'this', 'is', 'a', 'test']
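
If the word file might contain blank lines or stray whitespace, a small loader along the following lines could normalise it before filtering. This is only a sketch: load_words is a hypothetical helper name, and io.StringIO stands in for the real file so the snippet runs on its own.

```python
import io


def load_words(lines):
    """Build a set of words from an iterable of lines, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


# io.StringIO stands in for open('file.txt', encoding='utf8') so the sketch runs as-is.
sample_file = io.StringIO("hello\nthis\nis\na\ntest\n\n")
words = load_words(sample_file)
print(sorted(words))
```

Returning a set directly means the membership test in the filtering step does not need a separate conversion.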

f = lambda x: ' '.join(y for y in x.split() if y in set(L))
df['Description'] = df['Description'].apply(f)

To improve performance, build the set once (instead of calling set(L) for every word) and use a list comprehension:

s = set(L)
df['Description'] = [' '.join(y for y in x.split() if y in s) for x in df['Description']]

print(df)
            Description
0  hello this is a test
1               this is
2
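
For reference, the whole approach can be put together as a self-contained sketch using the sample rows from the question. The in-memory set allowed stands in for the contents of the text file, and keep_allowed is a hypothetical helper name, not something from the original answer.

```python
import pandas as pd

# Sample rows taken from the question; `allowed` stands in for the text file.
df = pd.DataFrame({
    "Description": [
        "hello this is a great test $5435",
        "this is an432 entry",
        "entry number 43535",
    ]
})
allowed = {"hello", "this", "is", "a", "test"}


def keep_allowed(text, vocab):
    """Return text with only the whitespace-separated words found in vocab."""
    return " ".join(word for word in text.split() if word in vocab)


# Build the new column with a list comprehension over the existing values.
df["Description"] = [keep_allowed(text, allowed) for text in df["Description"]]
print(df["Description"].tolist())
```

Note that the match is exact and case-sensitive, so a word such as "Hello" or "test," (with trailing punctuation) would be dropped unless the text is normalised first.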

solution1
1 ACCEPTED 2019-10-07 11:04:42