
How to replace a column of a pandas dataframe with only words that exist in the dictionary or a text file?

Hi, I have a pandas DataFrame and a text file that look roughly like this:

df:
+----------------------------------+
|           Description            |
+----------------------------------+
| hello this is a great test $5435 |
| this is an432 entry              |
| ...                              |
| entry number 43535               |
+----------------------------------+

txt:
word1
word2
word3
...
wordn

The descriptions are not important.

I want to go through each row in the df, split it on ' ', and keep each word only if it appears in the text file; otherwise the word should be deleted.

Example:

Suppose my text file looks like this

hello
this
is
a
test

and a description looks like this

"hello this is a great test $5435"

then the output would be "hello this is a test", because "great" and "$5435" are not in the text file.

I can write something like this:

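def clean_string(rows):
    # `text` is assumed to be the collection of words read from the text file
    cleaned_rows = []
    for row in rows:
        cleansed_string = []
        for word in row.split():
            if word in text:
                cleansed_string.append(word)
        # Rebinding `row` would not change `rows`, so collect the results instead
        cleaned_rows.append(' '.join(cleansed_string))
    return cleaned_rows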

But is there a better way to achieve this?

Use:

with open('file.txt', encoding="utf8") as f:
    # splitlines() avoids a trailing '' entry when the file ends with a newline
    L = f.read().splitlines()

print (L)
['hello', 'this', 'is', 'a', 'test']
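
If the word file may contain blank lines or stray whitespace around the words, a slightly more defensive loader can help. The following is only a sketch: the helper name load_vocabulary and the temporary demo file are assumptions for illustration, not part of the original question.

```python
from pathlib import Path
import tempfile


def load_vocabulary(path, encoding="utf8"):
    """Return the set of non-empty, whitespace-stripped lines in the file."""
    with open(path, encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}


# Demo with a temporary file that has a blank line and a trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "file.txt"
    demo_path.write_text("hello\nthis\nis\n\na\ntest\n", encoding="utf8")
    vocabulary = load_vocabulary(demo_path)

print(sorted(vocabulary))
```

Returning a set directly also means the membership test `word in vocabulary` is a hash lookup rather than a scan of a list.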

f = lambda x: ' '.join(y for y in x.split() if y in set(L))
df['Description'] = df['Description'].apply(f)

To improve performance, build the set once and use a list comprehension:

s = set(L)
df['Description'] = [' '.join(y for y in x.split() if y in s) for x in df['Description']]

print (df)
            Description
0  hello this is a test
1               this is
2                      
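
For reference, here is a self-contained sketch of the same idea that can be run without a word file. It assumes the word list is already in memory, and it additionally passes missing descriptions through unchanged, since calling .split() on NaN or None would raise an error. The helper name keep_known_words and the sample rows are illustrative assumptions.

```python
import pandas as pd


def keep_known_words(text, vocabulary):
    """Keep only the whitespace-separated words of `text` found in `vocabulary`."""
    if not isinstance(text, str):  # e.g. NaN/None for a missing description
        return text
    return " ".join(word for word in text.split() if word in vocabulary)


# Assumed in-memory word list; in practice it would come from the text file.
vocabulary = {"hello", "this", "is", "a", "test"}

df = pd.DataFrame({"Description": [
    "hello this is a great test $5435",
    "this is an432 entry",
    None,  # a missing description is left as-is
    "entry number 43535",
]})

df["Description"] = [keep_known_words(x, vocabulary) for x in df["Description"]]
print(df)
```

Note that the match is exact: it is case-sensitive, and a word with attached punctuation (for example "test,") is treated as a different word and dropped.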
