
Count occurrences of words from a list in a dataframe column

There is a dataframe column with text and a list of words. I would like to:

#Clean

  • Remove special characters (. , ^ * ...)
  • Convert to lower case
  • Split the text into words on spaces

#Create another dataframe that displays occurrences of those words contained within the list as per below:

df = pd.DataFrame([["word1 word,! word3 word4* word split5^", "other data"], ["word2 word,* word3 word4 word5", "other data"]], columns=['Description1', 'other colum'])

lista = ['word1', 'word2','word3','word4','word split5']

#Wanted result
df2 = pd.DataFrame([["word1", "1"], ["word2", "1"], ["word3", "2"], ["word4", "2"], ["word split5", "1"]], columns=['Listed words', 'occurences'])

Here is code that does what you ask:

import pandas as pd
from collections import Counter  # built-in, the easiest way to count elements

df = pd.DataFrame([["word1 word,! word3 word4* word split5^", "other data"], 
                   ["word2 word,* word3 word4 word5", "other data"]], 
                  columns=['Description1', 'other colum'])

# split each description into tokens on spaces
# for each token, strip special characters and lower-case it
# save the list of all processed tokens to res
res = []
for elem in df["Description1"].to_list():
    res.extend(''.join(filter(str.isalnum, e)).lower() for e in elem.split(" "))

# count every token once and reuse the result
counts = Counter(res)

# make a new df
df2 = pd.DataFrame({"ListedWords": list(counts.keys()),     # each unique token
                    "Occurences": list(counts.values())})   # how often it occurs
df2

Output:

Out[66]: 
  ListedWords  Occurences
0       word1           1
1        word           3
2       word3           2
3       word4           2
4      split5           1
5       word2           1
6       word5           1

The code splits the text on spaces, removes special characters, and lower-cases each word (in this order), as you asked. Note that it counts every word in the column, not only the words in lista.
Two remarks. First, I use Counter from the built-in collections module, since it is the easiest way to count the elements of a list. Second, my output differs from the one in your example: if you split on spaces, "word split5" can never appear in the output, because it becomes the two separate words "word" and "split5". The same goes for word,! : under your criteria it is stored in the final df as word, since it is a separate space-delimited word whose special characters are stripped.
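
If the multi-word entry "word split5" has to be counted as one unit, one possible approach is to clean the text without splitting it and then count each entry of lista as a whole phrase. The following is only a sketch under that assumption, not part of the original answer: the helper clean and the whitespace-boundary regular expression are illustrative choices, and the column names are copied as spelled in the question.

```python
import re
import pandas as pd

df = pd.DataFrame(
    [["word1 word,! word3 word4* word split5^", "other data"],
     ["word2 word,* word3 word4 word5", "other data"]],
    columns=["Description1", "other colum"],
)
lista = ["word1", "word2", "word3", "word4", "word split5"]

def clean(text):
    """Hypothetical helper: drop special characters, lower-case, collapse whitespace."""
    text = re.sub(r"[^0-9a-zA-Z\s]", "", text).lower()
    return " ".join(text.split())

cleaned = df["Description1"].map(clean)

# Count each listed phrase only where it is delimited by whitespace or the
# start/end of the string, so "word" inside "word1" is not counted.
counts = {
    phrase: int(cleaned.str.count(rf"(?<!\S){re.escape(phrase)}(?!\S)").sum())
    for phrase in lista
}

# Column names kept exactly as in the question's wanted result.
df2 = pd.DataFrame({"Listed words": list(counts.keys()),
                    "occurences": list(counts.values())})
print(df2)
```

With the sample data this gives 1, 1, 2, 2, 1 for the five entries of lista, as integers rather than the strings used in the question's wanted result.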

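Also note that the rows are not in the order of your list: Counter is a dict subclass, and dicts preserve insertion order (guaranteed since Python 3.7), so the words appear in the order in which they first occur in the text. You can use df2.sort_values(by=["ListedWords"]) to sort the rows; it returns a new dataframe, so assign the result back to df2 if you want to keep it.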
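
To control the row order explicitly, the result can be sorted by word or by count. This is a minimal sketch: the list res is written out by hand to hold the tokens the code above produces for the sample data, and the names by_word and by_count are illustrative.

```python
from collections import Counter
import pandas as pd

# Tokens produced from the sample data after splitting and cleaning.
res = ["word1", "word", "word3", "word4", "word", "split5",
       "word2", "word", "word3", "word4", "word5"]

counts = Counter(res)
df2 = pd.DataFrame({"ListedWords": list(counts.keys()),
                    "Occurences": list(counts.values())})

# Alphabetical order; sort_values returns a new dataframe.
by_word = df2.sort_values(by="ListedWords").reset_index(drop=True)

# Most frequent first, ties broken alphabetically.
by_count = df2.sort_values(by=["Occurences", "ListedWords"],
                           ascending=[False, True]).reset_index(drop=True)

print(by_word)
print(by_count)
```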
