Counting occurrences of words from a list in a DataFrame column

Question

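I have a DataFrame column containing text, and a list of words. I want to:

# Clean

Remove special characters (. , ^ * ...)
Convert to lowercase
Split the text into words on whitespace

# Create another DataFrame showing the number of occurrences of the words contained in the list, like this: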

df = pd.DataFrame([["word1 word,! word3 word4* word split5^", "other data"], ["word2 word,* word3 word4 word5", "other data"]], columns=['Description1', 'other colum'])

lista = ['word1', 'word2','word3','word4','word split5']

#Wanted result
df2 = pd.DataFrame([["word1", "1"], ["word2", "1"], ["word3", "2"], ["word4", "2"], ["word split5", "1"]], columns=['Listed words', 'occurences'])

Answer 1

I have some code that does what you ask:

import pandas as pd

df = pd.DataFrame([["word1 word,! word3 word4* word split5^", "other data"], 
                   ["word2 word,* word3 word4 word5", "other data"]], 
                  columns=['Description1', 'other colum'])

# Split each description on spaces, then strip non-alphanumeric
# characters from every token and lowercase it.
# Collect all processed tokens in res.
res = []
for text in df["Description1"].to_list():
    res.extend(''.join(filter(str.isalnum, tok)).lower() for tok in text.split(sep=" "))

# Counter is the simplest way to count elements
from collections import Counter

# Build the new DataFrame from a single Counter
counts = Counter(res)
df2 = pd.DataFrame({"ListedWords": list(counts.keys()),      # each unique token
                    "Occurences": list(counts.values())})    # its count
df2

Output:

Out[66]: 
  ListedWords  Occurences
0       word1           1
1        word           3
2       word3           2
3       word4           2
4      split5           1
5       word2           1
6       word5           1

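So the code splits on spaces, strips special characters, and lowercases each word, as you asked (in that order).
Two remarks. First, I use Counter (from the standard-library collections module) because it is the simplest way to count the items in a list. Second, my output differs from the one in your example: if you split on spaces, "word split5" can never appear in the result, since it becomes the two tokens word and split5. The same applies to word,!: under your rules it is stored as word in the final df, because it is a single space-delimited token whose special characters are removed.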
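If the multi-word entry "word split5" has to be counted as one item, one option is to clean each description first and then count every listed phrase in the cleaned text instead of counting whitespace tokens. The following is only a sketch of that idea, not the method used in the code above: the helper name clean, the pattern [^\w\s] (which keeps letters, digits, and underscores), and the use of word boundaries are all assumptions about what "special characters" and "a word" should mean here.

```python
import re
import pandas as pd

df = pd.DataFrame(
    [["word1 word,! word3 word4* word split5^", "other data"],
     ["word2 word,* word3 word4 word5", "other data"]],
    columns=["Description1", "other colum"],
)
lista = ["word1", "word2", "word3", "word4", "word split5"]


def clean(text):
    """Hypothetical helper: lowercase, drop special characters, collapse spaces."""
    # Assumption: anything that is not a word character or whitespace is "special".
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


cleaned = df["Description1"].map(clean)

# Count each listed phrase as a whole, so "word split5" stays one item.
# \b keeps "word" from matching inside "word1"; re.escape guards against
# regex metacharacters in the listed phrases.
counts = {
    phrase: int(cleaned.str.count(rf"\b{re.escape(phrase)}\b").sum())
    for phrase in lista
}

df2 = pd.DataFrame({"Listed words": list(counts),
                    "occurences": list(counts.values())})
print(df2)
```

With the sample data this yields counts of 1, 1, 2, 2, 1 for the five listed entries, in list order. Words that are not in lista (such as word or word5) are simply not reported.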

Also note that the rows are not in the same order as your list: Counter keeps keys in order of first appearance. You can sort the result with df2.sort_values(by=["ListedWords"]).
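As a small illustration of the ordering point, the sketch below sorts the token counts alphabetically and, alternatively, lines them up with the original list using reindex. The token list res is written out by hand to match the output shown earlier; the reindex step and fill_value=0 are my own additions, assuming a listed word that never occurs as a token should show a count of 0.

```python
from collections import Counter
import pandas as pd

# Tokens as produced by splitting on spaces and stripping special characters
res = ["word1", "word", "word3", "word4", "word", "split5",
       "word2", "word", "word3", "word4", "word5"]
lista = ["word1", "word2", "word3", "word4", "word split5"]

counts = Counter(res)
df2 = pd.DataFrame({"ListedWords": list(counts.keys()),
                    "Occurences": list(counts.values())})

# Alphabetical order instead of first-appearance order
sorted_df = df2.sort_values(by=["ListedWords"]).reset_index(drop=True)

# Keep only the listed words, in list order; missing entries get 0
listed = df2.set_index("ListedWords").reindex(lista, fill_value=0)

print(sorted_df)
print(listed)
```

Note that "word split5" comes out as 0 here, because token-based counting never sees it as a single item.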

Solution 1
0 2021-08-13 10:22:31