
Most frequent words in each row

I am trying to find the most frequent words in each row of a tokenized DataFrame, which looks like this:

print(df.tokenized_sents)

['apple', 'inc.', 'aapl', 'reported', 'fourth', 'consecutive', 'quarter', 'record', 'revenue', 'profit', 'combination', 'higher', 'iphone', 'prices', 'strong', 'app-store', 'sales', 'propelled', 'technology', 'giant', 'best', 'year', 'ever', 'revenue', 'three', 'months', 'ended', 'sept.']

['brussels', 'apple', 'inc.', 'aapl', '-.', 'chief', 'executive', 'tim', 'cook', 'issued', 'tech', 'giants', 'strongest', 'call', 'yet', 'u.s.-wide', 'data-protection', 'regulation', 'saying', 'individuals', 'personal', 'information', 'been', 'weaponized', 'mr.', 'cooks', 'call', 'came', 'sharply', 'worded', 'speech', 'before', 'p…']

...

from collections import Counter

wrds = []
for i in range(len(df)):
    wrds.append(Counter(df["tokenized_sents"][i]).most_common(5))

But this reports each row as a list of (word, count) tuples:

print(wrds)

[('revenue', 2), ('apple', 1), ('inc.', 1), ('aapl', 1), ('reported', 1)]
...

I want to create the following DataFrame:

print(final_df)

KeyWords                                                                         
revenue, apple, inc., aapl, reported
...

Note: the rows of the final DataFrame should not be lists but single text values, e.g. revenue, apple, inc., aapl, reported, not [revenue, apple, inc., aapl, reported].

I don't know whether the return format can be changed, but you can reformat the column with apply and a lambda. E.g. df = pd.DataFrame({'wrds':[[('revenue', 2), ('apple', 1), ('inc.', 1), ('aapl', 1), ('reported', 1)]]})

df.wrds.apply(lambda x: [item[0] for item in x])

returns just the list of words: [revenue, apple, inc., aapl, reported]
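Putting this suggestion together into a runnable sketch (column names here follow the question; the final join into one string is added to match the requested output format):

```python
import pandas as pd

# Sample column holding the most_common(5) output from the question
df = pd.DataFrame({'wrds': [[('revenue', 2), ('apple', 1), ('inc.', 1),
                             ('aapl', 1), ('reported', 1)]]})

# Keep only the word from each (word, count) tuple, then join into one string
df['KeyWords'] = df.wrds.apply(lambda x: ', '.join(item[0] for item in x))
print(df['KeyWords'][0])  # revenue, apple, inc., aapl, reported
```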

Like this? Using .apply():

# creating the dataframe
df = pd.DataFrame({"token": [['apple', 'inc.', 'aapl', 'reported', 'fourth', 'consecutive', 'quarter', 'record', 'revenue', 'profit', 'combination', 'higher', 'iphone', 'prices', 'strong', 'app-store', 'sales', 'propelled', 'technology', 'giant', 'best', 'year', 'ever', 'revenue', 'three', 'months', 'ended', 'sept.'], ['brussels', 'apple', 'inc.', 'aapl', '-.', 'chief', 'executive', 'tim', 'cook', 'issued', 'tech', 'giants', 'strongest', 'call', 'yet', 'u.s.-wide', 'data-protection', 'regulation', 'saying', 'individuals', 'personal', 'information', 'been', 'weaponized', 'mr.', 'cooks', 'call', 'came', 'sharply', 'worded', 'speech', 'before', 'p…']
]})
# fetching 5 most common words using .apply and assigning it to keywords column in dataframe
df["keywords"] = df.token.apply(lambda x: ', '.join(i[0] for i in Counter(x).most_common(5)))
df

Output:

    token   keywords
0   [apple, inc., aapl, reported, fourth, consecut...   revenue, apple, inc., aapl, reported
1   [brussels, apple, inc., aapl, -., chief, execu...   call, brussels, apple, inc., aapl

Using a for loop with .itertuples() and .loc():

df = pd.DataFrame({"token": [['apple', 'inc.', 'aapl', 'reported', 'fourth', 'consecutive', 'quarter', 'record', 'revenue', 'profit', 'combination', 'higher', 'iphone', 'prices', 'strong', 'app-store', 'sales', 'propelled', 'technology', 'giant', 'best', 'year', 'ever', 'revenue', 'three', 'months', 'ended', 'sept.'], ['brussels', 'apple', 'inc.', 'aapl', '-.', 'chief', 'executive', 'tim', 'cook', 'issued', 'tech', 'giants', 'strongest', 'call', 'yet', 'u.s.-wide', 'data-protection', 'regulation', 'saying', 'individuals', 'personal', 'information', 'been', 'weaponized', 'mr.', 'cooks', 'call', 'came', 'sharply', 'worded', 'speech', 'before', 'p…']
]})
df["Keyword"] = ""
for row in df.itertuples():
    # take the words from the 5 most common (word, count) pairs
    top5 = [word for word, count in Counter(row.token).most_common(5)]
    df.loc[row.Index, "Keyword"] = ', '.join(top5)
df

Output:

    token   Keyword
0   [apple, inc., aapl, reported, fourth, consecut...   revenue, apple, inc., aapl, reported
1   [brussels, apple, inc., aapl, -., chief, execu...   call, brussels, apple, inc., aapl

Using df.apply

For example:

import pandas as pd
from collections import Counter
tokenized_sents = [['apple', 'inc.', 'aapl', 'reported', 'fourth', 'consecutive', 'quarter', 'record', 'revenue', 'profit', 'combination', 'higher', 'iphone', 'prices', 'strong', 'app-store', 'sales', 'propelled', 'technology', 'giant', 'best', 'year', 'ever', 'revenue', 'three', 'months', 'ended', 'sept.'], 
                   ['brussels', 'apple', 'inc.', 'aapl', '-.', 'chief', 'executive', 'tim', 'cook', 'issued', 'tech', 'giants', 'strongest', 'call', 'yet', 'u.s.-wide', 'data-protection', 'regulation', 'saying', 'individuals', 'personal', 'information', 'been', 'weaponized', 'mr.', 'cooks', 'call', 'came', 'sharply', 'worded', 'speech', 'before', 'p…']

]

df = pd.DataFrame({"tokenized_sents": tokenized_sents})
final_df = pd.DataFrame({"KeyWords" : df["tokenized_sents"].apply(lambda x: [k for k, v in Counter(x).most_common(5)])}) 
#or
#final_df = pd.DataFrame({"KeyWords" : df["tokenized_sents"].apply(lambda x: ", ".join(k for k, v in Counter(x).most_common(5)))})
print(final_df)

Output:

                               KeyWords
0  [revenue, apple, aapl, sales, ended]
1   [call, saying, apple, issued, aapl]
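As an alternative not shown in the answers above, the same per-row counting can be sketched with pandas' own explode and groupby, avoiding Counter entirely. Note one caveat: value_counts may order tied counts differently than Counter.most_common, which preserves first-appearance order.

```python
import pandas as pd

df = pd.DataFrame({"tokenized_sents": [
    ['apple', 'inc.', 'revenue', 'revenue', 'profit'],
    ['call', 'cook', 'call', 'apple'],
]})

# One row per token; explode keeps the original row number as the index
tokens = df["tokenized_sents"].explode()

# Per original row: count tokens, keep the 5 most frequent, join into a string
keywords = tokens.groupby(level=0).apply(
    lambda s: ', '.join(s.value_counts().head(5).index))
print(keywords[0])
```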


Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you need to repost, please credit this site or the original source. For any questions contact: yoyou2525@163.com.
