
Break up a list of strings in a pandas dataframe column into new columns based on first word of each sentence

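So I have roughly 40,000 rows of people and their complaints. I am trying to sort the complaints into their respective columns for analysis, and so that other analysts at my company, who use other tools, can work with the data.

Example DataFrame: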

import pandas as pd

df = pd.DataFrame({"person": [1, 2, 3], 
                   "problems": ["body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired", 
                                "soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger", 
                                "none"]})
df     
╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║   ║ person ║                                                     problems                                                     ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ 0 ║      1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired                                         ║
║ 1 ║      2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║
║ 2 ║      3 ║ none                                                                                                             ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Desired output:

╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╦════════════════════════════════════════════════════════════════════════════════╦═══════════════════════╦═══════════════╗
║   ║ person ║                                                     problems                                                     ║                                      body                                      ║         mind          ║     soul      ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╬════════════════════════════════════════════════════════════════════════════════╬═══════════════════════╬═══════════════╣
║ 0 ║      1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired                                         ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE)                              ║ mind: stressed, tired ║ NaN           ║
║ 1 ║      2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║ body: feels great(lifts weights), overweight(always bulking), missing a finger ║ mind: can't think     ║ soul: missing ║
║ 2 ║      3 ║ none                                                                                                             ║ NaN                                                                            ║ NaN                   ║ NaN           ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╩════════════════════════════════════════════════════════════════════════════════╩═══════════════════════╩═══════════════╝

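Things I've tried / where I'm at:

So far I've at least been able to separate the groups with a regex, and it seems to do the job on my real data.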

df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)")


+---+-------+--------------------------------------------------------------------------------+
|   |       |                                       0                                        |
+---+-------+--------------------------------------------------------------------------------+
|   | match |                                                                                |
| 0 | 0     | body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE)                              |
|   | 1     | mind: stressed, tired                                                          |
| 1 | 0     | soul: missing                                                                  |
|   | 1     | mind: can't think                                                              |
|   | 2     | body: feels great(lifts weights), overweight(always bulking), missing a finger |
| 2 | 0     | none                                                                           |
+---+-------+--------------------------------------------------------------------------------+

I'm a regex beginner, so I expect this could be done better. My original regex pattern was r'([^;]+)', but I was trying to exclude the space after the semicolon.
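For what it's worth, here is a minimal sketch of the difference using only the standard `re` module. It is an illustration rather than the pattern used above: the original `[^;]+` keeps the space that follows each semicolon, while splitting on a semicolon plus optional whitespace (`;\s*`, an assumed alternative) drops it.

```python
import re

text = "body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired"

# The original pattern keeps the space that follows each semicolon.
with_space = re.findall(r"[^;]+", text)

# Splitting on a semicolon plus any trailing whitespace drops that space.
without_space = re.split(r";\s*", text)

print(repr(with_space[1]))     # ' mind: stressed, tired'
print(repr(without_space[1]))  # 'mind: stressed, tired'
```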

So I'm at a loss. I played with:

df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)").unstack(), which "worked" (didn't error out) on my example here.

But with my real data, I get an error: "ValueError: Index contains duplicate entries, cannot reshape"
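The real data isn't shown, so the following is only a guess at one possible cause: if the DataFrame's own index contains repeated labels, the (row, match) index that `extractall` produces is no longer unique and `unstack` cannot reshape it. The sketch below uses made-up data with a deliberately duplicated index label, and shows that resetting the index first avoids the error in that situation.

```python
import pandas as pd

# Hypothetical data: two rows that share the index label 0.
s = pd.Series(["body: sore; mind: tired", "soul: missing"], index=[0, 0])

matches = s.str.extractall(r"([^;]+)")
try:
    matches.unstack()
    error = None
except ValueError as exc:
    error = str(exc)  # the "duplicate entries" reshape error

# With a fresh, unique index the same reshape goes through.
fixed = s.reset_index(drop=True).str.extractall(r"([^;]+)").unstack()

print(error)
print(fixed)
```

If your index is already unique, the cause is something else and this won't help.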

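Even if it did work on my real data, I'd still have to figure out how to get these "categories" (body, mind, soul) into their assigned columns.

I'd probably have better luck if I could word this question better. I'm really trying to teach myself here, so I'd appreciate any leads, even if they aren't a complete solution.

I'm kind of sniffing out a trail: maybe I can do this somehow with groupby or MultiIndex know-how. I'm new to programming, so I'm still feeling my way around in the dark. I'd appreciate any tips or ideas anyone has to offer. Thank you!

EDIT: I just wanted to come back and mention the error I was getting on my real data, "ValueError: Index contains duplicate entries, cannot reshape", when using @WeNYoBen's solution: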

(df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)")[0]
.str.split(':',expand=True)
.set_index(0,append=True)[1]
.unstack()
.groupby(level=0)
.first())

It turns out I had some groups with multiple colons. For example:

df = pd.DataFrame({"person": [1, 2, 3], 
                   "problems": ["body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired", 
                                "soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger", 
                                "none"]})




╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║   ║ person ║                                                     problems                                                     ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ 0 ║      1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired                                 ║
║ 1 ║      2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║
║ 2 ║      3 ║ none                                                                                                             ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

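See the updated first row, which reflects the edge case I discovered: mind: stressed, energy: tired

I was able to fix this by altering my regex to say that the beginning of a match must be either the beginning of the string or be preceded by a semicolon.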

splits = [r'(^)(.+?)[:]', r'(;)(.+?)[:]']
df.problems.str.split('|'.join(splits))
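To make the "start of string or after a semicolon" rule concrete, here is a small sketch with the standard `re` module. The pattern and the `categories` helper are hypothetical, not the exact ones above: a category name is taken to be the text between the start of the string (or a semicolon) and the next colon, so a stray colon later in a group, as in `energy: tired`, is not treated as a new category.

```python
import re

# Hypothetical pattern: a category name sits between the start of the string
# (or a semicolon) and the next colon.
CATEGORY = re.compile(r"(?:^|;)\s*([^:;]+):")

def categories(text):
    """Return the category names found in one complaint string."""
    return CATEGORY.findall(text)

row = "body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired"
print(categories(row))     # ['body', 'mind']
print(categories("none"))  # []
```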

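After that I just had to readjust the set_index part to get @WeNYoBen's helpful solution working, so I'll stick with this one.

It's not elegant, but it gets the job done: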
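For anyone landing here later, the whole idea can also be sketched without the extractall regex. This is an alternative under stated assumptions, not @WeNYoBen's solution: groups are separated by semicolons, the category is whatever comes before the first colon in a group, and rows without any colon (like "none") get NaN everywhere. The names `long`, `wide`, `row`, `key` and `segment` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "person": [1, 2, 3],
    "problems": [
        "body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired",
        "soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger",
        "none",
    ],
})

# One row per semicolon-separated segment; "row" keeps the original index label.
long = (
    df["problems"].str.split(";").explode().str.strip()
    .rename_axis("row").to_frame("segment").reset_index()
)

# Keep only segments that have a category, i.e. contain a colon.
long = long[long["segment"].str.contains(":", na=False)].copy()

# The category is the text before the FIRST colon; later colons stay in the text.
long["key"] = long["segment"].str.split(":", n=1).str[0].str.strip()

# Join repeated categories within a row so the reshape never sees duplicates.
wide = long.groupby(["row", "key"])["segment"].agg("; ".join).unstack()

result = df.join(wide)
print(result[["person", "body", "mind", "soul"]])
```

Aggregating with groupby before unstack is a deliberate choice here: if a person lists the same category twice, the two segments are joined instead of raising the "duplicate entries" error.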

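# Split each complaint string into its semicolon-separated segments
df['split'] = df.problems.str.split(';')
# For each category, keep the segments that mention it (empty string if none)
df['mind'] = df['split'].apply(
    lambda x: ''.join([category for category in x if 'mind' in category]))
df['body'] = df['split'].apply(
    lambda x: ''.join([category for category in x if 'body' in category]))
df['soul'] = df['split'].apply(
    lambda x: ''.join([category for category in x if 'soul' in category]))
# Drop the helper column; columns= is needed because drop() targets row labels by default
df.drop(columns='split', inplace=True)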

You could wrap

df[cat] = df.split.apply(lambda x: ''.join([category for category in x if cat in category])) 

in a function and run it on the dataframe for each cat (e.g. cats=['mind', 'body', 'soul', 'whathaveyou', 'etc.']).


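EDIT

As @ifly6 pointed out, a keyword could also turn up inside the text of another category in the strings users entered. To be safe, the function should be changed to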

df[cat] = df.split.apply(lambda x: ''.join([category for category in x if category.strip().startswith(cat)]))
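Put together, the wrapping could look something like the sketch below. `add_category_columns` is a hypothetical helper name, and two details are my own assumptions rather than part of the answer: each segment is stripped so the space after a semicolon cannot defeat `startswith`, and rows with no match get None instead of an empty string so they show up as missing values.

```python
import pandas as pd

def add_category_columns(df, cats, source="problems"):
    """Hypothetical helper: add one column per category name in `cats`."""
    out = df.copy()
    # Strip each segment so the leading space after ';' cannot defeat startswith().
    parts = out[source].str.split(";").apply(lambda segs: [s.strip() for s in segs])
    for cat in cats:
        # Match "cat:" at the start of a segment; None marks rows with no match.
        out[cat] = parts.apply(
            lambda segs, cat=cat: "; ".join(s for s in segs if s.startswith(cat + ":")) or None
        )
    return out

df = pd.DataFrame({
    "person": [1, 2, 3],
    "problems": [
        "body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired",
        "soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger",
        "none",
    ],
})

out = add_category_columns(df, ["body", "mind", "soul"])
print(out[["person", "body", "mind", "soul"]])
```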
