Break up a list of strings in a pandas dataframe column into new columns based on first word of each sentence
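So I have about 40,000 rows of people and their complaints. I am trying to sort the complaints into their respective columns for analysis, and so that other analysts at my company who use other tools can work with this data.
Example DataFrame: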
import pandas as pd

df = pd.DataFrame({"person": [1, 2, 3],
"problems": ["body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired",
"soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger",
"none"]})
df
╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ ║ person ║ problems ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ 0 ║ 1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired ║
║ 1 ║ 2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║
║ 2 ║ 3 ║ none ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
Desired output:
╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╦════════════════════════════════════════════════════════════════════════════════╦═══════════════════════╦═══════════════╗
║ ║ person ║ problems ║ body ║ mind ║ soul ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╬════════════════════════════════════════════════════════════════════════════════╬═══════════════════════╬═══════════════╣
║ 0 ║ 1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE) ║ mind: stressed, tired ║ NaN ║
║ 1 ║ 2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║ body: feels great(lifts weights), overweight(always bulking), missing a finger ║ mind: can't think ║ soul: missing ║
║ 2 ║ 3 ║ none ║ NaN ║ NaN ║ NaN ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╩════════════════════════════════════════════════════════════════════════════════╩═══════════════════════╩═══════════════╝
Things I have tried / where I am at:
So far I have at least been able to separate the groups with a regex that also works on my real data.
df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)")
+---+-------+--------------------------------------------------------------------------------+
| | | 0 |
+---+-------+--------------------------------------------------------------------------------+
| | match | |
| 0 | 0 | body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE) |
| | 1 | mind: stressed, tired |
| 1 | 0 | soul: missing |
| | 1 | mind: can't think |
| | 2 | body: feels great(lifts weights), overweight(always bulking), missing a finger |
| 2 | 0 | none |
+---+-------+--------------------------------------------------------------------------------+
I am a regex beginner, so I expect this could be done better. My original regex pattern was r'([^;]+)', but I was trying to exclude the space after the semicolon.
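As a small illustration of the whitespace issue, here is a minimal sketch using only the standard `re` module on one of the sample strings. It is not the pattern from the question; it just contrasts matching "everything but semicolons" with splitting on a semicolon plus any whitespace that follows it.

```python
import re

text = "soul: missing; mind: can't think; body: feels great(lifts weights)"

# Matching runs of non-semicolon characters keeps the space after each ';'.
raw = re.findall(r"[^;]+", text)

# Splitting on ';' followed by optional whitespace drops that space.
clean = re.split(r";\s*", text)

print(raw)    # second and third items start with a space
print(clean)  # no leading spaces
```

On a pandas column, `df.problems.str.split(r";\s*")` should behave similarly, assuming the pattern is treated as a regular expression by the installed pandas version.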
So I am at a loss. I have played with:
df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)").unstack()
, which "works" (does not raise an error) on my example here.
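But with my real data I get an error: "ValueError: Index contains duplicate entries, cannot reshape"
Even if it did work with my real data, I would still have to figure out how to get these "categories" (body, mind, soul) into their assigned columns.
I would probably have better luck if I could word this question better. I am really trying to teach myself here, so I would appreciate any leads, even if they are not a complete solution.
I am sort of sniffing a trail that maybe I can do this somehow with groupby or MultiIndex know-how. I am new to programming, so I am still feeling my way around in the dark. I would appreciate any tips or ideas anyone has to offer. Thank you!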
EDIT: I just wanted to come back and mention the error I was getting on my real data, "ValueError: Index contains duplicate entries, cannot reshape",
when using @WeNYoBen's solution:
(df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)")[0]
.str.split(':',expand=True)
.set_index(0,append=True)[1]
.unstack()
.groupby(level=0)
.first())
It turns out I had some groups with more than one colon. For example:
df = pd.DataFrame({"person": [1, 2, 3],
"problems": ["body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired",
"soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger",
"none"]})
╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ ║ person ║ problems ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ 0 ║ 1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired ║
║ 1 ║ 2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║
║ 2 ║ 3 ║ none ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
See the first row, updated to reflect the edge case I found: ; mind: stressed, energy: tired
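To see how an extra colon disturbs the split step, here is a minimal sketch on two made-up segments. It only illustrates what `str.split(':', expand=True)` does with more than one colon; it is not a claim about the exact cause of the error on the real data.

```python
import pandas as pd

segments = pd.Series(["mind: stressed, energy: tired", "soul: missing"])

# Splitting on every colon spills the text of the first segment over an
# extra column, so column 1 no longer holds the full description.
wide = segments.str.split(":", expand=True)

# Limiting the split to the first colon keeps exactly two columns:
# the category and everything after it.
once = segments.str.split(":", n=1, expand=True)

print(wide.shape)  # three columns
print(once.shape)  # two columns
```

With `n=1`, the text `energy: tired` stays attached to the `mind` group instead of being cut off.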
I was able to fix this by changing my regex to say that the start of a match must be either the start of the string or preceded by a semicolon.
splits = [r'(^)(.+?)[:]', r'(;)(.+?)[:]']
str.split('|'.join(splits))
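After that I only had to rework the set_index part a little to get @WeNYoBen's helpful solution working, so I will stick with this one.
It's not elegant, but it gets the job done: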
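The idea of anchoring each category at the start of the string or right after a semicolon can be sketched as follows. This is not the asker's final code: the pattern, the group names `category` and `text`, and the reshaping steps are assumptions made for illustration, and the sketch assumes each category appears at most once per row (otherwise `unstack` would raise the same duplicate-entries error).

```python
import pandas as pd

df = pd.DataFrame({
    "person": [1, 2, 3],
    "problems": [
        "body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired",
        "soul: missing; mind: can't think; body: feels great(lifts weights), "
        "overweight(always bulking), missing a finger",
        "none",
    ],
})

# A category is a word followed by a colon, located at the start of the
# string or right after a semicolon; its text runs up to the next semicolon.
pattern = r"(?:^|;)\s*(?P<category>\w+):\s*(?P<text>[^;]*)"

pairs = df["problems"].str.extractall(pattern)

wide = (
    pairs.reset_index(level="match", drop=True)  # keep only the row label
         .set_index("category", append=True)["text"]
         .unstack("category")                    # one column per category
         .reindex(df.index)                      # restore rows with no match
)

result = df.join(wide)
print(result[["person", "body", "mind", "soul"]])
```

Unlike the desired output in the question, this version stores the text without the `body:` / `mind:` / `soul:` prefix, and rows such as `none` end up as NaN in every category column.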
df['split'] = df.problems.str.split(';')
df['mind'] = df.split.apply(
lambda x: ''.join([category for category in x if 'mind' in category]))
df['body'] = df.split.apply(
lambda x: ''.join([category for category in x if 'body' in category]))
df['soul'] = df.split.apply(
lambda x: ''.join([category for category in x if 'soul' in category]))
# axis=1 is needed here: without it, drop() looks for a row labelled 'split'
df.drop('split', axis=1, inplace=True)
You could wrap
df[cat] = df.split.apply(lambda x: ''.join([category for category in x if cat in category]))
in a function and run it on the dataframe for each cat (e.g. cats=['mind', 'body', 'soul', 'whathaveyou', 'etc.']).
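A minimal sketch of such a wrapper is shown below. The function name `add_category_columns` and its `source` parameter are hypothetical, and the `strip()` call is an addition that removes the space left after each semicolon. Note that rows without a given category get an empty string rather than NaN.

```python
import pandas as pd

def add_category_columns(frame, cats, source="problems"):
    """Return a copy of `frame` with one new column per category in `cats`."""
    out = frame.copy()
    segments = out[source].str.split(";")
    for cat in cats:
        # Bind `cat` as a default argument so each lambda uses its own category.
        out[cat] = segments.apply(
            lambda parts, cat=cat: "".join(p.strip() for p in parts if cat in p)
        )
    return out

df = pd.DataFrame({
    "person": [1, 2, 3],
    "problems": [
        "body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired",
        "soul: missing; mind: can't think; body: feels great(lifts weights), "
        "overweight(always bulking), missing a finger",
        "none",
    ],
})

out = add_category_columns(df, ["mind", "body", "soul"])
print(out[["person", "mind", "body", "soul"]])
```

Working on a copy and keeping the split segments in a local variable avoids the temporary `split` column and the later `drop`.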
EDIT:
As @ifly6 points out, a keyword may also appear inside the text that users typed for another category. To be safe, the function should be changed to
# strip() removes the space left after each ';' so that startswith() can match
df[cat] = df.split.apply(lambda x: ''.join([category for category in x if category.strip().startswith(cat)]))
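The difference between the substring test and the prefix test can be seen on a single made-up string (the sample text is hypothetical and chosen so that the word "mind" appears inside the body complaint). The sketch also shows that segments produced by splitting on `;` keep a leading space, so a bare `startswith` needs a `strip()` first.

```python
segments = "body: out of my mind with pain; mind: stressed".split(";")

# Substring test: the body segment is picked up too, because it mentions "mind".
by_substring = [s for s in segments if "mind" in s]

# Prefix test without strip(): the leading space after ';' hides the match.
by_prefix_raw = [s for s in segments if s.startswith("mind")]

# Prefix test after strip(): only the segment that starts with "mind" remains.
by_prefix = [s for s in segments if s.strip().startswith("mind")]

print(by_substring)
print(by_prefix_raw)
print(by_prefix)
```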