Pandas DataFrame - Split string into multiple columns
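I am new to the pandas framework. I have searched quite a bit for a solution to my problem but did not find much help online.
I have a string column, shown below, that I want to convert into separate columns. I tried splitting it, but the output is not what I need.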
*-----------------------------------------------------------------------------*
| Total Visitor |
*-----------------------------------------------------------------------------*
| 2x Adult, 1x Adult + Audio Guide |
| 2x Adult, 2x Youth, 1x Children |
| 5x Adult + Audio Guide, 1x Children + Audio Guide, 1x Senior + Audio Guide |
*-----------------------------------------------------------------------------*
This is the code I used to split the string; it does not give me the expected output.
df = data["Total Visitor"].str.split(",", n = 1, expand = True)
After splitting the string, my expected output should look like the table below:
*------------------------------------------------------------------------------------------------------------------*
| Adult    | Adult + Audio Guide    | Youth    | Children    | Children + AG             | Senior + AG             |
*------------------------------------------------------------------------------------------------------------------*
| 2x Adult | 1x Adult + Audio Guide | -        | -           | -                         | -                       |
| 2x Adult | -                      | 2x Youth | 1x Children | -                         | -                       |
| -        | 5x Adult + Audio Guide | -        | -           | 1x Children + Audio Guide | 1x Senior + Audio Guide |
*------------------------------------------------------------------------------------------------------------------*
How can I achieve this? Any help or guidance would be great.
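The idea is to create a list of dictionaries whose keys are the items with the leading count and x removed by the regex ^\d+x\s+ (^ is the start of the string, \d+ is one or more digits and \s+ is one or more whitespace characters), and pass it to the DataFrame constructor: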
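import re
import pandas as pd

# Key: the item without its leading count (e.g. "Adult"); value: the original item (e.g. "2x Adult")
L = [dict([(re.sub(r'^\d+x\s+', "", y), y) for y in x.split(', ')]) for x in df['Total Visitor']]
df = pd.DataFrame(L).fillna('-')
print (df)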
Adult Adult + Audio Guide Youth Children \
0 2x Adult 1x Adult + Audio Guide - -
1 2x Adult - 2x Youth 1x Children
2 - 5x Adult + Audio Guide - -
Children + Audio Guide Senior + Audio Guide
0 - -
1 - -
2 1x Children + Audio Guide 1x Senior + Audio Guide
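For anyone who wants to run the dictionary approach end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the table in the question (an assumption about the real data), and the names `row_to_dict`, `records` and `wide` are made up for illustration.

```python
import re
import pandas as pd

# Sample data rebuilt from the question's table (assumed to be plain strings).
data = pd.DataFrame({
    "Total Visitor": [
        "2x Adult, 1x Adult + Audio Guide",
        "2x Adult, 2x Youth, 1x Children",
        "5x Adult + Audio Guide, 1x Children + Audio Guide, 1x Senior + Audio Guide",
    ]
})

def row_to_dict(cell):
    """Map each ticket type (count prefix removed) to the original item text."""
    items = cell.split(", ")
    return {re.sub(r"^\d+x\s+", "", item): item for item in items}

# One dictionary per row; pandas aligns the keys into columns
# and leaves NaN where a row has no item of that type.
records = [row_to_dict(cell) for cell in data["Total Visitor"]]
wide = pd.DataFrame(records).fillna("-")
print(wide)
```

Note that if the same ticket type appeared twice in one cell, the later item would silently overwrite the earlier one, because dictionary keys are unique.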
Another similar idea is to split each item on 'x ' to get the column names for the dictionary keys:
L = [dict([(y.split('x ')[1], y) for y in x.split(', ')]) for x in df['Total Visitor']]
df = pd.DataFrame(L).fillna('-')
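One caveat with splitting on 'x ': it works for the data in the question, but a ticket type that itself contains "x " would be cut short. The sketch below uses a hypothetical type, "Box Seat", which is not in the question's data, and a made-up helper `ticket_type` that limits the split to the first separator.

```python
def ticket_type(item):
    """Return the text after the first "x " in an item such as "2x Adult"."""
    # maxsplit=1 stops after the first separator, so any later "x " stays intact.
    return item.split("x ", 1)[1]

item = "1x Box Seat"  # hypothetical ticket type, not in the question's data

print(item.split("x ")[1])  # Bo
print(ticket_type(item))    # Box Seat
```

The regex version with ^\d+x\s+ does not have this problem, since it only removes the prefix at the start of the string.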
Here is one way to do it using pandas methods:
dstack = df['Total Visitor'].str.split(',', expand=True).stack().str.strip().to_frame()
dstack['cols'] = dstack[0].str.extract(r'\d+x\s(.*)')
df_out = dstack.set_index('cols', append=True)[0].reset_index(level=1, drop=True).unstack()
df_out
Output:
cols Adult Adult + Audio Guide Children Children + Audio Guide Senior + Audio Guide Youth
0 2x Adult 1x Adult + Audio Guide NaN NaN NaN NaN
1 2x Adult NaN 1x Children NaN NaN 2x Youth
2 NaN 5x Adult + Audio Guide NaN 1x Children + Audio Guide 1x Senior + Audio Guide NaN
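The output above has NaN where the question asks for "-". A possible variant of the same reshape idea is sketched below; it swaps `stack` for `explode` and `unstack` for `pivot`, then fills the gaps with `fillna("-")`. The sample frame is rebuilt from the question's table, and the names `items`, `types`, `long_df` and `wide` are made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question's table (an assumption about the real data).
df = pd.DataFrame({
    "Total Visitor": [
        "2x Adult, 1x Adult + Audio Guide",
        "2x Adult, 2x Youth, 1x Children",
        "5x Adult + Audio Guide, 1x Children + Audio Guide, 1x Senior + Audio Guide",
    ]
})

# One row per item; explode repeats the original row label in the index.
items = df["Total Visitor"].str.split(",").explode().str.strip()

# Ticket type = the item text after the leading "<count>x ".
types = items.str.extract(r"^\d+x\s+(.*)$", expand=False)

long_df = pd.DataFrame({
    "row": items.index.to_numpy(),
    "cols": types.to_numpy(),
    "item": items.to_numpy(),
})

# pivot raises an error if the same ticket type appears twice in one original row.
wide = long_df.pivot(index="row", columns="cols", values="item").fillna("-")
print(wide)
```

As in the output above, the columns come out in alphabetical order rather than in order of first appearance.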