Split a pandas DataFrame column into a variable number of columns
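I have a DataFrame that looks like this (the code to generate it is at the end):
...and I basically want to split the index column to get the following:
After each Type.ID there may be a variable number of comma-separated numbers. I wrote a function to split a single string, but I don't know how to apply it to the column (I looked at apply).
Thanks for your help! Code to set up the input DataFrame: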
import pandas as pd
df = pd.DataFrame({
    'index': pd.Series(['FirstType.FirstID', 'OtherType.OtherID,1', 'OtherType.OtherID,4', 'LastType.LastID,1,1', 'LastType.LastID,1,2', 'LastType.LastID,2,3'], dtype='object', index=pd.RangeIndex(start=0, stop=6, step=1)),
    'value': pd.Series([0.23, 50, 60, 110.0, 199.0, 123.0], dtype='float64', index=pd.RangeIndex(start=0, stop=6, step=1)),
}, index=pd.RangeIndex(start=0, stop=6, step=1))
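Code to split the index values:
import re
def get_header_properties(header):
    # Everything before the first dot is the type
    pf_type = re.match(r".*?(?=\.)", header).group()
    # The ID runs from just after "Type." up to the first comma (or end of string)
    pf_id = re.search(rf"(?<={re.escape(pf_type)}\.).*?(?=(,|$))", header).group()
    # Whatever follows the ID is the comma-separated coordinate list
    pf_coords = re.search(rf"(?<={re.escape(pf_id)}).*", header).group()
    return pf_type, pf_id, pf_coords.split(",")[1:]
get_header_properties("Type.ID,0.625,0.08333")
#-> ('Type', 'ID', ['0.625', '0.08333'])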
You can tweak the function slightly and use it in a list comprehension, then assign the nested list to columns:
import numpy as np
def get_header_properties(header):
    pf_type = re.match(r".*?(?=\.)", header).group()
    pf_id = re.search(rf"(?<={re.escape(pf_type)}\.).*?(?=(,|$))", header).group()
    pf_coords = re.search(rf"(?<={re.escape(pf_id)}).*", header).group()
    coords = pf_coords.split(",")[1:]
    # Pad with NaN so every row ends up with exactly two coordinate entries
    return [pf_type, pf_id] + coords + ([np.nan]*(2-len(coords)) if len(coords)<2 else [])
df[['Type','ID','dim1','dim2']] = [get_header_properties(i) for i in df['index']]
out = df.drop(columns='index')[['Type','ID','dim1','dim2','value']]
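If the number of coordinates is not known in advance, one possible variation on the list comprehension is to build a dict per row and let the DataFrame constructor fill the gaps. This is only a sketch: header_to_dict is a hypothetical helper (not from the post), it uses plain string methods instead of the regexes above, and the three-row frame is a cut-down copy of the input.

```python
import pandas as pd

# Cut-down copy of the input frame from the question.
df = pd.DataFrame({
    "index": ["FirstType.FirstID", "OtherType.OtherID,1", "LastType.LastID,1,2"],
    "value": [0.23, 50.0, 199.0],
})

def header_to_dict(header):
    # Hypothetical helper: "Type.ID,a,b" -> {"Type": ..., "ID": ..., "dim1": a, "dim2": b}
    head, *coords = header.split(",")
    pf_type, _, pf_id = head.partition(".")
    row = {"Type": pf_type, "ID": pf_id}
    row.update({f"dim{i}": c for i, c in enumerate(coords, start=1)})
    return row

# Keys that are missing from a row become NaN when the DataFrame is built,
# so no manual padding is needed.
parts = pd.DataFrame([header_to_dict(h) for h in df["index"]], index=df.index)
out = parts.join(df["value"])
print(out)
```

The coordinate values stay as strings here, just as they do in the padded-list version.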
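That said, rather than using the function, it seems simpler and more efficient to call str.split once on the "index" column and join the result back to df:
df = (df['index'].str.split('[.,]', expand=True)
      .fillna(np.nan)  # turn the None placeholders into NaN
      .rename(columns={i: col for i,col in enumerate(['Type','ID','dim1','dim2'])})
      .join(df[['value']]))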
Output:
        Type       ID  dim1  dim2   value
0  FirstType  FirstID   NaN   NaN    0.23
1  OtherType  OtherID     1   NaN   50.00
2  OtherType  OtherID     4   NaN   60.00
3   LastType   LastID     1     1  110.00
4   LastType   LastID     1     2  199.00
5   LastType   LastID     2     3  123.00
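Note that the dim columns produced by str.split hold strings, not numbers. If numeric values are wanted, something like the following sketch could be used; it assumes every coordinate is a valid number and uses pd.to_numeric, which is not part of the answer above. The three-row frame is a cut-down copy of the input.

```python
import pandas as pd

# Cut-down copy of the input frame from the question.
df = pd.DataFrame({
    "index": ["FirstType.FirstID", "OtherType.OtherID,1", "LastType.LastID,2,3"],
    "value": [0.23, 50.0, 123.0],
})

out = df["index"].str.split("[.,]", expand=True)
out.columns = ["Type", "ID", "dim1", "dim2"]

# The split pieces are strings (or missing); convert the dim columns to numbers.
dims = ["dim1", "dim2"]
out = out.assign(**{col: pd.to_numeric(out[col]) for col in dims})
out = out.join(df["value"])
print(out.dtypes)
```

Because some rows have no coordinates, the converted columns contain NaN and will typically end up as floats rather than integers.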
You can extract with a regular expression directly on the column in question!
>>> df["index"].str.extract(r"([^\.]+)\.([^,]+)(?:,(\d+))?(?:,(\d+))?")
           0        1    2    3
0  FirstType  FirstID  NaN  NaN
1  OtherType  OtherID    1  NaN
2  OtherType  OtherID    4  NaN
3   LastType   LastID    1    1
4   LastType   LastID    1    2
5   LastType   LastID    2    3
Join the value column on at the end (there is room for other columns here too):
df_idx = df["index"].str.extract(r"([^\.]+)\.([^,]+)(?:,(\d+))?(?:,(\d+))?")
df = df_idx.join(df[["value"]])
df = df.rename({0: "Type", 1: "ID", 2: "dim1", 3: "dim2"}, axis=1)
>>> df
        Type       ID  dim1  dim2   value
0  FirstType  FirstID   NaN   NaN    0.23
1  OtherType  OtherID     1   NaN   50.00
2  OtherType  OtherID     4   NaN   60.00
3   LastType   LastID     1     1  110.00
4   LastType   LastID     1     2  199.00
5   LastType   LastID     2     3  123.00
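A small variation that may save the rename step: str.extract uses named groups as column names, so the labels can live in the pattern itself. The sketch below is the same regex with (?P<name>...) groups added; the three-row frame is a cut-down copy of the input.

```python
import pandas as pd

# Cut-down copy of the input frame from the question.
df = pd.DataFrame({
    "index": ["FirstType.FirstID", "OtherType.OtherID,1", "LastType.LastID,2,3"],
    "value": [0.23, 50.0, 123.0],
})

# Same pattern as above, with each capture group given a name.
pattern = r"(?P<Type>[^.]+)\.(?P<ID>[^,]+)(?:,(?P<dim1>\d+))?(?:,(?P<dim2>\d+))?"

# Named groups become the column names of the extracted frame.
out = df["index"].str.extract(pattern).join(df[["value"]])
print(out)
```

As with the original pattern, \d+ only matches whole numbers and the two optional groups cap the result at two dim columns.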
IMO, the simplest approach is just split:
df2 = df['index'].str.split('[,.]', expand=True)
df2.columns = ['Type', 'ID', 'dim1', 'dim2']
df2 = df2.join(df['value'])
NB. The regex here relies on the dot/comma separators, but you can adapt it as needed.
Output:
        Type       ID  dim1  dim2   value
0  FirstType  FirstID  None  None    0.23
1  OtherType  OtherID     1  None   50.00
2  OtherType  OtherID     4  None   60.00
3   LastType   LastID     1     1  110.00
4   LastType   LastID     1     2  199.00
5   LastType   LastID     2     3  123.00
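One case where the separators may need adapting: if the coordinates are decimals (like the 0.625 in the question's function example), splitting on every dot would also break the numbers apart, and a fixed list of four column names would no longer fit. A possible sketch, assuming the type and ID never contain commas and the type never contains a dot, is to split on commas first and then split only the first piece on its first dot. The input rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows with decimal coordinates and up to three of them.
df = pd.DataFrame({
    "index": ["FirstType.FirstID", "OtherType.OtherID,0.5", "LastType.LastID,0.625,0.08333,7"],
    "value": [0.23, 50.0, 110.0],
})

# Split on commas first: column 0 holds "Type.ID", the rest hold coordinates.
pieces = df["index"].str.split(",", expand=True)

# Split only on the first dot, so decimal points in coordinates are untouched.
type_id = pieces[0].str.split(".", n=1, expand=True)
type_id.columns = ["Type", "ID"]

# Name however many coordinate columns turned up: dim1, dim2, ...
coords = pieces.drop(columns=0)
coords.columns = [f"dim{i}" for i in range(1, coords.shape[1] + 1)]

out = pd.concat([type_id, coords, df["value"]], axis=1)
print(out)
```

The number of dim columns then follows the longest row instead of being fixed in advance.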