Split a pandas DataFrame column into a variable number of columns

I have a DataFrame that looks like this (the code to produce it is at the end):

[Image: the input DataFrame]

... and I basically want to split up the index column to get this:

[Image: the desired output DataFrame]

There could be a variable number of comma-separated numbers after each Type.ID. I've written a function that does the splitting for individual strings, but I don't know how to apply it to a column (I looked at apply).

Thank you for your help! Code to set up the input DataFrame:

import pandas as pd

df = pd.DataFrame({
    'index': pd.Series(['FirstType.FirstID', 'OtherType.OtherID,1','OtherType.OtherID,4','LastType.LastID,1,1', 'LastType.LastID,1,2', 'LastType.LastID,2,3'],dtype='object',index=pd.RangeIndex(start=0, stop=6, step=1)),
    'value': pd.Series([0.23, 50, 60, 110.0, 199.0, 123.0],dtype='float64',index=pd.RangeIndex(start=0, stop=6, step=1)),
}, index=pd.RangeIndex(start=0, stop=6, step=1))

Code to split up index values:

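import re

def get_header_properties(header):
    # Type: everything before the first dot
    pf_type = re.match(r".*?(?=\.)", header).group()
    # ID: everything between "Type." and the first comma (or the end of the string)
    pf_id = re.search(rf"(?<={re.escape(pf_type)}\.).*?(?=(,|$))", header).group()
    # Coordinates: the remainder after the ID, e.g. ",0.625,0.08333"
    pf_coords = re.search(rf"(?<={re.escape(pf_id)}).*", header).group()
    return pf_type, pf_id, pf_coords.split(",")[1:]

get_header_properties("Type.ID,0.625,0.08333")
#-> ('Type', 'ID', ['0.625', '0.08333'])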

You could slightly change the function and use it in a list comprehension, then assign the nested list to columns:

import re
import numpy as np

def get_header_properties(header):
    pf_type = re.match(r".*?(?=\.)", header).group()
    pf_id = re.search(rf"(?<={re.escape(pf_type)}\.).*?(?=(,|$))", header).group()
    pf_coords = re.search(rf"(?<={re.escape(pf_id)}).*", header).group()
    coords = pf_coords.split(",")[1:]
    # Pad with NaN so that rows with fewer than two coordinates still have four entries
    return [pf_type, pf_id] + coords + [np.nan] * (2 - len(coords))

df[['Type','ID','dim1','dim2']] = [get_header_properties(i) for i in df['index']]
out = df.drop(columns='index')[['Type','ID','dim1','dim2','value']]
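The function above pads to exactly two coordinates, while the question mentions a variable number of them. A minimal sketch of one way to generalize the list-comprehension approach follows; the helper name split_header, the shortened sample data, and the dim1, dim2, ... naming scheme are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

def split_header(header):
    # Hypothetical helper: "Type.ID,1,2" -> ["Type", "ID", "1", "2"]
    head, *coords = header.split(",")
    pf_type, pf_id = head.split(".", 1)
    return [pf_type, pf_id, *coords]

# Shortened sample data in the same shape as the question's DataFrame
df = pd.DataFrame({
    "index": ["FirstType.FirstID", "OtherType.OtherID,1", "LastType.LastID,1,2"],
    "value": [0.23, 50.0, 199.0],
})

rows = [split_header(h) for h in df["index"]]

# Pad every row with NaN up to the length of the longest row
width = max(len(r) for r in rows)
padded = [r + [np.nan] * (width - len(r)) for r in rows]

# Name the coordinate columns dim1, dim2, ... according to the widest row
names = ["Type", "ID"] + [f"dim{i}" for i in range(1, width - 1)]
parts = pd.DataFrame(padded, columns=names, index=df.index)
out = parts.join(df["value"])
```

The number of dim columns here follows the data instead of being hard-coded, so a row with three or more coordinates would simply widen the result.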

That said, instead of the function, it seems simpler and more efficient to use str.split once on the "index" column and join the result to df:

df = (df['index'].str.split('[.,]', expand=True)
      .fillna(np.nan)
      .rename(columns={i: col for i,col in enumerate(['Type','ID','dim1','dim2'])})
      .join(df[['value']]))

Output:

        Type       ID dim1 dim2   value
0  FirstType  FirstID  NaN  NaN    0.23
1  OtherType  OtherID    1  NaN   50.00
2  OtherType  OtherID    4  NaN   60.00
3   LastType   LastID    1    1  110.00
4   LastType   LastID    1    2  199.00
5   LastType   LastID    2    3  123.00
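The rename above hard-codes four column names. If the number of comma-separated values is not known in advance, the names can be derived from the width of the split result. This is only a sketch under that assumption; the sample data (including a row with three coordinates) and the dimN naming are made up for illustration.

```python
import pandas as pd

# Sample data with up to three coordinates after Type.ID (illustrative only)
df = pd.DataFrame({
    "index": ["FirstType.FirstID", "OtherType.OtherID,1", "LastType.LastID,1,2,3"],
    "value": [0.23, 50.0, 199.0],
})

# Split on dots and commas; expand=True yields one column per piece
parts = df["index"].str.split("[.,]", expand=True)

# First two columns are Type and ID, the rest become dim1, dim2, ...
n_dims = parts.shape[1] - 2
parts.columns = ["Type", "ID"] + [f"dim{i}" for i in range(1, n_dims + 1)]

out = parts.join(df["value"])
```

Note that this relies on neither the Type nor the ID containing a dot or comma of its own.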

You can directly expand a regex over the problematic column with str.extract!

>>> df["index"].str.extract(r"([^\.]+)\.([^,]+)(?:,(\d+))?(?:,(\d+))?")
           0        1    2    3
0  FirstType  FirstID  NaN  NaN
1  OtherType  OtherID    1  NaN
2  OtherType  OtherID    4  NaN
3   LastType   LastID    1    1
4   LastType   LastID    1    2
5   LastType   LastID    2    3

Join the value column at the end (other columns could be joined here too):

df_idx = df["index"].str.extract(r"([^\.]+)\.([^,]+)(?:,(\d+))?(?:,(\d+))?")
df = df_idx.join(df[["value"]])
df = df.rename({0: "Type", 1: "ID", 2: "dim1", 3: "dim2"}, axis=1)

>>> df
        Type       ID dim1 dim2   value
0  FirstType  FirstID  NaN  NaN    0.23
1  OtherType  OtherID    1  NaN   50.00
2  OtherType  OtherID    4  NaN   60.00
3   LastType   LastID    1    1  110.00
4   LastType   LastID    1    2  199.00
5   LastType   LastID    2    3  123.00
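As a possible variation, str.extract uses named capture groups as column names, which would make the separate rename step unnecessary. The sketch below assumes the same pattern as above with group names added; the shortened sample data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "index": ["FirstType.FirstID", "OtherType.OtherID,1", "LastType.LastID,1,2"],
    "value": [0.23, 50.0, 199.0],
})

# Same pattern as above, but each group is named after its target column
pattern = r"(?P<Type>[^.]+)\.(?P<ID>[^,]+)(?:,(?P<dim1>\d+))?(?:,(?P<dim2>\d+))?"

# Named groups become the column names of the extracted DataFrame
out = df["index"].str.extract(pattern).join(df["value"])
```

Like the original pattern, this handles at most two numbers after Type.ID; more optional groups would have to be added for longer rows.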

IMO, the simplest is just to split:

df2 = df['index'].str.split('[,.]', expand=True)
df2.columns = ['Type', 'ID', 'dim1', 'dim2']

df2 = df2.join(df['value'])

NB. The regex here relies on the dot/comma separators, but you can adapt it if needed.

Output:

        Type       ID  dim1  dim2   value
0  FirstType  FirstID  None  None    0.23
1  OtherType  OtherID     1  None   50.00
2  OtherType  OtherID     4  None   60.00
3   LastType   LastID     1     1  110.00
4   LastType   LastID     1     2  199.00
5   LastType   LastID     2     3  123.00
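In the output above the dim columns hold strings, with None where no value was present. If numeric columns are wanted, one option is pd.to_numeric, which turns the missing entries into NaN. This is a sketch that starts from a hand-built frame in the same shape as df2; selecting columns by the "dim" prefix is an assumption about the naming.

```python
import pandas as pd

# Hand-built stand-in for df2 after the split (strings and None, as in the output above)
df2 = pd.DataFrame({
    "Type": ["FirstType", "OtherType", "LastType"],
    "ID": ["FirstID", "OtherID", "LastID"],
    "dim1": [None, "1", "2"],
    "dim2": [None, None, "3"],
    "value": [0.23, 50.0, 123.0],
})

# Convert every dim* column from strings to numbers; missing entries become NaN
dim_cols = [c for c in df2.columns if c.startswith("dim")]
df2[dim_cols] = df2[dim_cols].apply(pd.to_numeric)
```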
