How to split variable sized string-based column into multiple columns in Pandas DataFrame?

I have a pandas DataFrame of the form:

A      B       C     D
A1     6       7.5   NaN
A1     4       23.8  <D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>
A2     7       11.9  <D1 2.0 7.5 10 2, D3 7.5 4.2 13.5 4> 
A3    11       0.8   <D2 2.0 7.5 10 2, D3 7.5 4.2 13.5 4, D4 2.0 7.5 10 2, D5 7.5 4.2 13.5 4>

Column D is a raw-string column with multiple categories in each entry. The value for each category is calculated by dividing the second-to-last number by the last number in that category. For example, in the 2nd row:

D1 = 12/4 = 3
D2 = 3.5/1 = 3.5
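The rule above can be sketched for a single entry in plain Python. The helper name `parse_entry` is hypothetical (it does not appear in the original post), and the sketch assumes every category chunk ends with two numeric tokens, as in the sample data.

```python
def parse_entry(entry):
    """Return {category: second_to_last / last} for one raw D entry."""
    # Missing entries (NaN is a float, not a str) yield no categories.
    if not isinstance(entry, str):
        return {}
    ratios = {}
    # Strip the angle brackets, then split into per-category chunks.
    for chunk in entry.strip().strip("<>").split(","):
        tokens = chunk.split()
        # First token is the category name; the last two are the operands.
        ratios[tokens[0]] = float(tokens[-2]) / float(tokens[-1])
    return ratios


print(parse_entry("<D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>"))
# {'D1': 3.0, 'D2': 3.5}
```

Applying a function like this row by row is essentially the slow iterative approach, so it is useful mainly as a reference for checking a vectorized result.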

I need to split column D based on its categories and join them to my DataFrame. The problem is that the column is dynamic and can have nearly 35-40 categories within a single entry. For now, I am using a brute-force approach that iterates over all rows, which is very slow for large datasets. Can someone please help me?

EXPECTED OUTCOME

A      B       C     D1  D2  D3  D4  D5
A1     6       7.5   NaN NaN NaN NaN NaN
A1     4       23.8  3.0 3.5 NaN NaN NaN
A2     7       11.9  5.0 NaN 3.4 NaN NaN 
A3    11       0.8   NaN 5.0 3.4 5.0 3.4

Use:

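# Extract (category, numerator, denominator) for every category in each entry
d = df['D'].str.extractall(r'(D\d+).*?([\d.]+)\s([\d.]+)(?:,|\>)')
d = d.droplevel(1).set_index(0, append=True).astype(float)
# Divide, round, pivot the categories into columns, then join back onto df
d = df.join(d[1].div(d[2]).round(1).unstack()).drop(columns='D')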

Details:

Use Series.str.extractall to extract all the capture groups from column D as specified by the regex pattern. You can test the regex pattern here.
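To see what the pattern captures without involving pandas, it can be tried on one entry with the standard `re` module. This is only an illustration on the second row of the sample data; the variable names are made up for the example.

```python
import re

# Same pattern as in the answer: the category name, then (lazily skipping
# the earlier numbers) the last two numbers before a ',' or '>'.
PATTERN = re.compile(r'(D\d+).*?([\d.]+)\s([\d.]+)(?:,|\>)')

entry = "<D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>"
matches = PATTERN.findall(entry)
print(matches)
# [('D1', '12', '4'), ('D2', '3.5', '1')]
```

The lazy `.*?` skips as little as possible, but the trailing `(?:,|\>)` forces the two captured numbers to be the ones immediately before the category's delimiter.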

print(d)
          0     1  2 # --> capture groups
  match             
1 0      D1    12  4
  1      D2   3.5  1
2 0      D1    10  2
  1      D3  13.5  4
3 0      D2    10  2
  1      D3  13.5  4
  2      D4    10  2
  3      D5  13.5  4

Use DataFrame.droplevel to drop the unused match level, then set_index with the optional parameter append=True to append the category column (0) to the index as a new level.

print(d)
         1    2
  0            
1 D1  12.0  4.0
  D2   3.5  1.0
2 D1  10.0  2.0
  D3  13.5  4.0
3 D2  10.0  2.0
  D3  13.5  4.0
  D4  10.0  2.0
  D5  13.5  4.0
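The index reshaping in this step can be followed on a one-row Series. This is a small sketch using the second row of the sample data; the names `s` and `d` are chosen only for the example.

```python
import pandas as pd

# One entry, kept at row label 1 to mirror the sample DataFrame.
s = pd.Series(["<D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>"], index=[1])

d = s.str.extractall(r'(D\d+).*?([\d.]+)\s([\d.]+)(?:,|\>)')
# extractall returns a MultiIndex of (original row label, match number).
d = d.droplevel(1)                   # drop the match-number level
d = d.set_index(0, append=True)      # move the category name into the index
d = d.astype(float)                  # remaining columns 1 and 2 become numeric

print(d.index.tolist())
# [(1, 'D1'), (1, 'D2')]
```

With the category in the index, a later unstack can turn each category into its own column.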

Use Series.div to divide column 1 by column 2, Series.round to round the values, and Series.unstack to reshape the result into one column per category. Then use DataFrame.join to join the new dataframe with df, and drop the original column D.

print(d)
    A   B     C   D1   D2   D3   D4   D5
0  A1   6   7.5  NaN  NaN  NaN  NaN  NaN
1  A1   4  23.8  3.0  3.5  NaN  NaN  NaN
2  A2   7  11.9  5.0  NaN  3.4  NaN  NaN
3  A3  11   0.8  NaN  5.0  3.4  5.0  3.4
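For anyone who wants to reproduce the result, the steps can be run end to end on the sample data. This is a sketch that rebuilds the frame from the question by hand; the intermediate names `parts`, `ratios` and `result` are illustrative, and `drop(columns='D')` is used because recent pandas versions no longer accept the axis as a positional argument.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the first row has no D entry.
df = pd.DataFrame({
    "A": ["A1", "A1", "A2", "A3"],
    "B": [6, 4, 7, 11],
    "C": [7.5, 23.8, 11.9, 0.8],
    "D": [
        np.nan,
        "<D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>",
        "<D1 2.0 7.5 10 2, D3 7.5 4.2 13.5 4>",
        "<D2 2.0 7.5 10 2, D3 7.5 4.2 13.5 4, D4 2.0 7.5 10 2, D5 7.5 4.2 13.5 4>",
    ],
})

# Step 1: one row per (original row, category) with the two operands.
parts = df["D"].str.extractall(r'(D\d+).*?([\d.]+)\s([\d.]+)(?:,|\>)')
parts = parts.droplevel(1).set_index(0, append=True).astype(float)

# Step 2: divide, round, and pivot the categories into columns.
ratios = parts[1].div(parts[2]).round(1).unstack()

# Step 3: join on the row index; rows without matches get NaN.
result = df.join(ratios).drop(columns="D")
print(result)
```

Because the join is on the row index, rows whose D entry is missing or has no matches keep their place and receive NaN in every category column.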
