How to split variable sized string-based column into multiple columns in Pandas DataFrame?
I have a pandas DataFrame of the form:
A B C D
A1 6 7.5 NaN
A1 4 23.8 <D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>
A2 7 11.9 <D1 2.0 7.5 10 2, D3 7.5 4.2 13.5 4>
A3 11 0.8 <D2 2.0 7.5 10 2, D3 7.5 4.2 13.5 4, D4 2.0 7.5 10 2, D5 7.5 4.2 13.5 4>
Column D is a raw-string column with multiple categories in each entry. The value for each category is calculated by dividing the last two values listed for that category. For example, in the 2nd row:
D1 = 12/4 = 3
D2 = 3.5/1 = 3.5
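To pin down the arithmetic for a single entry, here is a minimal plain-Python sketch. The helper name `ratios` is hypothetical and is only meant to illustrate the calculation described above; it assumes every category lists at least two numbers and is not a vectorized approach.

```python
def ratios(entry):
    """Map each category in one raw D entry to (second-to-last / last) value."""
    result = {}
    # Strip the surrounding angle brackets, then split into per-category chunks.
    for chunk in entry.strip("<>").split(","):
        name, *numbers = chunk.split()
        # Only the last two numbers of each category enter the calculation.
        result[name] = float(numbers[-2]) / float(numbers[-1])
    return result


row2 = "<D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>"
print(ratios(row2))  # {'D1': 3.0, 'D2': 3.5}
```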
I need to split column D based on its categories and join them to my DataFrame. The problem is that the column is dynamic and can have nearly 35-40 categories within a single entry. For now, I am using a brute-force approach that iterates over all rows, which is very slow for large datasets. Can someone please help me?
EXPECTED OUTCOME
A B C D1 D2 D3 D4 D5
A1 6 7.5 NaN NaN NaN NaN NaN
A1 4 23.8 3.0 3.5 NaN NaN NaN
A2 7 11.9 5.0 NaN 3.4 NaN NaN
A3 11 0.8 NaN 5.0 3.4 5.0 3.4
Use:
d = df['D'].str.extractall(r'(D\d+).*?([\d.]+)\s([\d.]+)(?:,|\>)')
d = d.droplevel(1).set_index(0, append=True).astype(float)
d = df.join(d[1].div(d[2]).round(1).unstack()).drop(columns='D')
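For anyone who wants to run the three lines end to end, the following is a self-contained sketch. It rebuilds the sample frame from the question under the assumption that column D holds plain Python strings (with NaN for the empty entry). The names `pattern` and `out` are introduced here for illustration, and `(?:,|>)` is equivalent to the `(?:,|\>)` used in the answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question; D is assumed to hold plain strings (NaN when empty).
df = pd.DataFrame({
    "A": ["A1", "A1", "A2", "A3"],
    "B": [6, 4, 7, 11],
    "C": [7.5, 23.8, 11.9, 0.8],
    "D": [
        np.nan,
        "<D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>",
        "<D1 2.0 7.5 10 2, D3 7.5 4.2 13.5 4>",
        "<D2 2.0 7.5 10 2, D3 7.5 4.2 13.5 4, "
        "D4 2.0 7.5 10 2, D5 7.5 4.2 13.5 4>",
    ],
})

# Group 0: category name; groups 1 and 2: the last two numbers of that category.
pattern = r"(D\d+).*?([\d.]+)\s([\d.]+)(?:,|>)"

d = df["D"].str.extractall(pattern)
d = d.droplevel(1).set_index(0, append=True).astype(float)

# Divide, round, pivot the categories into columns, and attach them to df.
out = df.join(d[1].div(d[2]).round(1).unstack()).drop(columns="D")
print(out)
```

On the sample data this should reproduce the expected outcome table from the question. Rows whose D value is NaN simply produce no matches and therefore end up as NaN after the join.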
Details:
Use Series.str.extractall to extract all the capture groups from column D as specified by the regex pattern.
print(d)
          0     1  2  # --> capture groups
  match
1 0      D1    12  4
  1      D2   3.5  1
2 0      D1    10  2
  1      D3  13.5  4
3 0      D2    10  2
  1      D3  13.5  4
  2      D4    10  2
  3      D5  13.5  4
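To see what the pattern captures without pandas, it can be tried on a single entry with the standard `re` module. This is only an illustrative sketch; the variable names `entry` and `matches` are introduced here. The lazy `.*?` skips the leading numbers, so only the last two numbers before each `,` or `>` are captured.

```python
import re

# Same pattern as in the answer; (?:,|>) is equivalent to (?:,|\>).
pattern = r"(D\d+).*?([\d.]+)\s([\d.]+)(?:,|>)"

entry = "<D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>"

# Each tuple holds the three capture groups for one category.
matches = re.findall(pattern, entry)
print(matches)  # [('D1', '12', '4'), ('D2', '3.5', '1')]
```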
Use DataFrame.droplevel + set_index with the optional parameter append=True to drop the unused level and append a new index level to the dataframe.
print(d)
         1    2
  0
1 D1  12.0  4.0
  D2   3.5  1.0
2 D1  10.0  2.0
  D3  13.5  4.0
3 D2  10.0  2.0
  D3  13.5  4.0
  D4  10.0  2.0
  D5  13.5  4.0
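The index manipulation can be tried in isolation on a small hand-built frame. This is a sketch only: the frame below is a stand-in for the `extractall` result, not real output, and the three steps are split apart for readability.

```python
import pandas as pd

# Hand-built stand-in for the extractall result: index levels are (row, match).
idx = pd.MultiIndex.from_tuples([(1, 0), (1, 1), (2, 0)], names=[None, "match"])
d = pd.DataFrame(
    {0: ["D1", "D2", "D1"], 1: ["12", "3.5", "10"], 2: ["4", "1", "2"]},
    index=idx,
)

d = d.droplevel(1)               # drop the "match" counter, keep the row label
d = d.set_index(0, append=True)  # move the category name into the index
d = d.astype(float)              # remaining columns 1 and 2 become numeric

print(d.index.tolist())  # [(1, 'D1'), (1, 'D2'), (2, 'D1')]
```

Keeping the original row label as the first index level is what later allows the result to be joined back to `df` on its index.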
Use Series.div to divide column 1 by column 2, use Series.round to round the values, then use Series.unstack to reshape the dataframe, and finally use DataFrame.join to join the new dataframe with df.
print(d)
A B C D1 D2 D3 D4 D5
0 A1 6 7.5 NaN NaN NaN NaN NaN
1 A1 4 23.8 3.0 3.5 NaN NaN NaN
2 A2 7 11.9 5.0 NaN 3.4 NaN NaN
3 A3 11 0.8 NaN 5.0 3.4 5.0 3.4
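The reshaping step may be easier to follow on two small Series that share a (row, category) MultiIndex. The names `num`, `den` and `wide` are hypothetical and stand in for columns 1 and 2 of the intermediate frame.

```python
import pandas as pd

# (row label, category) pairs, as produced by the droplevel + set_index step.
idx = pd.MultiIndex.from_tuples([(1, "D1"), (1, "D2"), (2, "D1"), (2, "D3")])
num = pd.Series([12.0, 3.5, 10.0, 13.5], index=idx)
den = pd.Series([4.0, 1.0, 2.0, 4.0], index=idx)

# Element-wise division, rounding, then pivot the inner index level into columns.
wide = num.div(den).round(1).unstack()
print(wide)
```

Categories that are missing for a given row come out as NaN after `unstack`, which is why the final table contains NaN wherever an entry did not list that category.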