Split one pandas text column into multiple columns
For example, I have one pandas column containing:
text
A1V2
B2C7Z1
I want to split it into 26 columns (A-Z), each holding the value that follows that letter, or -1 if the letter is missing.
So the result would be:
text A B C D ... Z
A1V2 1 -1 -1 -1 ... -1
B2C7Z1 -1 2 7 -1 ... 1
Is there a faster way than using df.apply()?
Follow-up: Thanks to Psidom for the brilliant answer. When I ran the method on 4 million rows, it took 1 hour. I hope there is another way to make it faster. It seems str.extractall() is the most time-consuming step.
Try str.extractall with the regex (?P<key>[A-Z])(?P<value>[0-9]+), which extracts the key ([A-Z]) and the value ([0-9]+) into separate columns; a long-to-wide transform should then get you there.
Here the regex (?P<key>[A-Z])(?P<value>[0-9]+) matches a letter followed by digits, and the two named capture groups (the ?P<> syntax) go into two separate columns of the result, named key and value.
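To make the intermediate result concrete, here is a minimal sketch of what extractall returns on the two sample rows from the question. The DataFrame construction and the name long_form are assumptions added for illustration.

```python
import pandas as pd

# Sample data from the question (the construction itself is assumed).
df = pd.DataFrame({"text": ["A1V2", "B2C7Z1"]})

# One row per match; the named groups become the columns "key" and "value".
# The result is indexed by (original row label, "match" number).
long_form = df["text"].str.extractall(r"(?P<key>[A-Z])(?P<value>[0-9]+)")

print(long_form)        # five rows: two matches for row 0, three for row 1
print(long_form.shape)  # (5, 2)
```

Note that the extracted values are still strings at this point, not integers.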
And since extractall puts multiple matches into separate rows, you will need to transform the result to wide format with unstack on the key column:
(df.text.str.extractall("(?P<key>[A-Z])(?P<value>[0-9]+)")
.reset_index('match', drop=True)
.set_index('key', append=True)
.value.unstack('key').fillna(-1))
#key A B C V Z
# 0 1 -1 -1 2 -1
# 1 -1 2 7 -1 1
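The output above only has columns for letters that actually occur. To get all 26 columns the question asks for, one option is to reindex the wide frame. The following is a sketch under a few assumptions: the sample DataFrame is built by hand, each letter appears at most once per row (otherwise unstack raises on duplicate entries), and the names long_form, wide and result are illustrative.

```python
import string

import pandas as pd

df = pd.DataFrame({"text": ["A1V2", "B2C7Z1"]})

long_form = df["text"].str.extractall(r"(?P<key>[A-Z])(?P<value>[0-9]+)")
# Convert the extracted digits from strings to integers before reshaping.
long_form["value"] = long_form["value"].astype(int)

wide = (
    long_form
    .reset_index("match", drop=True)         # drop the per-row match counter
    .set_index("key", append=True)["value"]  # index by (row label, letter)
    .unstack("key")                          # long to wide: one column per letter seen
    # Add columns for letters that never occurred and rows that had no match.
    .reindex(index=df.index, columns=list(string.ascii_uppercase))
    .fillna(-1)
    .astype(int)
)

# Put the original text back next to the 26 letter columns.
result = df.join(wide)
print(result[["text", "A", "B", "C", "V", "Z"]])
```

Reindexing on df.index also restores rows whose text contained no letter-digit pair, which extractall would otherwise silently drop.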