Expand integer ranges in pandas DataFrame column
I have a dataframe that looks like:
import pandas as pd

d = {'value': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
     'id': ['0101', '0208', '0103', '0405', '0105,0116,0117',
            '0108-0110', '0231, 0232, 0133-0150', '0155, 0152-0154, 0151']}
df = pd.DataFrame(d)
>>> df
  value                     id
0     a                   0101
1     b                   0208
2     c                   0103
3     d                   0405
4     e         0105,0116,0117
5     f              0108-0110
6     g  0231, 0232, 0133-0150
7     h  0155, 0152-0154, 0151
but I need to expand these IDs so that each row holds a single ID, so it looks more like:
value id
0 a 0101
1 b 0208
2 c 0103
3 d 0405
4 e 0105
5 e 0116
6 e 0117
7 f 0108
8 f 0109
9 f 0110
10 g ...
where each row is repeated once for every ID it originally grouped (with the ranges expanded, and leading zeros preserved for IDs less than 4 digits).
I've got as far as
df['id'].str.split(",")
df['id'].str.contains("-")
but I can't think of a good way to do this. Can anyone help?
You can write a little routine to flatten your ranges, and then repeat values from the original as necessary. Note that this version converts the IDs to integers, so leading zeros are not kept.
from itertools import chain

flattened = []
for x in df['id'].str.split(r',\s*'):
    # One list of expanded IDs per original row.
    flattened.append([])
    for y in x:
        if '-' in y:
            # A range such as '0108-0110': include both endpoints.
            start, end = pd.to_numeric(y.split('-'))
            flattened[-1].extend(pd.RangeIndex(start, end + 1))
        else:
            flattened[-1].append(int(y))

# Repeat each 'value' once per ID expanded from its row.
repeats = [len(f) for f in flattened]
df_flat = pd.DataFrame({
    'value': df.value.repeat(repeats).values,
    'id': list(chain.from_iterable(flattened))})
df_flat.tail(10)
value id
25 g 146
26 g 147
27 g 148
28 g 149
29 g 150
30 h 155
31 h 152
32 h 153
33 h 154
34 h 151
This turns out to be pretty performant, even for larger data (8,000 rows in the timing below):
df_ = df
df = pd.concat([df_] * 1000, ignore_index=True)
%timeit flatten(df) # Function running code above.
244 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
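The flattened IDs above are integers, so 0146 prints as 146, while the question asks for leading zeros to be kept. One possible adjustment, sketched here under the assumption that every ID should be padded to 4 digits, is to keep the IDs as strings and pad them with str.zfill. The names expand_ids, width, sample and sample_flat are hypothetical and not part of the original answer.

```python
import pandas as pd

def expand_ids(id_string, width=4):
    """Expand e.g. '0155, 0152-0154' into a list of zero-padded ID strings.

    `width` is an assumed fixed ID length (4 digits in the question's data).
    """
    expanded = []
    for token in id_string.split(','):
        token = token.strip()
        if '-' in token:
            # A range: generate every integer from start to end inclusive.
            start, end = (int(part) for part in token.split('-'))
            expanded.extend(str(n).zfill(width) for n in range(start, end + 1))
        else:
            expanded.append(token.zfill(width))
    return expanded

# A small sample in the same shape as the question's data.
sample = pd.DataFrame({
    'value': ['e', 'f', 'h'],
    'id': ['0105,0116,0117', '0108-0110', '0155, 0152-0154, 0151'],
})

id_lists = [expand_ids(s) for s in sample['id']]
repeats = [len(ids) for ids in id_lists]

# Repeat each 'value' once per expanded ID, then flatten the ID lists.
sample_flat = pd.DataFrame({
    'value': sample['value'].repeat(repeats).to_numpy(),
    'id': [i for ids in id_lists for i in ids],
})
print(sample_flat)
```

Because the IDs stay as strings, the 'id' column has object dtype. Convert it with astype(int) only if the padding is no longer needed.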
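Here's a way to do it with string splitting alone. Note that it splits each range at the hyphen and keeps only the two endpoints, so the values in between are not filled in: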
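# Split on a comma (plus any following spaces) or a hyphen.
# Using r",\s*" instead of "[, ]" avoids empty strings for ", " separators.
s = (df['id'].str.split(r",\s*|-")
     .apply(pd.Series)
     .stack()
     .reset_index(level=1, drop=True))
df.drop('id', axis=1).join(s.to_frame()).reset_index(drop=True)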
value 0
0 a 0101
1 b 0208
2 c 0103
3 d 0405
4 e 0105
5 e 0116
6 e 0117
7 f 0108
8 f 0109
9 f 0110
10 g 0231
11 g 0232
12 g 0133
13 g 0150
14 h 0155
15 h 0152
16 h 0154
17 h 0151
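The output above lists only the endpoints of each range (for example 0133 and 0150). If the intermediate IDs are needed as well, one possible variation is to split on commas only, expand each range token into a list, and let DataFrame.explode (available in pandas 0.25 and later) produce one row per ID. This is a sketch rather than the answer's own method. The helper expand_token and the frames sample, tokens and result are hypothetical names, and the padding width is assumed to match the length of the range's start value.

```python
import pandas as pd

def expand_token(token):
    """Return ['0108', '0109', '0110'] for '0108-0110', or [token] otherwise."""
    token = token.strip()
    if '-' not in token:
        return [token]
    start, end = token.split('-')
    width = len(start)  # assumption: pad to the width of the range start
    return [str(n).zfill(width) for n in range(int(start), int(end) + 1)]

# A small sample in the same shape as the question's data.
sample = pd.DataFrame({
    'value': ['f', 'g'],
    'id': ['0108-0110', '0231, 0232, 0133-0135'],
})

# Step 1: one row per comma-separated token.
tokens = (sample.assign(id=sample['id'].str.split(r',\s*'))
                .explode('id')
                .reset_index(drop=True))

# Step 2: turn each token into a list of IDs, then one row per ID.
tokens['id'] = tokens['id'].map(expand_token)
result = tokens.explode('id').reset_index(drop=True)
print(result)
```

The IDs remain strings throughout, so leading zeros survive. The second explode is what fills in the values between the endpoints of each range.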