Splitting list items into separate columns - pandas data-frame
I have an initial pandas data-frame that looks like this, where each cell is a list of values (image: initial input).
Python script to get the initial dataframe, as mentioned by Ian Thompson in this answer:
import pandas as pd

df_out1 = pd.DataFrame({
    0: [
        [None, 'A', 'B', 'C', 'D'],
        [None, 'A1', 'B1', 'C1', 'D1'],
        [None, 'A2', 'B2', 'C2', 'D2'],
    ],
    1: [
        [None] * 5,
        [None] * 5,
        [None] * 5,
    ],
    2: [
        ['V', 'W', 'X', 'Y', 'Z'],
        ['V1', 'W1', 'X1', 'Y1', 'Z1'],
        ['V2', 'W2', 'X2', 'Y2', 'Z2'],
    ]
})
I want to format it like this: for each row, every item of a list forms a column, and this is repeated for all the repetitions/iterations (image: desired output).
My original input data-set is huge: 10,000 rows and 40 columns. I am executing the Python script below. Although it works and provides the desired output, when I run it for 2000 rows and 40 columns the run time is close to 1800 seconds, which I think is on the higher side.
Python script (df_out1 is the initial data-frame):
import time

d = pd.DataFrame()
for x in range(len(df_out1)):
    for y in range(len(df_out1.columns)):
        # DataFrame.append was removed in pandas 2.0, so this line needs an older pandas
        d = d.append(pd.Series(df_out1[y][x]), ignore_index=True)
d.to_csv('inter_alm_output_' + str(time.strftime("%Y%m%d-%H%M%S")) + '.csv')
Is there a way to achieve this in less time, in other words, to optimize it?
If this is your starting dataframe:
df = pd.DataFrame({
    0 : [
        [None, 'A', 'B', 'C', 'D'],
        [None, 'A1', 'B1', 'C1', 'D1'],
        [None, 'A2', 'B2', 'C2', 'D2'],
    ],
    1 : [
        [None]*5,
        [None]*5,
        [None]*5,
    ],
    2 : [
        ['V', 'W', 'X', 'Y', 'Z'],
        ['V1', 'W1', 'X1', 'Y1', 'Z1'],
        ['V2', 'W2', 'X2', 'Y2', 'Z2'],
    ]
})
You can reformat the columns by applying pd.Series and concatenating the results.
print(pd.concat([
    df[i].apply(pd.Series) for i in df.columns
]).sort_index().reset_index(drop=True))
      0     1     2     3     4
0  None     A     B     C     D
1  None  None  None  None  None
2     V     W     X     Y     Z
3  None    A1    B1    C1    D1
4  None  None  None  None  None
5    V1    W1    X1    Y1    Z1
6  None    A2    B2    C2    D2
7  None  None  None  None  None
8    V2    W2    X2    Y2    Z2
Another method without using pd.concat:
print(df.stack().reset_index(drop=True).apply(pd.Series))
      0     1     2     3     4
0  None     A     B     C     D
1  None  None  None  None  None
2     V     W     X     Y     Z
3  None    A1    B1    C1    D1
4  None  None  None  None  None
5    V1    W1    X1    Y1    Z1
6  None    A2    B2    C2    D2
7  None  None  None  None  None
8    V2    W2    X2    Y2    Z2
The first method completes in
3.93 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The second method completes in
2.34 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Your original code completes in
15 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
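A likely reason the loop in the question scales so badly is that DataFrame.append returned a new, copied frame on every call, so the work grows with the size of the frame built so far (it was also removed in pandas 2.0). Below is a minimal sketch of the same row-major traversal that gathers the cells in a plain Python list and builds the frame once. The two-row df_out1 here is only a shortened stand-in for the question's data, and this has not been timed against the 2000 x 40 input.

```python
import pandas as pd

# Shortened stand-in for the question's df_out1 (assumption: same layout, fewer rows)
df_out1 = pd.DataFrame({
    0: [[None, 'A', 'B', 'C', 'D'], [None, 'A1', 'B1', 'C1', 'D1']],
    1: [[None] * 5, [None] * 5],
    2: [['V', 'W', 'X', 'Y', 'Z'], ['V1', 'W1', 'X1', 'Y1', 'Z1']],
})

# Same order as the original nested loop (row by row, then column by column),
# but each cell's list is collected instead of being appended to a DataFrame.
rows = [
    df_out1.iat[x, y]
    for x in range(len(df_out1))
    for y in range(len(df_out1.columns))
]

# Build the result once; each inner list becomes one output row.
d = pd.DataFrame(rows)
print(d)
```

From here, d.to_csv(...) can be called exactly as in the question.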
IIUC, you can get your desired results with this.
Input
                    group                     count                 value
0      [None, A, B, C, D]  [None, None, None, None]       [v, w, x, y, z]
1  [None, A1, B1, C1, D1]  [None, None, None, None]  [v1, w1, x1, y1, z1]
2  [None, A2, B2, C2, D2]  [None, None, None, None]  [v2, w2, x2, y2, z2]
Code
df1 = df.stack().droplevel(1).reset_index(name='col').drop('index',axis=1)
pd.DataFrame(df1['col'].values.tolist(), columns=['M','N','O','P','Q'])
Output
      M     N     O     P     Q
0  None     A     B     C     D
1  None  None  None  None  None
2     v     w     x     y     z
3  None    A1    B1    C1    D1
4  None  None  None  None  None
5    v1    w1    x1    y1    z1
6  None    A2    B2    C2    D2
7  None  None  None  None  None
8    v2    w2    x2    y2    z2
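The timings quoted on this page are for the three-row example only. To see how the approaches behave on something closer to the real data, a rough benchmark sketch like the one below may help. make_frame is a hypothetical helper that fabricates a frame of list cells, and the sizes are arbitrary assumptions, so treat the printed numbers as indicative rather than definitive.

```python
import timeit
import pandas as pd

def make_frame(n_rows=200, n_cols=10, list_len=5):
    """Hypothetical helper: build a frame whose cells are lists of strings."""
    return pd.DataFrame({
        c: [[f"r{r}c{c}i{i}" for i in range(list_len)] for r in range(n_rows)]
        for c in range(n_cols)
    })

def via_apply(df):
    # stack + apply(pd.Series): constructs one Series per cell
    return df.stack().reset_index(drop=True).apply(pd.Series)

def via_tolist(df):
    # stack + tolist: hands all the lists to the DataFrame constructor at once
    return pd.DataFrame(df.stack().tolist())

big = make_frame()
for fn in (via_apply, via_tolist):
    seconds = timeit.timeit(lambda: fn(big), number=3) / 3
    print(f"{fn.__name__}: {seconds:.4f} s per call")
```

Raising n_rows and n_cols towards the real 10,000 x 40 shape shows how each approach grows with the input.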