Generate pandas dataframe based on another dataframe
I need to generate a dataframe based on another one. There are two steps based on the input df.
The input df has 4 columns. The output should be built this way:
1) Take the value from col1 and generate that many rows in the output, where column opt is carried over, new_col equals f"{value_from_col0}_{loop_iterator_with_limit_from_col1}", and column src equals 'src1'.
2) Take the value from col2 and split it with | as a separator. For each split element, find it in the input df by col0 and generate rows in a similar way as in 1), using that row's col1 as the count. Here src equals 'src2'.
import pandas as pd

df = pd.DataFrame([
    ['opt1', 'a', 2, ''],
    ['opt2', 'b', 1, ''],
    ['opt9', 'z', 3, 'a|b'],
    ['opt8', 'y', 3, 'a']],
    columns=['opt', 'col0', 'col1', 'col2'])
out = pd.DataFrame()
new_rows = []
for i, row in df.iterrows():
    # Step 1: one output row per unit of col1, labelled 'src1'
    for j in range(row['col1']):
        new_row = dict()
        new_row['opt'] = row['opt']
        new_row['new_col'] = f"{row['col0']}_{j+1}"
        new_row['src'] = 'src1'
        new_rows.append(new_row)
    # Step 2: for each element of col2, look up its col1 and repeat, labelled 'src2'
    for s in row['col2'].split('|'):
        if s:
            col1_value = df.loc[df['col0'] == s]['col1'].values[0]
            for k in range(col1_value):
                new_row = dict()
                new_row['opt'] = row['opt']
                new_row['new_col'] = f"{s}_{k + 1}"
                new_row['src'] = 'src2'
                new_rows.append(new_row)
out = pd.DataFrame(new_rows)  # DataFrame.append was removed in pandas 2.0; build the frame directly
Below you can find the expected output. I used iterrows(), which is pretty slow. I believe there is a more efficient pandas way to achieve the same thing. Of course, the result can be sorted differently; the order doesn't matter.
   new_col   opt   src
0      a_1  opt1  src1
1      a_2  opt1  src1
2      b_1  opt2  src1
3      z_1  opt9  src1
4      z_2  opt9  src1
5      z_3  opt9  src1
6      a_1  opt9  src2
7      a_2  opt9  src2
8      b_1  opt9  src2
9      y_1  opt8  src1
10     y_2  opt8  src1
11     y_3  opt8  src1
12     a_1  opt8  src2
13     a_2  opt8  src2
This is one way to use more of pandas' vectorized functions, specifically in pandas==0.25 (the release that added DataFrame.explode). It probably still has room for improvement, but it shows some performance gain vs. using iterrows. The steps used are:
1) Explode col2 by the split strings;
2) Rename col2 to col0, merge back with df and append to the original df;
3) Use pandas or numpy repeat to repeat each row by the number in col1.
Below in code:
import numpy as np  # pd.np was removed in pandas 2.0, so import numpy directly

df['col2'] = df['col2'].str.split('|', n=-1, expand=False)  # split string in col2
df['src'] = 'src1'  # add src1 for original values

### Explode, change col names, merge and append.
df = pd.concat([
    (df.explode('col2')[['opt', 'col2']]          # expand col2
       .rename(columns={'col2': 'col0'})          # rename to col0
       .merge(df[['col0', 'col1']], on='col0')),  # merge to get new col1
    df], axis=0, sort=False).fillna('src2')       # label second val to 'src2'

### Expand based on col1 values
new_df = (
    pd.DataFrame(
        np.repeat(df.values, df['col1'], axis=0), columns=df.columns  # repeat the values
    )
    .drop(['col1', 'col2'], axis=1)
    .sort_values(['opt', 'col0'])
    .rename(columns={'col0': 'new_col'})
    .reset_index(drop=True)
)

### Relabel new_col to append the order
new_df['new_col'] = new_df['new_col'] + '_' + \
    (new_df.groupby(['opt', 'new_col']).cumcount() + 1).map(str)
Out[1]:
     opt new_col   src
0   opt1     a_1  src1
1   opt1     a_2  src1
2   opt2     b_1  src1
3   opt8     a_1  src2
4   opt8     a_2  src2
5   opt8     y_1  src1
6   opt8     y_2  src1
7   opt8     y_3  src1
8   opt9     a_1  src2
9   opt9     a_2  src2
10  opt9     b_1  src2
11  opt9     z_1  src1
12  opt9     z_2  src1
13  opt9     z_3  src1
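As a variation on the same three steps, here is a minimal sketch that stays within pandas: it repeats rows with Index.repeat instead of np.repeat on df.values, so column dtypes are kept, and it numbers the repeats per source row. The names expand_rows, refs, lookup and source_row are my own and only illustrative; the sketch also assumes pandas 0.25 or later (for explode) and that an empty col2 means "no references".

```python
import pandas as pd


def expand_rows(df):
    """Hypothetical helper: a vectorized take on the loop from the question."""
    # src1 part: every input row stands for itself
    own = df[['opt', 'col0', 'col1']].assign(src='src1')

    # src2 part: one row per element of col2 (step 1: explode)
    refs = (df[['opt']].assign(col0=df['col2'].str.split('|'))
              .explode('col0')
              .reset_index(drop=True))
    refs = refs[refs['col0'] != '']  # an empty col2 yields '' after the split

    # Step 2: look up col1 for each referenced col0.
    # Keeping the first row per key mirrors `.values[0]` in the loop version.
    lookup = df[['col0', 'col1']].drop_duplicates('col0')
    linked = refs.merge(lookup, on='col0').assign(src='src2')

    both = pd.concat([own, linked], ignore_index=True)

    # Step 3: repeat each row col1 times, remembering which row it came from
    out = (both.loc[both.index.repeat(both['col1'])]
               .rename_axis('source_row')
               .reset_index())

    # Number the copies of each source row 1..col1 and build new_col
    counter = out.groupby('source_row').cumcount() + 1
    out['new_col'] = out['col0'] + '_' + counter.astype(str)
    return out[['opt', 'new_col', 'src']]


df = pd.DataFrame([
    ['opt1', 'a', 2, ''],
    ['opt2', 'b', 1, ''],
    ['opt9', 'z', 3, 'a|b'],
    ['opt8', 'y', 3, 'a']],
    columns=['opt', 'col0', 'col1', 'col2'])

result = expand_rows(df)
print(result)
```

On the sample frame this should give the same 14 rows as the expected output in the question, though possibly in a different order. Numbering by source row rather than by (opt, new_col) is a design choice: it keeps the counter restarting at 1 for each repeated row even if the same value shows up more than once for one opt.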
If we test the efficiency vs. iterrows using 100 times this dataframe, we get the following:
df = pd.concat([df]*100, ignore_index=True)
%timeit generic(df) #using iterrows (your function)
#162 ms ± 722 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit generic1(df) #using the code above
#33 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
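One thing to keep in mind when reading these numbers: concatenating the frame 100 times duplicates every col0 value. If generic1 is exactly the code above, its merge on col0 pairs each reference with all 100 matching rows, whereas the loop version takes only the first match via .values[0], so the two functions may not return the same number of rows on this benchmark input. A small sketch of the effect (big, refs, many, lookup and first are illustrative names, not from the original code):

```python
import pandas as pd

base = pd.DataFrame([
    ['opt1', 'a', 2, ''],
    ['opt2', 'b', 1, ''],
    ['opt9', 'z', 3, 'a|b'],
    ['opt8', 'y', 3, 'a']],
    columns=['opt', 'col0', 'col1', 'col2'])
big = pd.concat([base] * 100, ignore_index=True)

# One row per (opt, referenced col0) pair taken from col2
refs = (big[['opt']].assign(col0=big['col2'].str.split('|'))
          .explode('col0')
          .reset_index(drop=True))
refs = refs[refs['col0'] != '']

# Merging on a non-unique key matches every reference with every duplicate
many = refs.merge(big[['col0', 'col1']], on='col0')

# Deduplicating the lookup side keeps one match per reference
lookup = big[['col0', 'col1']].drop_duplicates('col0')
first = refs.merge(lookup, on='col0')

print(len(refs), len(many), len(first))  # 300 30000 300
```

If the real data has unique col0 values this does not matter; otherwise, deduplicating the lookup side before the merge is one way to keep the two versions comparable.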