Generate a pandas DataFrame based on another DataFrame

Question

I need to generate a dataframe based on another one. There are two steps, both driven by the input df.

The input df has 4 columns. The output should be built as follows:
1) Take the value from col1 and generate that many rows in the output, where column opt is copied from the input row, new_col equals f"{value_from_col0}_{loop_iterator_with_limit_from_col1}", and column src equals 'src1'.
2) Take the value from col2 and split it with | as the separator. For each split element, find the row in the input df whose col0 equals it, take that row's col1 value, and generate rows in the same way as in 1), with src equal to 'src2'.

import pandas as pd

df = pd.DataFrame([
    ['opt1', 'a', 2, ''],
    ['opt2', 'b', 1, ''],
    ['opt9', 'z', 3, 'a|b'],
    ['opt8', 'y', 3, 'a']],
  columns=['opt', 'col0', 'col1', 'col2'])
out = pd.DataFrame()
new_rows = []
for i, row in df.iterrows():
    for j in range(row['col1']):
        new_row = dict()
        new_row['opt'] = row['opt']
        new_row['new_col'] = f"{row['col0']}_{j+1}"
        new_row['src'] = 'src1'
        new_rows.append(new_row)
    for s in row['col2'].split('|'):
        if s:
            col1_value = df.loc[df['col0'] == s]['col1'].values[0]
            for k in range(col1_value):
                new_row = dict()
                new_row['opt'] = row['opt']
                new_row['new_col'] = f"{s}_{k + 1}"
                new_row['src'] = 'src2'
                new_rows.append(new_row)
out = pd.DataFrame(new_rows)  # DataFrame.append was removed in pandas 2.0

Below you can find the expected output. I used iterrows(), which is pretty slow. I believe there is a more efficient pandas way to achieve the same thing. The rows can come out in a different order; that doesn't matter.

   new_col   opt   src
0      a_1  opt1  src1
1      a_2  opt1  src1
2      b_1  opt2  src1
3      z_1  opt9  src1
4      z_2  opt9  src1
5      z_3  opt9  src1
6      a_1  opt9  src2
7      a_2  opt9  src2
8      b_1  opt9  src2
9      y_1  opt8  src1
10     y_2  opt8  src1
11     y_3  opt8  src1
12     a_1  opt8  src2
13     a_2  opt8  src2

Answer 1

This is one way to rely more on vectorized pandas functions. It needs pandas >= 0.25, where explode was introduced. It probably still has room for improvement, but it is noticeably faster than iterrows. The steps are:

Explode col2 into one row per split string;
Rename col2 to col0, merge back with df to look up col1, and append the result to the original df;
Use pandas or numpy repeat to repeat each row col1 times.

In code:

import numpy as np
import pandas as pd

df['col2'] = df['col2'].str.split('|', n=-1, expand=False)  # split the strings in col2 into lists
df['src'] = 'src1'  # label the original rows as src1

### Explode, change col names, merge and append.
# Parentheses replace the backslash continuations: a comment after "\" is a syntax error.
exploded = (
    df.explode('col2')[['opt', 'col2']]        # one row per split element
      .rename(columns={'col2': 'col0'})        # rename so it can be matched on col0
      .merge(df[['col0', 'col1']], on='col0')  # look up col1 of the referenced row
)
# Rows that came from col2 have no src yet, so fillna labels them 'src2'
df = pd.concat([exploded, df], axis=0, sort=False).fillna('src2')

### Expand based on col1 values
# np.repeat replaces pd.np.repeat; the pd.np alias no longer exists in current pandas
new_df = (
    pd.DataFrame(np.repeat(df.values, df['col1'].astype(int), axis=0), columns=df.columns)
      .drop(['col1', 'col2'], axis=1)
      .sort_values(['opt', 'col0'])
      .rename(columns={'col0': 'new_col'})
      .reset_index(drop=True)
)

### Relabel new_col to append the order
new_df['new_col'] = (
    new_df['new_col'] + '_'
    + (new_df.groupby(['opt', 'new_col']).cumcount() + 1).map(str)
)


Out[1]:
    opt   new_col   src
0   opt1    a_1     src1
1   opt1    a_2     src1
2   opt2    b_1     src1
3   opt8    a_1     src2
4   opt8    a_2     src2
5   opt8    y_1     src1
6   opt8    y_2     src1
7   opt8    y_3     src1
8   opt9    a_1     src2
9   opt9    a_2     src2
10  opt9    b_1     src2
11  opt9    z_1     src1
12  opt9    z_2     src1
13  opt9    z_3     src1
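
The last step in the list mentions pandas repeat as an alternative to numpy repeat. A minimal sketch of that variant, on a made-up two-row frame (the names toy, repeated and suffix are only for illustration), could look like this. It goes through Index.repeat and .loc, so the columns keep their dtypes instead of passing through an object array:

```python
import pandas as pd

# Toy frame: each row should appear `col1` times.
toy = pd.DataFrame({'opt': ['opt1', 'opt2'], 'col0': ['a', 'b'], 'col1': [2, 1]})

# Index.repeat builds an index with each label repeated col1 times;
# .loc then selects the rows in that order.
repeated = toy.loc[toy.index.repeat(toy['col1'])].reset_index(drop=True)

# Number the copies of each original row and append the counter to col0.
suffix = (repeated.groupby(['opt', 'col0']).cumcount() + 1).astype(str)
repeated['new_col'] = repeated['col0'] + '_' + suffix

print(repeated[['opt', 'new_col']])
```

On this toy input new_col comes out as a_1, a_2, b_1.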

If we test the efficiency against iterrows on a dataframe 100 times this size, we get:

df = pd.concat([df]*100, ignore_index=True)

%timeit generic(df) #using iterrows (your function)
#162 ms ± 722 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit generic1(df) #using the code above
#33 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
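
The timings above call generic and generic1, whose definitions are not shown. The sketch below is one guess at how they could be wrapped so the comparison can be rerun outside IPython; the function bodies, make_input and as_sorted_records are assumptions, not the original benchmark code. One deliberate difference: instead of concatenating 100 identical copies, which duplicates the col0 keys and makes the merge match every copy, the larger input here gives each copy its own keys so that both versions return the same rows.

```python
import timeit
import pandas as pd

def make_input(copies=1):
    """Build the example frame `copies` times, with unique keys per copy (assumption)."""
    rows = []
    for i in range(copies):
        rows += [[f'opt1_{i}', f'a{i}', 2, ''],
                 [f'opt2_{i}', f'b{i}', 1, ''],
                 [f'opt9_{i}', f'z{i}', 3, f'a{i}|b{i}'],
                 [f'opt8_{i}', f'y{i}', 3, f'a{i}']]
    return pd.DataFrame(rows, columns=['opt', 'col0', 'col1', 'col2'])

def generic(df):
    """Row-by-row version, following the iterrows loop from the question."""
    rows = []
    for _, row in df.iterrows():
        for j in range(row['col1']):
            rows.append({'opt': row['opt'], 'new_col': f"{row['col0']}_{j + 1}", 'src': 'src1'})
        for s in row['col2'].split('|'):
            if s:
                n = df.loc[df['col0'] == s, 'col1'].values[0]
                for k in range(n):
                    rows.append({'opt': row['opt'], 'new_col': f"{s}_{k + 1}", 'src': 'src2'})
    return pd.DataFrame(rows)

def generic1(df):
    """Vectorized version: explode, merge, repeat, number with cumcount."""
    df = df.copy()                                   # leave the caller's frame untouched
    df['col2'] = df['col2'].str.split('|')
    df['src'] = 'src1'
    exploded = (df.explode('col2')[['opt', 'col2']]
                  .rename(columns={'col2': 'col0'})
                  .merge(df[['col0', 'col1']], on='col0'))
    exploded['src'] = 'src2'
    both = pd.concat([exploded, df[['opt', 'col0', 'col1', 'src']]], ignore_index=True)
    out = both.loc[both.index.repeat(both['col1'])].reset_index(drop=True)
    n = out.groupby(['opt', 'col0', 'src']).cumcount() + 1
    out['new_col'] = out['col0'] + '_' + n.astype(str)
    return out[['opt', 'new_col', 'src']]

def as_sorted_records(frame):
    """Order-independent view of a result, for comparing the two versions."""
    return sorted(map(tuple, frame[['opt', 'new_col', 'src']].values))

big = make_input(100)
assert as_sorted_records(generic(big)) == as_sorted_records(generic1(big))

for fn in (generic, generic1):
    seconds = timeit.timeit(lambda: fn(big), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```

Absolute numbers will differ by machine and pandas version, so they should be read only as a relative comparison.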

Answer 1 (accepted, 2019-08-01 22:04:16)
