
Data augmentation with pandas

I'm doing some data augmentation on my data.

Basically, it looks like this:

country.   size.   price.   product
CA.   1.   3.99.   12
US.   1.   2.99.   12
BR.   1.   10.99.  13

Because the size is fixed at 1, I want to add 3 more sizes per country and per product, and scale the price accordingly: the price for size 2 is the size-1 price times 2, the price for size 3 is the size-1 price times 3, and so on.

So basically, I'm looking for this:

country.   size.   price.   product
CA.   1.   3.99.   12
CA.   2.   7.98.   12
CA.   3.   11.97.   12
CA.   4.   15.96.   12
US.   1.   2.99.   12
US.   2.   5.98.   12
US.   3.   8.97.   12
US.   4.   11.96.   12
BR.   1.   10.99.  13
BR.   2.   21.98.  13
BR.   3.   32.97.  13
BR.   4.   43.96.  13

What is a good way to do this with pandas? I tried doing it in a loop with iterrows(), but that was too slow for my data. Am I missing something?

Use Index.repeat to add the new rows, then build the size counter with GroupBy.cumcount and the scaled price with GroupBy.cumsum, and finally reset the index to get a default unique one:

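# Repeat each row 4 times; the repeated index labels identify the original row
df = df.loc[df.index.repeat(4)]
# Number the copies 1..4 within each original row
df['size'] = df.groupby(level=0).cumcount().add(1)
# Running total of the base price gives price * size
df['price'] = df.groupby(level=0)['price'].cumsum()
# Replace the repeated labels with a fresh 0..n-1 index
df = df.reset_index(drop=True)
print(df)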
   country  size  price  product
0       CA     1   3.99       12
1       CA     2   7.98       12
2       CA     3  11.97       12
3       CA     4  15.96       12
4       US     1   2.99       12
5       US     2   5.98       12
6       US     3   8.97       12
7       US     4  11.96       12
8       BR     1  10.99       13
9       BR     2  21.98       13
10      BR     3  32.97       13
11      BR     4  43.96       13
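
For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the question with undotted column names, and the helper name expand_sizes and its n_sizes parameter are assumptions made for illustration, not part of the answer. The rounding to 2 decimals is also an addition, meant to hide floating-point noise from the running sum.

```python
import pandas as pd

# Sample frame rebuilt from the question (column names assumed without the dots)
df = pd.DataFrame({
    "country": ["CA", "US", "BR"],
    "size": [1, 1, 1],
    "price": [3.99, 2.99, 10.99],
    "product": [12, 12, 13],
})


def expand_sizes(frame, n_sizes=4):
    """Hypothetical helper: repeat each row n_sizes times with sizes 1..n_sizes."""
    out = frame.loc[frame.index.repeat(n_sizes)].copy()
    # Repeated index labels mark which original row each copy came from
    counter = out.groupby(level=0).cumcount() + 1
    running = out.groupby(level=0)["price"].cumsum().round(2)
    out["size"] = counter
    out["price"] = running
    return out.reset_index(drop=True)


result = expand_sizes(df)
print(result)
```

The original df is left untouched because the helper works on a copy, so it can be called again with a different n_sizes.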

Another idea without cumcount, but with numpy.tile:

import numpy as np

add = 3  # number of extra sizes per row
df1 = df.loc[df.index.repeat(add + 1)]
# Sizes 1..add+1, tiled once per original row
df1['size'] = np.tile(range(1, add + 2), len(df))
df1['price'] = df1.groupby(level=0)['price'].cumsum()
df1 = df1.reset_index(drop=True)
print(df1)
   country  size  price  product
0       CA     1   3.99       12
1       CA     2   7.98       12
2       CA     3  11.97       12
3       CA     4  15.96       12
4       US     1   2.99       12
5       US     2   5.98       12
6       US     3   8.97       12
7       US     4  11.96       12
8       BR     1  10.99       13
9       BR     2  21.98       13
10      BR     3  32.97       13
11      BR     4  43.96       13

Construct 2 columns using assign and lambda:

s = np.tile(np.arange(4), df.shape[0])
df_final = df.loc[df.index.repeat(4)].assign(size=lambda x: x['size'] + s, 
                                             price=lambda x: x['price'] * (s+1))

Out[90]:
  country  size  price  product
0      CA   1.0   3.99       12
0      CA   2.0   7.98       12
0      CA   3.0  11.97       12
0      CA   4.0  15.96       12
1      US   1.0   2.99       12
1      US   2.0   5.98       12
1      US   3.0   8.97       12
1      US   4.0  11.96       12
2      BR   1.0  10.99       13
2      BR   2.0  21.98       13
2      BR   3.0  32.97       13
2      BR   4.0  43.96       13
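
Note that the output above keeps the repeated original index (0, 0, 0, 0, 1, ...). A possible variant, sketched here under the assumption that size is always 1 in the input, sets the sizes directly from the tiled factors and resets the index at the end. The variable names n_sizes and factors and the rounding step are additions for illustration, not part of the answer.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (column names assumed without the dots)
df = pd.DataFrame({
    "country": ["CA", "US", "BR"],
    "size": [1, 1, 1],
    "price": [3.99, 2.99, 10.99],
    "product": [12, 12, 13],
})

n_sizes = 4  # assumed number of sizes per row
# Multipliers 1..n_sizes, repeated once per original row
factors = np.tile(np.arange(1, n_sizes + 1), len(df))

df_final = (
    df.loc[df.index.repeat(n_sizes)]
      .assign(size=factors,
              price=lambda x: (x["price"] * factors).round(2))
      .reset_index(drop=True)
)
print(df_final)
```

Multiplying by the factor, rather than accumulating a running sum, keeps each price a single multiplication away from the base price.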

Since the size is always 1, you basically only need to multiply size and price by a constant factor. You can do this directly, write the result into a separate DataFrame, and then use pd.concat to join them together:

In [20]: df2 = pd.concat([df[['country.', 'product']], df[['size.', 'price.']] * 2], axis=1)

In [21]: pd.concat([df, df2])
Out[21]:
  country.  size.  price.  product
0      CA.    1.0    3.99       12
1      US.    1.0    2.99       12
2      BR.    1.0   10.99       13
0      CA.    2.0    7.98       12
1      US.    2.0    5.98       12
2      BR.    2.0   21.98       13

To augment some more, simply loop over all desired sizes:

In [22]: list_of_dfs = []

In [23]: list_of_dfs.append(df)

In [24]: for size in range(2,5):
    ...:     list_of_dfs.append(pd.concat([df[['country.', 'product']], df[['size.', 'price.']] * size], axis=1))
    ...:

In [25]: pd.concat(list_of_dfs)
Out[25]:
  country.  size.  price.  product
0      CA.    1.0    3.99       12
1      US.    1.0    2.99       12
2      BR.    1.0   10.99       13
0      CA.    2.0    7.98       12
1      US.    2.0    5.98       12
2      BR.    2.0   21.98       13
0      CA.    3.0   11.97       12
1      US.    3.0    8.97       12
2      BR.    3.0   32.97       13
0      CA.    4.0   15.96       12
1      US.    4.0   11.96       12
2      BR.    4.0   43.96       13

This is a relatively naive approach, but should work fine in your case and makes good use of vectorization under the hood.
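
The concatenated result above is grouped by size and keeps the repeated index 0, 1, 2. If the row order from the question (all sizes for CA, then US, then BR) matters, one possible follow-up is a stable sort on the index followed by a reset. This is only a sketch: it assumes undotted column names, and the names frames, stacked and ordered, as well as the rounding step, are illustrative additions.

```python
import pandas as pd

# Sample frame rebuilt from the question (column names assumed without the dots)
df = pd.DataFrame({
    "country": ["CA", "US", "BR"],
    "size": [1, 1, 1],
    "price": [3.99, 2.99, 10.99],
    "product": [12, 12, 13],
})

# One scaled copy of the frame per size factor 1..4
frames = [
    df.assign(size=df["size"] * k, price=(df["price"] * k).round(2))
    for k in range(1, 5)
]
stacked = pd.concat(frames)  # index is 0, 1, 2 repeated four times

# mergesort is stable, so rows sharing an index label keep their size order
ordered = stacked.sort_index(kind="mergesort").reset_index(drop=True)
print(ordered)
```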

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
