Duplicating rows in a DataFrame based on column value
Below is a set of sample data I am working with:
import numpy as np
import pandas as pd

sample_dat = pd.DataFrame(
    np.array([[1,0,1,1,1,5],
              [0,0,0,0,1,3],
              [1,0,0,0,1,1],
              [1,0,0,1,1,1],
              [1,0,0,0,1,1],
              [1,1,0,0,1,1]]),
    columns=['var1','var2','var3','var4','var5','cnt']
)
I need to change the data so that rows are duplicated according to the value in the last column. Specifically, each row should be repeated based on the value in the cnt column.
My search turned up a lot about melts, splits, and similar operations, but I think what I am looking for is more basic. Please also note that I will likely have some kind of id in the first column, which will be either an integer or a string.
For example, the first record will be duplicated 4 more times. The second record will be duplicated twice more.
Below is an example of what the DataFrame would look like if I built it by hand:
sample_dat2 = pd.DataFrame(
    np.array([[1,0,1,1,1,5],
              [1,0,1,1,1,5],
              [1,0,1,1,1,5],
              [1,0,1,1,1,5],
              [1,0,1,1,1,5],
              [0,0,0,0,1,3],
              [0,0,0,0,1,3],
              [0,0,0,0,1,3],
              [1,0,0,0,1,1],
              [1,0,0,1,1,1],
              [1,0,0,0,1,1],
              [1,1,0,0,1,1]]),
    columns=['var1','var2','var3','var4','var5','cnt']
)
Iterate over your data and collect each row x times, where x is the number in the 'cnt' column, then build a new DataFrame from the collected rows. Appending to a DataFrame inside the loop with DataFrame.append no longer works, because that method was removed in pandas 2.0.
rows = []
for index, row in sample_dat.iterrows():
    # repeat this row as many times as its 'cnt' value says
    for _ in range(int(row['cnt'])):
        rows.append(row)
# the collected rows keep their original labels, so renumber them
df = pd.DataFrame(rows).reset_index(drop=True)
>>> df
    var1  var2  var3  var4  var5  cnt
0      1     0     1     1     1    5
1      1     0     1     1     1    5
2      1     0     1     1     1    5
3      1     0     1     1     1    5
4      1     0     1     1     1    5
5      0     0     0     0     1    3
6      0     0     0     0     1    3
7      0     0     0     0     1    3
8      1     0     0     0     1    1
9      1     0     0     1     1    1
10     1     0     0     0     1    1
11     1     1     0     0     1    1
I would use numpy.repeat on the DataFrame's index values, select the repeated labels with .loc, and then reset the index.
sample_dat.loc[np.repeat(sample_dat.index.values, sample_dat.cnt)].reset_index(drop=True)
Result:
    var1  var2  var3  var4  var5  cnt
0      1     0     1     1     1    5
1      1     0     1     1     1    5
2      1     0     1     1     1    5
3      1     0     1     1     1    5
4      1     0     1     1     1    5
5      0     0     0     0     1    3
6      0     0     0     0     1    3
7      0     0     0     0     1    3
8      1     0     0     0     1    1
9      1     0     0     1     1    1
10     1     0     0     0     1    1
11     1     1     0     0     1    1
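The same idea can be written with pandas alone, since Index.repeat accepts a per-label repeat count. The following is a minimal sketch that rebuilds the question's sample data and adds a row-count check; the variable name expanded is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
sample_dat = pd.DataFrame(
    np.array([[1, 0, 1, 1, 1, 5],
              [0, 0, 0, 0, 1, 3],
              [1, 0, 0, 0, 1, 1],
              [1, 0, 0, 1, 1, 1],
              [1, 0, 0, 0, 1, 1],
              [1, 1, 0, 0, 1, 1]]),
    columns=['var1', 'var2', 'var3', 'var4', 'var5', 'cnt']
)

# Index.repeat repeats each index label cnt times; .loc selects those rows
expanded = sample_dat.loc[sample_dat.index.repeat(sample_dat['cnt'])]

# Renumber the rows 0..n-1 instead of keeping the repeated labels
expanded = expanded.reset_index(drop=True)

# Sanity check: the row count should equal the sum of the original counts
assert len(expanded) == sample_dat['cnt'].sum()
print(expanded)
```

Because this selects rows by label rather than rebuilding the frame from an array, each column keeps its own dtype, which should matter once a string id column is present.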
You can use numpy.repeat directly on the underlying array, indexing the array to pull out the column that holds the number of repetitions for each row.
import numpy as np
import pandas as pd

arr = np.array(
    [[1,0,1,1,1,5],
     [0,0,0,0,1,3],
     [1,0,0,0,1,1],
     [1,0,0,1,1,1],
     [1,0,0,0,1,1],
     [1,1,0,0,1,1]]
)

# repeat each row of arr as many times as its last column (cnt) says
df = pd.DataFrame(
    np.repeat(arr, arr[:,5], axis=0),
    columns=['var1','var2','var3','var4','var5','cnt']
)

print(df)
#     var1  var2  var3  var4  var5  cnt
# 0      1     0     1     1     1    5
# 1      1     0     1     1     1    5
# 2      1     0     1     1     1    5
# 3      1     0     1     1     1    5
# 4      1     0     1     1     1    5
# 5      0     0     0     0     1    3
# 6      0     0     0     0     1    3
# 7      0     0     0     0     1    3
# 8      1     0     0     0     1    1
# 9      1     0     0     1     1    1
# 10     1     0     0     0     1    1
# 11     1     1     0     0     1    1
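The question mentions that the real data will likely have an integer or string id in the first column. Here is a rough sketch of how the array-based approach might be adapted in that case. The id column and its values are hypothetical, made up purely for illustration. With mixed column types the array becomes object dtype, so the sketch casts each column back to its original dtype after rebuilding the frame.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the string 'id' column is illustrative, not from the original post
df = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'var1': [1, 0, 1],
    'cnt': [3, 1, 2],
})

# With mixed column types, to_numpy() yields a single object-dtype array
arr = df.to_numpy()

# Repeat each row according to its cnt value
repeated = np.repeat(arr, df['cnt'].to_numpy(), axis=0)

# Rebuild the frame and cast each column back to its original dtype
expanded = pd.DataFrame(repeated, columns=df.columns).astype(df.dtypes.to_dict())
print(expanded)
```

If the round trip through an object array is unwanted, selecting repeated index labels with .loc avoids it, since that never leaves pandas.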