Randomly select rows from a dataframe column

Question

For a given dataframe column, I would like to randomly select roughly 60% of the values and put them in a new column, put the remaining 40% in another column, multiply the 40% column by -1, and create a new column that merges the two back together, like so:

import pandas as pd

# Starting data
dict0 = {'x1': [1,2,3,4,5,6]}
data = pd.DataFrame(dict0)

# Roughly 60% of x1 goes to x2, the remaining 40% to x3
dict1 = {'x1': [1,2,3,4,5,6],'x2': [1,'nan',3,'nan',5,6],'x3': ['nan',2,'nan',4,'nan','nan']}
data = pd.DataFrame(dict1)

# x3 multiplied by -1
dict2 = {'x1': [1,2,3,4,5,6],'x2': [1,'nan',3,'nan',5,6],'x3': ['nan',-2,'nan',-4,'nan','nan']}
data = pd.DataFrame(dict2)

# x4 merges x2 and x3 back together
dict3 = {'x1': [1,2,3,4,5,6],'x2': [1,'nan',3,'nan',5,6],'x3': ['nan',-2,'nan',-4,'nan','nan'],'x4': [1,-2,3,-4,5,6]}
data = pd.DataFrame(dict3)

Answer 1

If you don't need the intermediate columns:

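import numpy as np

# Draw +1 with probability 0.6 and -1 with probability 0.4, independently for each row
mask = np.random.choice([1,-1], p=[0.6,0.4], size=len(data))

data['x4'] = data['x1']*mask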

Of course, the intermediate columns are easy as well:

data['x2'] = data['x1'].where(mask==1)

data['x3'] = data['x1'].mask(mask==1)
# or data['x3'] = data['x1'].where(mask==-1)
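
Putting these pieces together, here is a minimal, self-contained sketch of the whole pipeline from the question, including the negation of the 40% column. The fixed seed and the use of `np.random.default_rng` are assumptions added only to make the run repeatable; they are not part of the answer above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]})

# Assumption: a seeded generator, used here only for repeatability
rng = np.random.default_rng(0)
mask = rng.choice([1, -1], p=[0.6, 0.4], size=len(data))

# Rows drawn as +1 keep their value in x2, the others become NaN
data['x2'] = data['x1'].where(mask == 1)
# Rows drawn as -1 go to x3, multiplied by -1
data['x3'] = -data['x1'].where(mask == -1)
# Merge the two partial columns back into one
data['x4'] = data['x2'].fillna(data['x3'])

print(data)
```

Because every row lands in exactly one of `x2` or `x3`, the merged `x4` matches `data['x1'] * mask`, only with a float dtype since the intermediate columns contain NaN.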

Answer 2

While the first answer proposes an elegant solution, it stretches the stated requirement to select roughly 60% of the rows. The problem is that it doesn't guarantee a 60/40 split. Because each row is drawn independently, the sampled values could by chance all be 1 or all be -1, in effect selecting all rows or none rather than roughly 60%.

The chance of this happening decreases as the dataframe grows, but it is never zero, and it shows up quickly when you try it with the provided example data.
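
To put numbers on this, the following sketch computes the binomial probabilities for the six-row example, assuming each row is independently negated with probability 0.4 as in the first answer. The helper name `prob_k_negated` is made up for this illustration.

```python
from math import comb

n = 6        # rows in the example dataframe
p_neg = 0.4  # probability that a given row is drawn as -1

def prob_k_negated(k, n=n, p=p_neg):
    """Binomial probability that exactly k of the n rows are drawn as -1."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_none = prob_k_negated(0)   # every row drawn as +1
p_all = prob_k_negated(n)    # every row drawn as -1
p_exact = prob_k_negated(2)  # 2 of 6 rows, which is what int(0.4 * 6) gives

print(f"no row negated:     {p_none:.4f}")
print(f"all rows negated:   {p_all:.4f}")
print(f"exactly 2 negated:  {p_exact:.4f}")
```

With six rows, no row is negated in roughly 4.7% of runs and all rows are negated in about 0.4%, while only about 31% of runs negate exactly two rows.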

If this is relevant to you, take a look at this code, which does guarantee a 60/40 split of the rows (up to rounding to a whole number of rows).

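# Pick exactly 40% of the row positions (rounded down), without replacement
indices = np.random.choice(len(data), size=int(0.4 * len(data)), replace=False)
# Note: isin compares index labels, so this relies on the default 0..n-1 index
data['x4'] = np.where(data.index.isin(indices), -1 * data['x1'], data['x1'])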
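
One caveat: `np.random.choice(len(data), ...)` returns row positions, whereas `data.index.isin(...)` compares index labels, so the two only agree when the dataframe has the default 0..n-1 index. A possible position-based variant, sketched here with a made-up string index and an assumed fixed seed, builds a sign array instead:

```python
import numpy as np
import pandas as pd

# Hypothetical non-default index, to show that labels are not used below
data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]}, index=list('abcdef'))

rng = np.random.default_rng(0)  # assumption: seeded only for repeatability
n_neg = int(0.4 * len(data))    # number of rows to negate, rounded down

# Choose n_neg distinct row positions
positions = rng.choice(len(data), size=n_neg, replace=False)

# +1 everywhere, -1 at the chosen positions
sign = np.ones(len(data), dtype=int)
sign[positions] = -1

data['x4'] = data['x1'] * sign
print(data)
```

The multiplication is elementwise by position, so the result does not depend on what the index labels look like.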

Update: One answer to your follow-up question proposes df.sample. Indeed, it lets you express the above much more elegantly:

indices = data.sample(frac=0.4).index
data['x4'] = np.where(data.index.isin(indices), -data['x1'], data['x1'])
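
If the selection needs to be repeatable, `DataFrame.sample` accepts a `random_state` argument. The sketch below assumes a seed of 0 and the six-row example data; since `sample` returns index labels, the `isin` check works as long as the labels are unique.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]})

# Assumption: random_state=0 is an arbitrary seed chosen for this example
chosen = data.sample(frac=0.4, random_state=0).index

# Negate the sampled rows, keep the rest unchanged
data['x4'] = np.where(data.index.isin(chosen), -data['x1'], data['x1'])

print(len(chosen), "of", len(data), "rows negated")
print(data)
```

Running it again with the same `random_state` picks the same rows, which makes the result easier to test or debug.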

Answer 1
2 2020-04-27 18:00:48

Answer 2
1 accepted 2020-04-27 19:01:08
