Randomly selecting rows from dataframe column

For a given dataframe column, I would like to randomly select roughly 60% of the rows and put them in a new column, put the remaining 40% in another column, multiply the 40% column by -1, and create a new column that merges the two back together, like so:

import numpy as np
import pandas as pd

# Starting data
dict0 = {'x1': [1,2,3,4,5,6]}
data = pd.DataFrame(dict0)

# Roughly 60% of the rows go to x2, the remaining 40% to x3
dict1 = {'x1': [1,2,3,4,5,6],'x2': [1,np.nan,3,np.nan,5,6],'x3': [np.nan,2,np.nan,4,np.nan,np.nan]}
data = pd.DataFrame(dict1)

# The 40% column is multiplied by -1
dict2 = {'x1': [1,2,3,4,5,6],'x2': [1,np.nan,3,np.nan,5,6],'x3': [np.nan,-2,np.nan,-4,np.nan,np.nan]}
data = pd.DataFrame(dict2)

# x2 and x3 are merged back together into x4
dict3 = {'x1': [1,2,3,4,5,6],'x2': [1,np.nan,3,np.nan,5,6],'x3': [np.nan,-2,np.nan,-4,np.nan,np.nan],'x4': [1,-2,3,-4,5,6]}
data = pd.DataFrame(dict3)

If you don't need the intermediate columns:

mask = np.random.choice([1,-1], p=[0.6,0.4], size=len(data))

data['x4'] = data['x1']*mask

Of course the intermediate columns are easy as well:

data['x2'] = data['x1'].where(mask==1)

data['x3'] = data['x1'].mask(mask==1)
# or data['x3'] = data['x1'].where(mask==-1)
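
Putting the pieces above together, here is a minimal end-to-end sketch. It assumes a seeded `np.random.default_rng` generator (the seed value is arbitrary and only there for repeatability), negates x3 directly so that it matches the question's intermediate step, and merges the two columns with `fillna`, which is one possible way to do it rather than the answer's own method.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]})

# Seeded generator so the draw is repeatable (the seed value is arbitrary)
rng = np.random.default_rng(42)

# Each row independently gets 1 with probability 0.6 and -1 with probability 0.4
mask = rng.choice([1, -1], p=[0.6, 0.4], size=len(data))

data['x2'] = data['x1'].where(mask == 1)     # kept rows, NaN elsewhere
data['x3'] = -data['x1'].where(mask == -1)   # negated rows, NaN elsewhere
data['x4'] = data['x2'].fillna(data['x3'])   # merge the two columns back together

print(data)
```

Because x2 and x3 contain NaN, they (and the merged x4) come out as float columns; `data['x1'] * mask` keeps the integer dtype if that matters.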

While the first answer proposes an elegant solution, it only loosely meets the stated requirement of selecting roughly 60% of the rows, because it doesn't guarantee a 60/40 split. Each row is drawn independently, so the mask can by chance come out all 1 or all -1, in effect selecting all or none of the rows rather than roughly 60%.

The chance of this occurring shrinks as the dataframe grows, but it is never zero, and skewed splits show up quickly when you try it on the small example data provided.
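
To put rough numbers on this for the six-row example, the following sketch computes the relevant binomial probabilities, assuming each row is negated independently with probability 0.4 as in the first answer. The variable names are only illustrative.

```python
from math import comb

n = 6        # rows in the example dataframe
p_neg = 0.4  # per-row probability of drawing -1

# Probability that no row is negated, or that every row is negated
p_all_kept = (1 - p_neg) ** n
p_all_negated = p_neg ** n

# Probability of hitting the closest whole-row 60/40 split (2 of 6 negated)
k = round(p_neg * n)
p_exact = comb(n, k) * p_neg**k * (1 - p_neg) ** (n - k)

print(f"all kept:    {p_all_kept:.3f}")
print(f"all negated: {p_all_negated:.3f}")
print(f"exact split: {p_exact:.3f}")
```

For six rows this works out to about 4.7% for no negated rows, 0.4% for all rows negated, and only about 31% for an exact 4/2 split.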

If this matters to you, take a look at this code, which guarantees a 60/40 split of the rows (up to rounding to a whole number of rows).

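# Pick exactly 40% of the row positions (rounded down), without replacement
indices = np.random.choice(len(data), size=int(0.4 * len(data)), replace=False)
# Negate the chosen rows; this assumes the default RangeIndex, where positions equal labels
data['x4'] = np.where(data.index.isin(indices), -1 * data['x1'], data['x1'])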
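
The snippet above compares row positions against index labels, which only lines up when the dataframe has the default 0..n-1 index. If your index is something else, one possible variant is to build a positional boolean array instead. This is a sketch under that assumption; the string index, the `negate` array and the seed are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same data as the question, but with a non-default (string) index
data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]}, index=list('abcdef'))

rng = np.random.default_rng(0)   # arbitrary seed, for repeatability only
n_neg = int(0.4 * len(data))     # number of rows to negate, rounded down

# Draw row positions without replacement and mark them in a boolean array
positions = rng.choice(len(data), size=n_neg, replace=False)
negate = np.zeros(len(data), dtype=bool)
negate[positions] = True

# Negate by position, so the index labels never enter the comparison
data['x4'] = np.where(negate, -data['x1'], data['x1'])

print(data)
```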

Update: One answer to your follow-up question proposes df.sample. Indeed, it lets you express the above much more elegantly:

indices = data.sample(frac=0.4).index
data['x4'] = np.where(data.index.isin(indices), -data['x1'], data['x1'])
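
As a quick check of the df.sample version on the example data, here is a self-contained sketch. The `random_state` argument is an addition for repeatability and its value is arbitrary.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]})

# Sample 40% of the rows; random_state is an arbitrary seed for repeatability
indices = data.sample(frac=0.4, random_state=0).index

# Negate the sampled rows and keep the rest unchanged
data['x4'] = np.where(data.index.isin(indices), -data['x1'], data['x1'])

n_negated = int((data['x4'] < 0).sum())
print(n_negated, "of", len(data), "rows negated")
```

Since `sample` returns index labels rather than positions, the `isin` comparison here should also work when the index is not the default one, as long as the labels are unique.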
