I have a DataFrame that looks like this:
   col1  col2  col3  col4  col5
0     0     1     0     1     1
1     0     1     0     0     1
I want to assign a unique positive integer greater than 1 to each 0 entry, so I want a DataFrame that looks like this:
   col1  col2  col3  col4  col5
0     2     1     3     1     1
1     4     1     5     6     1
The integers don't have to be from an ordered sequence, just positive and unique.
np.arange(...).reshape(df.shape) generates a NumPy array the size of df, consisting of consecutive integers starting at 2.
df.where(df, ...) works because your DataFrame consists of binary indicators (zeros and ones). It keeps all true values (i.e. the ones) and then uses the NumPy array to fill in the zeros. Newer pandas versions may reject a non-boolean condition, so the call below converts df with astype(bool) first.
# optional: inplace=True
>>> df.where(df.astype(bool), np.arange(start=2, stop=df.shape[0] * df.shape[1] + 2).reshape(df.shape))
   col1  col2  col3  col4  col5
0     2     1     4     1     1
1     7     1     9    10     1
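For anyone who wants to run the df.where approach end to end, here is a minimal sketch. It assumes the frame from the question and passes an explicit boolean condition via df.astype(bool); the names filler and result are mine, not from the answer.

```python
import numpy as np
import pandas as pd

# The frame from the question: binary indicators only.
df = pd.DataFrame(
    [[0, 1, 0, 1, 1],
     [0, 1, 0, 0, 1]],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

# One candidate value per cell, starting at 2 so nothing collides with the existing 1s.
filler = np.arange(2, df.size + 2).reshape(df.shape)

# Keep cells where the condition is True (the ones); take the filler elsewhere (the zeros).
result = df.where(df.astype(bool), filler)
print(result)
```

Because the filler has one value per cell, the integers that line up with existing 1s are simply never used, which is why the result skips 3, 5, 6 and so on.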
I think you can use numpy.arange to generate unique integers with the same shape as df, and replace all 0 values using the boolean mask generated by df == 0:
print(df)
   col1  col2  col3  col4  col5
0     0     1     0     1     1
1     0     1     0     0     1

print(df == 0)
    col1   col2   col3   col4   col5
0   True  False   True  False  False
1   True  False   True   True  False

print(df.shape)
(2, 5)

# count of integers needed (one per cell)
min_count = df.shape[0] * df.shape[1]
print(min_count)
10

# add 2 to the stop value, because 0 and 1 are skipped
print(np.arange(start=2, stop=min_count + 2).reshape(df.shape))
[[ 2  3  4  5  6]
 [ 7  8  9 10 11]]

# use integers from 2 up to the number of cells in df, plus 1
df[df == 0] = np.arange(start=2, stop=min_count + 2).reshape(df.shape)
print(df)
   col1  col2  col3  col4  col5
0     2     1     4     1     1
1     7     1     9    10     1
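If you would rather get the consecutive values 2, 3, 4, ... shown in the question, instead of skipping the numbers that land on existing 1s, one option is to generate only as many integers as there are zeros. This is a sketch of that variation rather than part of the answer above; the names values, mask and out are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0, 1, 1],
     [0, 1, 0, 0, 1]],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

values = df.to_numpy().copy()   # work on a plain NumPy copy of the data
mask = values == 0              # True where a replacement is needed

# One integer per zero, starting at 2; filled in row-major order.
values[mask] = np.arange(2, 2 + mask.sum())

out = pd.DataFrame(values, index=df.index, columns=df.columns)
print(out)
```

Working on the NumPy array keeps the assignment to a single boolean-indexed write, and the original df is left untouched.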
Or use numpy.random.choice for bigger unique random integers:
# count of integers needed (one per cell)
min_count = df.shape[0] * df.shape[1]
print(min_count)
10

# you can use a bigger upper bound in np.arange, e.g. 100, but the minimum is min_count + 2
df[df == 0] = np.random.choice(np.arange(2, 100), replace=False, size=df.shape)
print(df)
   col1  col2  col3  col4  col5
0    17     1    53     1     1
1    39     1    15    76     1
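The same idea can be written with NumPy's Generator API if a reproducible draw is useful. This is only a sketch: the seed 0 and the upper bound 100 are arbitrary choices of mine, and it draws just as many values as there are zeros.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0, 1, 1],
     [0, 1, 0, 0, 1]],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

rng = np.random.default_rng(0)   # seed chosen arbitrarily, for repeatability only
mask = (df == 0).to_numpy()

# Draw distinct integers from 2..99, one per zero cell.
draws = rng.choice(np.arange(2, 100), size=int(mask.sum()), replace=False)

values = df.to_numpy().copy()
values[mask] = draws
out = pd.DataFrame(values, index=df.index, columns=df.columns)
print(out)
```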
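This will work, although it won't perform well in pandas:
import random
MAX_INT = 100
# draw one distinct value from 2..MAX_INT-1 for every zero cell
n_zeros = int((df == 0).to_numpy().sum())
values = iter(random.sample(range(2, MAX_INT), n_zeros))
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        if df.iat[i, j] == 0:
            df.iat[i, j] = next(values)
Something like itertuples() will be faster, but if it's not a lot of data this is fine.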
df[df == 0] = np.random.choice(np.arange(2, df.size + 2), replace=False, size=df.shape)
Lot of already good answers here but throwing this out there.
replace indicates whether the sample is with or without replacement.
np.arange is from (2, size of the df + 2). It's 2 because you want it greater than 1.
size has to be the same shape as df, so I just used df.shape.
To illustrate what array values np.random.choice generates:
>>> np.random.choice(np.arange(2, df.size + 2), replace=False, size=df.shape)
array([[11,  4,  6,  5,  9],
       [ 7,  8, 10,  3,  2]])
Note that they are all greater than 1 and are all unique.
Before:
   col1  col2  col3  col4  col5
0     0     1     0     1     1
1     0     1     0     0     1
After:
   col1  col2  col3  col4  col5
0     9     1     7     1     1
1     6     1     3    11     1
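To double-check the "greater than 1 and unique" claim on your own data, a small helper along these lines may be handy. check_filled is a hypothetical name of mine, not a pandas function, and the sketch assumes the original frame holds only zeros and ones.

```python
import numpy as np
import pandas as pd

def check_filled(before: pd.DataFrame, after: pd.DataFrame) -> bool:
    """True if every former zero now holds a distinct value > 1 and all other cells are unchanged."""
    mask = (before == 0).to_numpy()
    new_vals = after.to_numpy()[mask]
    unchanged = (after.to_numpy()[~mask] == before.to_numpy()[~mask]).all()
    all_unique = len(np.unique(new_vals)) == new_vals.size
    return bool(unchanged and (new_vals > 1).all() and all_unique)

before = pd.DataFrame(
    [[0, 1, 0, 1, 1],
     [0, 1, 0, 0, 1]],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

# Same sampling as in the answer: distinct integers from 2..df.size + 1, one per cell.
candidates = np.random.choice(np.arange(2, before.size + 2), replace=False, size=before.shape)
after = before.where(before != 0, candidates)

print(check_filled(before, after))
```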