
Filling DataFrame with unique positive integers

I have a DataFrame that looks like this

   col1  col2  col3  col4  col5
0     0     1     0     1     1
1     0     1     0     0     1

I want to assign a unique positive integer greater than 1 to each 0 entry.

So I want a DataFrame that looks like this:

   col1  col2  col3  col4  col5
0     2     1     3     1     1
1     4     1     5     6     1

The integers don't have to be from an ordered sequence, just positive and unique.

np.arange(...).reshape(df.shape) generates a NumPy array the size of df, filled with consecutive integers starting at 2.

df.where(df != 0, ...) keeps every nonzero value (i.e. the ones) and takes the replacement from the NumPy array wherever the condition is False (the zeros). The original form, df.where(df, ...), relied on the 0/1 indicators being treated as a boolean condition; recent pandas versions expect a boolean mask, so df != 0 is the safer spelling.

# optional: inplace=True
>>> df.where(df != 0, np.arange(start=2, stop=df.shape[0] * df.shape[1] + 2).reshape(df.shape))
   col1  col2  col3  col4  col5
0     2     1     4     1     1
1     7     1     9    10     1
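
For anyone who wants to run this outside the REPL, here is a minimal self-contained sketch of the same idea. The DataFrame is rebuilt from the question's sample data, and the names candidates and filled are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [[0, 1, 0, 1, 1], [0, 1, 0, 0, 1]],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

# One candidate integer per cell, starting at 2 (illustrative name).
candidates = np.arange(2, df.size + 2).reshape(df.shape)

# Keep cells where the condition is True (the ones);
# take the candidate value wherever it is False (the zeros).
filled = df.where(df != 0, candidates)
print(filled)
```

Because a candidate is generated for every cell but only used where df is zero, the resulting integers are unique but not consecutive.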

I think you can use numpy.arange to generate unique integers with the same shape as df, and then replace every 0 using the boolean mask produced by df == 0:

print(df)
   col1  col2  col3  col4  col5
0     0     1     0     1     1
1     0     1     0     0     1

print(df == 0)
   col1   col2  col3   col4   col5
0  True  False  True  False  False
1  True  False  True   True  False

print(df.shape)
(2, 5)

# number of cells in df, i.e. how many integers are needed at most
min_count = df.shape[0] * df.shape[1]
print(min_count)
10

# add 2 to the stop value because the sequence starts at 2 (0 and 1 are skipped)
print(np.arange(start=2, stop=min_count + 2).reshape(df.shape))
[[ 2  3  4  5  6]
 [ 7  8  9 10 11]]

# fill the zeros with the integers from 2 to min_count + 1 at the matching positions
df[df == 0] = np.arange(start=2, stop=min_count + 2).reshape(df.shape)
print(df)
   col1  col2  col3  col4  col5
0     2     1     4     1     1
1     7     1     9    10     1
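
If consecutive integers are preferred (as in the question's example output), one possible variation is to generate only as many integers as there are zeros. This is a sketch rather than part of the answer above; it works on a copy of the underlying NumPy array, and the names values, mask and compact are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [[0, 1, 0, 1, 1], [0, 1, 0, 0, 1]],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

values = df.to_numpy(copy=True)   # copy, so df itself is left unchanged
mask = values == 0                # True where a zero needs replacing

# Boolean-mask assignment fills the selected cells in row-major order,
# so the zeros receive 2, 3, 4, ... one after another.
values[mask] = np.arange(2, 2 + mask.sum())

compact = pd.DataFrame(values, index=df.index, columns=df.columns)
print(compact)
#    col1  col2  col3  col4  col5
# 0     2     1     3     1     1
# 1     4     1     5     6     1
```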

Or use numpy.random.choice for bigger unique random integers:

# number of cells in df, i.e. how many integers are needed at most
min_count = df.shape[0] * df.shape[1]
print(min_count)
10
# the upper bound in np.arange can be larger, e.g. 100, but it must be at least min_count + 2
df[df == 0] = np.random.choice(np.arange(2, 100), replace=False, size=df.shape)
print(df)
   col1  col2  col3  col4  col5
0    17     1    53     1     1
1    39     1    15    76     1
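
The same random fill can also be sketched with NumPy's newer Generator interface, which makes the draw reproducible when a seed is supplied. The seed value 42 and the names rng, pool, draws and randomized are arbitrary choices for illustration, not something taken from the answer above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [[0, 1, 0, 1, 1], [0, 1, 0, 0, 1]],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

rng = np.random.default_rng(seed=42)   # arbitrary seed, only for repeatability
pool = np.arange(2, 100)               # candidate integers, all greater than 1

# Draw one distinct integer per cell; replace=False guarantees uniqueness.
draws = rng.choice(pool, size=df.shape, replace=False)

# Use the draws only where df is zero; the ones are kept as they are.
randomized = df.where(df != 0, draws)
print(randomized)
```

The pool must contain at least df.size values, otherwise sampling without replacement raises an error.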

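An explicit loop also works, although it is not the fastest approach in pandas:

import random

MAX_INT = 100

# draw as many distinct integers from [2, MAX_INT) as there are zeros
fill = iter(random.sample(range(2, MAX_INT), int((df == 0).sum().sum())))

for row in df.index:
    for col in df.columns:
        if df.at[row, col] == 0:
            df.at[row, col] = next(fill)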

Something like itertuples() will be faster, but if it's not a lot of data this is fine.

df[df == 0] = np.random.choice(np.arange(2, df.size + 2), replace=False, size=df.shape)

There are already a lot of good answers here, but I'm throwing this one out there as well.

  1. replace indicates whether the sample is with or without replacement.

  2. np.arange is from ( 2 , size of the df + 2 ). It's 2 because you want it greater than 1.

  3. size has to be the same shape as df so I just used df.shape

To illustrate what array values np.random.choice generates:

>>> np.random.choice(np.arange(2, df.size + 2), replace=False, size=df.shape)
array([[11,  4,  6,  5,  9],
       [ 7,  8, 10,  3,  2]])

Note that they are all greater than 1 and are all unique.

Before:

   col1  col2  col3  col4  col5
0     0     1     0     1     1
1     0     1     0     0     1

After:

   col1  col2  col3  col4  col5
0     9     1     7     1     1
1     6     1     3    11     1
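
A quick way to confirm a result like this is to check the three requirements directly. The helper below is a hypothetical sketch (check_fill is not a pandas function); it uses the Before and After tables shown above as its input.

```python
import numpy as np
import pandas as pd

def check_fill(before, after):
    """Return True if every zero in `before` became a distinct integer > 1
    in `after` and all other cells are unchanged."""
    old = before.to_numpy()
    new = after.to_numpy()
    zero_mask = old == 0
    new_values = new[zero_mask]
    untouched_ok = np.array_equal(old[~zero_mask], new[~zero_mask])
    all_above_one = bool((new_values > 1).all())
    all_unique = np.unique(new_values).size == new_values.size
    return bool(untouched_ok and all_above_one and all_unique)

cols = ["col1", "col2", "col3", "col4", "col5"]
before = pd.DataFrame([[0, 1, 0, 1, 1], [0, 1, 0, 0, 1]], columns=cols)
after = pd.DataFrame([[9, 1, 7, 1, 1], [6, 1, 3, 11, 1]], columns=cols)

print(check_fill(before, after))  # True
```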
