
Dropping a row if sum of columns equals an individual column

I have a dataframe that looks like this:

 Id   Var1_Belgium   var1_France  var1_Germany
 x     1               2            0
 y     1               0            0
 z     0               2            0
 u     1               3            2
 v     1               0            4

What I want is to drop any row where I only observe information in one country. So if the values in all countries but one are equal to zero, I want to omit the row. There are dozens of countries in the dataframe.

Another way to think about this problem: if the sum of all the var1 columns is equal to an individual var1 column, the row should be dropped. Not sure if this makes it easier.
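
If it helps, that reformulation can be written down directly. A minimal sketch, assuming the var1 values are non-negative (so a row's sum equals its max exactly when at most one country is non-zero):

cols = [c for c in df.columns if c.lower().startswith('var1')]  # country columns, case-insensitive
sub = df[cols]
# with non-negative values, sum == max means at most one non-zero country, so keep the rest
df = df[sub.sum(axis=1) != sub.max(axis=1)]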

This is what should happen:

 Id   Var1_Belgium   var1_France  var1_Germany
 x     1               2            0
 u     1               3            2
 v     1               0            4

So any row in which only 1 country has a non-zero value should be dropped.

Note: there are more columns and variables than the ones above.

I'm trying to do this for a df with millions of observations, so an efficient method would be best.

You can use filter() to select only the var1_-like columns and then use an (r != 0).sum() condition: summing the booleans counts False as 0 and True as 1, so if the sum is greater than 1, more than one country has a non-zero value:

In [52]: df
Out[52]:
   Id  var1_Belgium  var1_France  var1_Germany
0   1             0            0           122
1   2             0          100           120
2   3           100            0             0
3   4             5            6             7
4   5            11           12            13

In [55]: df.filter(like='var1_').apply(lambda r: (r != 0), axis=1)
Out[55]:
  var1_Belgium var1_France var1_Germany
0        False       False         True
1        False        True         True
2         True       False        False
3         True        True         True
4         True        True         True


In [53]: df.filter(like='var1_').apply(lambda r: (r != 0).sum() > 1, axis=1)
Out[53]:
0    False
1     True
2    False
3     True
4     True
dtype: bool

Result

In [54]: df[df.filter(like='var1_').apply(lambda r: (r != 0).sum() > 1, axis=1)]
Out[54]:
   Id  var1_Belgium  var1_France  var1_Germany
1   2             0          100           120
3   4             5            6             7
4   5            11           12            13

IIUC then I think this should work:

In [314]:
df[(df.loc[:,'Var1_Belgium':] == 0).sum(axis=1) < len(df.loc[:,'Var1_Belgium':].columns) - 1]

Out[314]:
  Id  Var1_Belgium  var1_France  var1_Germany
0  x             1            2             0
3  u             1            3             2
4  v             1            0             4

So this compares just the country columns against 0, sums the matches across each row, compares that count against the number of columns minus 1, and keeps the rows that meet the criteria.

Or simpler:

In [315]:
df[(df.loc[:,'Var1_Belgium':] != 0).sum(axis=1) > 1]

Out[315]:
  Id  Var1_Belgium  var1_France  var1_Germany
0  x             1            2             0
3  u             1            3             2
4  v             1            0             4

Maybe the simplest is to use iloc to select all columns except the first:

print(df[(df.iloc[:,1:] != 0).sum(axis=1) > 1])

  Id  Var1_Belgium  var1_France  var1_Germany
0  x             1            2             0
3  u             1            3             2
4  v             1            0             4

And maybe the best is a combination of EdChum's and MaxU's solutions:

print(df[(df.filter(like='var1') != 0).sum(1) > 1])
  Id  var1_Belgium  var1_France  var1_Germany
0  x             1            2             0
3  u             1            3             2
4  v             1            0             4
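
One caveat: the question's headers mix case (Var1_Belgium vs var1_France), so like='var1' would silently miss the Belgium column on the original data. If your real columns look like that, a case-insensitive regex should catch both; a sketch:

print(df[(df.filter(regex='(?i)^var1_') != 0).sum(1) > 1])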

Timings:

df = pd.concat([df]*1000).reset_index(drop=True)

In [787]: %timeit df[df.filter(like='var1_').apply(lambda r: (r != 0).sum() > 1, axis=1)]
1 loops, best of 3: 746 ms per loop

In [788]: %timeit df[(df.loc[:,'Var1_Belgium':] != 0).sum(axis=1) > 1]
The slowest run took 4.49 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 1.39 ms per loop

In [789]: %timeit df[(df.filter(like='var1') != 0).sum(1) > 1]
The slowest run took 4.64 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 1.48 ms per loop

In [790]: %timeit df[(df.iloc[:,1:] != 0).sum(axis=1) > 1]
1000 loops, best of 3: 1.34 ms per loop
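
For completeness, dropping to NumPy for the counting step might shave a little more off, though I haven't timed it here. Like the iloc version, this assumes that every column after Id holds country values:

import numpy as np

# count non-zero entries per row on the underlying array, then mask
print(df[np.count_nonzero(df.iloc[:, 1:].to_numpy(), axis=1) > 1])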
