
Dropping a row if sum of columns equals an individual column

I have a dataframe that looks like this:

 Id   Var1_Belgium   var1_France  var1_Germany
 x     1               2            0
 y     1               0            0
 z     0               2            0
 u     1               3            2
 v     1               0            4

What I want is to drop any row where I only observe information in one country. So if the values in all countries but one are equal to zero, I want to omit the row. There are dozens of countries in the dataframe.

Another way to think about this problem is that if the sum of all the var1's is equal to an individual var1 column, the row should be dropped. Not sure if this makes it easier.

This is what should happen:

 Id   Var1_Belgium   var1_France  var1_Germany
 x     1               2            0
 u     1               3            2
 v     1               0            4

So any row in which only one country has a non-zero value should be dropped.

Note: there are more columns and variables than the ones above.

I'm trying to do this for a df with millions of observations, so an efficient method would be best.
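
A minimal sketch that rebuilds the sample data above and keeps only rows with data in more than one country (the column names are copied from the table; the construction itself is just an assumption for illustration):

import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    'Id':           ['x', 'y', 'z', 'u', 'v'],
    'Var1_Belgium': [1, 1, 0, 1, 1],
    'var1_France':  [2, 0, 2, 3, 0],
    'var1_Germany': [0, 0, 0, 2, 4],
})

# Country columns = everything except the Id column.
country = df.drop(columns='Id')

# Keep rows where more than one country has a non-zero value.
print(df[(country != 0).sum(axis=1) > 1])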

You can use filter() to select only the var1_-like columns and then use the condition (r != 0).sum(), which gives you the sum of 0 (False) and 1 (True) values. So if the sum is greater than 1, it means that more than one country had a non-zero value:

In [52]: df
Out[52]:
   Id  var1_Belgium  var1_France  var1_Germany
0   1             0            0           122
1   2             0          100           120
2   3           100            0             0
3   4             5            6             7
4   5            11           12            13

In [55]: df.filter(like='var1_').apply(lambda r: (r != 0), axis=1)
Out[55]:
  var1_Belgium var1_France var1_Germany
0        False       False         True
1        False        True         True
2         True       False        False
3         True        True         True
4         True        True         True


In [53]: df.filter(like='var1_').apply(lambda r: (r != 0).sum() > 1, axis=1)
Out[53]:
0    False
1     True
2    False
3     True
4     True
dtype: bool

Result

In [54]: df[df.filter(like='var1_').apply(lambda r: (r != 0).sum() > 1, axis=1)]
Out[54]:
   Id  var1_Belgium  var1_France  var1_Germany
1   2             0          100           120
3   4             5            6             7
4   5            11           12            13
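
Note that the row-wise apply above runs a Python lambda per row; on millions of rows, the same mask can be built with vectorized operations instead (a sketch, equivalent to the apply version for these columns):

# Vectorized equivalent of the apply-based mask above.
mask = (df.filter(like='var1_') != 0).sum(axis=1) > 1
df[mask]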

IIUC then I think this should work:

In [314]:
df[(df.ix[:,'Var1_Belgium':] == 0).sum(axis=1) < len(df.ix[:,'Var1_Belgium':].columns) - 1]

Out[314]:
  Id  Var1_Belgium  var1_France  var1_Germany
0  x             1            2             0
3  u             1            3             2
4  v             1            0             4

So this compares just the country columns against 0, sums the results, compares that against the number of columns - 1, and masks the rows that meet the criteria.

Or simpler:

In [315]:
df[(df.ix[:,'Var1_Belgium':] != 0).sum(axis=1) >  1]

Out[315]:
  Id  Var1_Belgium  var1_France  var1_Germany
0  x             1            2             0
3  u             1            3             2
4  v             1            0             4
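
Note: .ix has since been removed from pandas (in 1.0); the same label-based column slice can be written with .loc. A sketch, assuming the columns from 'Var1_Belgium' onward are the country columns:

# .loc replacement for the deprecated .ix label slice above.
country = df.loc[:, 'Var1_Belgium':]
df[(country != 0).sum(axis=1) > 1]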

Maybe the simplest is to use iloc to select all columns except the first:

print df[(df.iloc[:,1:] != 0).sum(axis=1) > 1]

  Id  Var1_Belgium  var1_France  var1_Germany
0  x             1            2             0
3  u             1            3             2
4  v             1            0             4

And maybe the best is to combine EdChum's and MaxU's solutions:

print df[(df.filter(like='var1') != 0).sum(1) > 1]
  Id  var1_Belgium  var1_France  var1_Germany
0  x             1            2             0
3  u             1            3             2
4  v             1            0             4
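
For very large frames, the non-zero count can also be computed directly on the underlying NumPy array; a sketch (any additional speed-up over the pandas sum is an assumption, not measured below):

import numpy as np

# filter(like=...) is case-sensitive, so this assumes lowercase 'var1' column names.
values = df.filter(like='var1').values
df[np.count_nonzero(values, axis=1) > 1]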

Timings:

df = pd.concat([df]*1000).reset_index(drop=True)

In [787]: %timeit df[df.filter(like='var1_').apply(lambda r: (r != 0).sum() > 1, axis=1)]
1 loops, best of 3: 746 ms per loop

In [788]: %timeit df[(df.ix[:,'Var1_Belgium':] != 0).sum(axis=1) >  1]
The slowest run took 4.49 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 1.39 ms per loop

In [789]: %timeit df[(df.filter(like='var1') != 0).sum(1) > 1]
The slowest run took 4.64 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 1.48 ms per loop

In [790]: %timeit df[(df.iloc[:,1:] != 0).sum(axis=1) > 1]
1000 loops, best of 3: 1.34 ms per loop
