
Deleting DataFrame row in Pandas based on column value

I have the following DataFrame:

             daysago  line_race rating        rw    wrating
 line_date                                                 
 2007-03-31       62         11     56  1.000000  56.000000
 2007-03-10       83         11     67  1.000000  67.000000
 2007-02-10      111          9     66  1.000000  66.000000
 2007-01-13      139         10     83  0.880678  73.096278
 2006-12-23      160         10     88  0.793033  69.786942
 2006-11-09      204          9     52  0.636655  33.106077
 2006-10-22      222          8     66  0.581946  38.408408
 2006-09-29      245          9     70  0.518825  36.317752
 2006-09-16      258         11     68  0.486226  33.063381
 2006-08-30      275          8     72  0.446667  32.160051
 2006-02-11      475          5     65  0.164591  10.698423
 2006-01-13      504          0     70  0.142409   9.968634
 2006-01-02      515          0     64  0.134800   8.627219
 2005-12-06      542          0     70  0.117803   8.246238
 2005-11-29      549          0     70  0.113758   7.963072
 2005-11-22      556          0     -1  0.109852  -0.109852
 2005-11-01      577          0     -1  0.098919  -0.098919
 2005-10-20      589          0     -1  0.093168  -0.093168
 2005-09-27      612          0     -1  0.083063  -0.083063
 2005-09-07      632          0     -1  0.075171  -0.075171
 2005-06-12      719          0     69  0.048690   3.359623
 2005-05-29      733          0     -1  0.045404  -0.045404
 2005-05-02      760          0     -1  0.039679  -0.039679
 2005-04-02      790          0     -1  0.034160  -0.034160
 2005-03-13      810          0     -1  0.030915  -0.030915
 2004-11-09      934          0     -1  0.016647  -0.016647

I need to remove the rows where line_race is equal to 0. What's the most efficient way to do this?

If I understand correctly, it should be as simple as:

df = df[df.line_race != 0]

But for any future readers, it is worth mentioning that this style of comparison doesn't do anything when trying to filter out None/missing values: df = df[df.line_race != None] leaves the frame unchanged.

Does work:

df = df[df.line_race != 0]

Doesn't do anything:

df = df[df.line_race != None]

Does work:

df = df[df.line_race.notnull()]
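To make the difference concrete, here is a minimal sketch on a small invented frame (only the column name line_race comes from the question). Comparing against None with != keeps every row, while notna(), an alias of notnull(), removes the missing one.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value and one zero in line_race.
df = pd.DataFrame({"line_race": [11.0, np.nan, 0.0, 9.0]})

# Element-wise "!= None" is True for every row, so nothing is filtered out.
kept_ne_none = df[df["line_race"] != None]  # noqa: E711

# notna() is False for the missing value, so that row is removed.
kept_notna = df[df["line_race"].notna()]

# Both conditions can be combined to drop zeros and missing values together.
kept_clean = df[df["line_race"].notna() & (df["line_race"] != 0)]

print(len(kept_ne_none), len(kept_notna), len(kept_clean))  # 4 3 2
```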

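Just to add another solution, particularly useful if you are using the new pandas accessors: the other solutions replace the original DataFrame with a new object and lose the accessors.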

df.drop(df.loc[df['line_race']==0].index, inplace=True)
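The practical difference between this in-place drop and the reassignment style is object identity. A small sketch, with invented data and a hypothetical second reference named alias, of what happens in each case:

```python
import pandas as pd

df = pd.DataFrame({"line_race": [11, 0, 9, 0], "rating": [56, 70, 66, -1]})
alias = df  # a second reference to the same DataFrame object

# In-place drop: the object that both names refer to is modified.
df.drop(df.loc[df["line_race"] == 0].index, inplace=True)
same_object = alias is df      # True: no new object was created
alias_rows = len(alias)        # 2: the alias sees the removal as well

# Reassignment: df now points at a new object, alias still holds the old one.
df = df[df["rating"] != 56]
rebound = alias is df          # False
```

Anything else that holds a reference to the original frame keeps seeing the filtered data only in the in-place case.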

If you want to delete rows based on multiple values of the column, you could use:

df[(df.line_race != 0) & (df.line_race != 10)]

This drops all rows where line_race is 0 or 10.
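When the list of unwanted values grows, chaining != clauses gets unwieldy. As a sketch on invented data, the same filter can be written with isin(); the variable name unwanted is just for illustration. Either expression returns a new frame, so assign the result if you want to keep it.

```python
import pandas as pd

df = pd.DataFrame({"line_race": [11, 10, 0, 9, 10, 0]})

# Chained comparisons, one clause per unwanted value.
chained = df[(df.line_race != 0) & (df.line_race != 10)]

# The same filter with isin(); the list can grow without adding more clauses.
unwanted = [0, 10]
with_isin = df[~df["line_race"].isin(unwanted)]

print(with_isin["line_race"].tolist())  # [11, 9]
```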

The best way to do this is with boolean masking:

In [56]: df
Out[56]:
     line_date  daysago  line_race  rating    raw  wrating
0   2007-03-31       62         11      56  1.000   56.000
1   2007-03-10       83         11      67  1.000   67.000
2   2007-02-10      111          9      66  1.000   66.000
3   2007-01-13      139         10      83  0.881   73.096
4   2006-12-23      160         10      88  0.793   69.787
5   2006-11-09      204          9      52  0.637   33.106
6   2006-10-22      222          8      66  0.582   38.408
7   2006-09-29      245          9      70  0.519   36.318
8   2006-09-16      258         11      68  0.486   33.063
9   2006-08-30      275          8      72  0.447   32.160
10  2006-02-11      475          5      65  0.165   10.698
11  2006-01-13      504          0      70  0.142    9.969
12  2006-01-02      515          0      64  0.135    8.627
13  2005-12-06      542          0      70  0.118    8.246
14  2005-11-29      549          0      70  0.114    7.963
15  2005-11-22      556          0      -1  0.110   -0.110
16  2005-11-01      577          0      -1  0.099   -0.099
17  2005-10-20      589          0      -1  0.093   -0.093
18  2005-09-27      612          0      -1  0.083   -0.083
19  2005-09-07      632          0      -1  0.075   -0.075
20  2005-06-12      719          0      69  0.049    3.360
21  2005-05-29      733          0      -1  0.045   -0.045
22  2005-05-02      760          0      -1  0.040   -0.040
23  2005-04-02      790          0      -1  0.034   -0.034
24  2005-03-13      810          0      -1  0.031   -0.031
25  2004-11-09      934          0      -1  0.017   -0.017

In [57]: df[df.line_race != 0]
Out[57]:
     line_date  daysago  line_race  rating    raw  wrating
0   2007-03-31       62         11      56  1.000   56.000
1   2007-03-10       83         11      67  1.000   67.000
2   2007-02-10      111          9      66  1.000   66.000
3   2007-01-13      139         10      83  0.881   73.096
4   2006-12-23      160         10      88  0.793   69.787
5   2006-11-09      204          9      52  0.637   33.106
6   2006-10-22      222          8      66  0.582   38.408
7   2006-09-29      245          9      70  0.519   36.318
8   2006-09-16      258         11      68  0.486   33.063
9   2006-08-30      275          8      72  0.447   32.160
10  2006-02-11      475          5      65  0.165   10.698

UPDATE: Now that pandas 0.13 is out, another way to do this is df.query('line_race != 0').
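A short sketch of query() on invented data, showing that it selects the same rows as the boolean mask and that local Python variables can be referenced with the @ prefix (the name excluded is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"line_race": [11, 10, 0, 9, 0], "rating": [56, 83, 70, 52, -1]})

# Same filter as the boolean mask df[df.line_race != 0].
by_query = df.query("line_race != 0")
by_mask = df[df.line_race != 0]

# Local variables are available inside the expression via @.
excluded = [0, 10]
by_query_list = df.query("line_race not in @excluded")

print(by_query_list["line_race"].tolist())  # [11, 9]
```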

Though the previous answers are almost the same as what I am going to do, indexing df.index directly does not require the additional .loc indexer. It can be done in a similar but more concise manner:

df.drop(df.index[df['line_race'] == 0], inplace = True)

In case of multiple values and str dtype

I used the following to filter out given values in a column:

def filter_rows_by_values(df, col, values):
    return df[~df[col].isin(values)]

Example:

In a DataFrame I want to remove rows which have values "b" and "c" in column "str":

df = pd.DataFrame({"str": ["a","a","a","a","b","b","c"], "other": [1,2,3,4,5,6,7]})
df
   str  other
0   a   1
1   a   2
2   a   3
3   a   4
4   b   5
5   b   6
6   c   7

filter_rows_by_values(df, "str", ["b","c"])

   str  other
0   a   1
1   a   2
2   a   3
3   a   4

The given answer is correct. Nonetheless, as someone above said, you can use df.query('line_race != 0'), which depending on your problem can be much faster. Highly recommended.

An efficient and idiomatic pandas way is to use the eq() method:

df[~df.line_race.eq(0)]

Another way of doing it. It may not be the most efficient way, as the code looks a bit more complex than the code in the other answers, but it is still an alternative way of doing the same thing.

  df = df.drop(df[df['line_race']==0].index)

Adding one more way to do this.

 df = df.query("line_race!=0")

I compiled and ran my code. This code is accurate. You can try it yourself.

data = pd.read_excel('file.xlsx')

If you have any special character or space in the column name, you can write it in quotes inside square brackets, as in the given code:

data = data[data['expire/t'].notnull()]
print(data)

If the column name is a single string without any space or special character, you can access it directly as an attribute.

data = data[data.expire != 0]
print(data)

Just adding another way, for a filter applied across all columns of the DataFrame:

for column in df.columns:
   df = df[df[column]!=0]

Example:

def z_score(data,count):
   threshold=3
   for column in data.columns:
       mean = np.mean(data[column])
       std = np.std(data[column])
       for i in data[column]:
           zscore = (i-mean)/std
           if(np.abs(zscore)>threshold):
               count=count+1
               data = data[data[column]!=i]
   return data,count
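For comparison, here is a vectorized sketch of the same idea. It is not identical to the loop above: it computes each column's mean and standard deviation once on the unfiltered data, counts removed rows rather than flagged values, and assumes no column is constant (a zero standard deviation would yield NaN z-scores and drop every row). The function name drop_zscore_outliers and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd

def drop_zscore_outliers(data, threshold=3.0):
    """Return rows whose z-score is within the threshold in every column."""
    # ddof=0 matches np.std's default (population standard deviation).
    z = (data - data.mean()) / data.std(ddof=0)
    keep = (z.abs() <= threshold).all(axis=1)
    return data[keep], int((~keep).sum())

data = pd.DataFrame({
    "a": [1.0] * 19 + [100.0],        # the last value is an obvious outlier
    "b": np.arange(20, dtype=float),  # no outliers
})
cleaned, removed = drop_zscore_outliers(data)
print(len(cleaned), removed)  # 19 1
```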

Just in case you need to delete the row when the value can be in different columns. In my case I was using percentages, so I wanted to delete the rows which have a value of 1 in any column, since that means 100%.

for x in df:
    df.drop(df.loc[df[x]==1].index, inplace=True)

This is not optimal if your df has too many columns.
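A possible alternative that avoids one drop per column is a single row-wise mask built with any(axis=1). This is a sketch on invented percentage data, not a benchmarked replacement:

```python
import pandas as pd

df = pd.DataFrame({"x": [0.2, 1.0, 0.5, 0.3], "y": [0.8, 0.0, 0.5, 1.0]})

# True for rows where at least one column equals 1.
has_one = (df == 1).any(axis=1)

# Keep the remaining rows in a single pass.
df = df[~has_one]
print(df.index.tolist())  # [0, 2]
```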

So many options have been provided (or maybe I didn't pay much attention, sorry if that's the case), but no one mentioned this: we can use the ~ notation in pandas, which gives us the inverse of the condition:

df = df[~(df["line_race"] == 0)]
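One thing to watch with ~ is operator precedence: it binds more tightly than ==, so the comparison needs its own parentheses. The sketch below, on an invented integer Series, shows what each spelling evaluates to.

```python
import pandas as pd

s = pd.Series([11, 0, -1, 9], name="line_race")

# Without parentheses, ~ is applied to the integers first (bitwise NOT, ~x == -x - 1),
# so the comparison is True only where the original value is -1.
unparenthesized = ~s == 0

# With parentheses, the comparison runs first and ~ inverts the boolean mask.
parenthesized = ~(s == 0)

print(unparenthesized.tolist())  # [False, False, True, False]
print(parenthesized.tolist())    # [True, False, True, True]
```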

There are various ways to achieve that. Below are several options that one can use, depending on the specifics of one's use case.

Assume that OP's dataframe is stored in the variable df.


Option 1

For OP's case, considering that the only column with values 0 is line_race, the following will do the work

 df_new = df[df != 0].dropna()
 
[Out]:
     line_date  daysago  line_race  rating        rw    wrating
0   2007-03-31       62       11.0      56  1.000000  56.000000
1   2007-03-10       83       11.0      67  1.000000  67.000000
2   2007-02-10      111        9.0      66  1.000000  66.000000
3   2007-01-13      139       10.0      83  0.880678  73.096278
4   2006-12-23      160       10.0      88  0.793033  69.786942
5   2006-11-09      204        9.0      52  0.636655  33.106077
6   2006-10-22      222        8.0      66  0.581946  38.408408
7   2006-09-29      245        9.0      70  0.518825  36.317752
8   2006-09-16      258       11.0      68  0.486226  33.063381
9   2006-08-30      275        8.0      72  0.446667  32.160051
10  2006-02-11      475        5.0      65  0.164591  10.698423

However, as that is not always the case, I would recommend checking the following options, where one specifies the column name.
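A small sketch, on invented data, of why the whole-frame mask is risky when another column can also contain 0. It also shows why line_race is printed as 11.0 rather than 11: masking inserts NaN, which converts integer columns to float.

```python
import pandas as pd

df = pd.DataFrame({"line_race": [11, 0, 9], "rating": [56, 70, 0]})

# Whole-frame mask: a 0 in *any* column turns into NaN and the row is dropped.
whole_frame = df[df != 0].dropna()

# Column-specific mask: only line_race is inspected.
one_column = df[df["line_race"] != 0]

print(len(whole_frame), len(one_column))  # 1 2
print(whole_frame["line_race"].dtype)     # float64
```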


Option 2

tshauck's approach ends up being better than Option 1, because one is able to specify the column. There are, however, additional variations depending on how one wants to refer to the column:

For example, using the position in the dataframe

df_new = df[df[df.columns[2]] != 0]

Or by explicitly indicating the column as follows

df_new = df[df['line_race'] != 0]

One can also follow the same logic but using a custom lambda function, such as

df_new = df[df.apply(lambda x: x['line_race'] != 0, axis=1)]

[Out]:
     line_date  daysago  line_race  rating        rw    wrating
0   2007-03-31       62       11.0      56  1.000000  56.000000
1   2007-03-10       83       11.0      67  1.000000  67.000000
2   2007-02-10      111        9.0      66  1.000000  66.000000
3   2007-01-13      139       10.0      83  0.880678  73.096278
4   2006-12-23      160       10.0      88  0.793033  69.786942
5   2006-11-09      204        9.0      52  0.636655  33.106077
6   2006-10-22      222        8.0      66  0.581946  38.408408
7   2006-09-29      245        9.0      70  0.518825  36.317752
8   2006-09-16      258       11.0      68  0.486226  33.063381
9   2006-08-30      275        8.0      72  0.446667  32.160051
10  2006-02-11      475        5.0      65  0.164591  10.698423

Option 3

Using pandas.Series.map and a custom lambda function to build the boolean mask

df_new = df[df['line_race'].map(lambda x: x != 0)]

[Out]:
     line_date  daysago  line_race  rating        rw    wrating
0   2007-03-31       62       11.0      56  1.000000  56.000000
1   2007-03-10       83       11.0      67  1.000000  67.000000
2   2007-02-10      111        9.0      66  1.000000  66.000000
3   2007-01-13      139       10.0      83  0.880678  73.096278
4   2006-12-23      160       10.0      88  0.793033  69.786942
5   2006-11-09      204        9.0      52  0.636655  33.106077
6   2006-10-22      222        8.0      66  0.581946  38.408408
7   2006-09-29      245        9.0      70  0.518825  36.317752
8   2006-09-16      258       11.0      68  0.486226  33.063381
9   2006-08-30      275        8.0      72  0.446667  32.160051
10  2006-02-11      475        5.0      65  0.164591  10.698423

Option 4

Using pandas.DataFrame.drop as follows

df_new = df.drop(df[df['line_race'] == 0].index)

[Out]:
     line_date  daysago  line_race  rating        rw    wrating
0   2007-03-31       62       11.0      56  1.000000  56.000000
1   2007-03-10       83       11.0      67  1.000000  67.000000
2   2007-02-10      111        9.0      66  1.000000  66.000000
3   2007-01-13      139       10.0      83  0.880678  73.096278
4   2006-12-23      160       10.0      88  0.793033  69.786942
5   2006-11-09      204        9.0      52  0.636655  33.106077
6   2006-10-22      222        8.0      66  0.581946  38.408408
7   2006-09-29      245        9.0      70  0.518825  36.317752
8   2006-09-16      258       11.0      68  0.486226  33.063381
9   2006-08-30      275        8.0      72  0.446667  32.160051
10  2006-02-11      475        5.0      65  0.164591  10.698423

Option 5

Using pandas.DataFrame.query as follows

df_new = df.query('line_race != 0')

[Out]:
     line_date  daysago  line_race  rating        rw    wrating
0   2007-03-31       62       11.0      56  1.000000  56.000000
1   2007-03-10       83       11.0      67  1.000000  67.000000
2   2007-02-10      111        9.0      66  1.000000  66.000000
3   2007-01-13      139       10.0      83  0.880678  73.096278
4   2006-12-23      160       10.0      88  0.793033  69.786942
5   2006-11-09      204        9.0      52  0.636655  33.106077
6   2006-10-22      222        8.0      66  0.581946  38.408408
7   2006-09-29      245        9.0      70  0.518825  36.317752
8   2006-09-16      258       11.0      68  0.486226  33.063381
9   2006-08-30      275        8.0      72  0.446667  32.160051
10  2006-02-11      475        5.0      65  0.164591  10.698423

Option 6

Using pandas.DataFrame.drop and pandas.DataFrame.query as follows

df_new = df.drop(df.query('line_race == 0').index)

[Out]:
     line_date  daysago  line_race  rating        rw    wrating
0   2007-03-31       62       11.0      56  1.000000  56.000000
1   2007-03-10       83       11.0      67  1.000000  67.000000
2   2007-02-10      111        9.0      66  1.000000  66.000000
3   2007-01-13      139       10.0      83  0.880678  73.096278
4   2006-12-23      160       10.0      88  0.793033  69.786942
5   2006-11-09      204        9.0      52  0.636655  33.106077
6   2006-10-22      222        8.0      66  0.581946  38.408408
7   2006-09-29      245        9.0      70  0.518825  36.317752
8   2006-09-16      258       11.0      68  0.486226  33.063381
9   2006-08-30      275        8.0      72  0.446667  32.160051
10  2006-02-11      475        5.0      65  0.164591  10.698423

Option 7

If one doesn't have strong opinions on the output, one can use a vectorized approach with numpy.select

df_new = np.select([df != 0], [df], default=np.nan)

[Out]:
[['2007-03-31' 62 11.0 56 1.0 56.0]
 ['2007-03-10' 83 11.0 67 1.0 67.0]
 ['2007-02-10' 111 9.0 66 1.0 66.0]
 ['2007-01-13' 139 10.0 83 0.880678 73.096278]
 ['2006-12-23' 160 10.0 88 0.793033 69.786942]
 ['2006-11-09' 204 9.0 52 0.636655 33.106077]
 ['2006-10-22' 222 8.0 66 0.581946 38.408408]
 ['2006-09-29' 245 9.0 70 0.518825 36.317752]
 ['2006-09-16' 258 11.0 68 0.486226 33.063381]
 ['2006-08-30' 275 8.0 72 0.446667 32.160051]
 ['2006-02-11' 475 5.0 65 0.164591 10.698423]]

This can also be converted to a dataframe with

df_new = pd.DataFrame(df_new, columns=df.columns)

[Out]:
     line_date daysago line_race rating        rw    wrating
0   2007-03-31      62      11.0     56       1.0       56.0
1   2007-03-10      83      11.0     67       1.0       67.0
2   2007-02-10     111       9.0     66       1.0       66.0
3   2007-01-13     139      10.0     83  0.880678  73.096278
4   2006-12-23     160      10.0     88  0.793033  69.786942
5   2006-11-09     204       9.0     52  0.636655  33.106077
6   2006-10-22     222       8.0     66  0.581946  38.408408
7   2006-09-29     245       9.0     70  0.518825  36.317752
8   2006-09-16     258      11.0     68  0.486226  33.063381
9   2006-08-30     275       8.0     72  0.446667  32.160051
10  2006-02-11     475       5.0     65  0.164591  10.698423

With regard to the most efficient solution, that depends on how one wants to measure efficiency. Assuming that one wants to measure the time of execution, one way to go about it is with time.perf_counter().

If one measures the time of execution (in seconds) for all the options above, one gets the following

       method                   time
0    Option 1 0.00000110000837594271
1  Option 2.1 0.00000139995245262980
2  Option 2.2 0.00000369996996596456
3  Option 2.3 0.00000160001218318939
4    Option 3 0.00000110000837594271
5    Option 4 0.00000120000913739204
6    Option 5 0.00000140001066029072
7    Option 6 0.00000159995397552848
8    Option 7 0.00000150001142174006


However, this might change depending on the dataframe one uses, on the requirements (such as hardware), and more.
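To check timings on one's own data, something like the following sketch can be used. It swaps in the standard-library timeit module (which repeats each call and so smooths out noise) instead of calling time.perf_counter() directly, and it runs on synthetic data, so the numbers it prints say nothing about the table above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption): 100,000 rows with line_race drawn from 0..11.
rng = np.random.default_rng(0)
df = pd.DataFrame({"line_race": rng.integers(0, 12, size=100_000)})

candidates = {
    "boolean mask": lambda: df[df["line_race"] != 0],
    "query": lambda: df.query("line_race != 0"),
    "drop": lambda: df.drop(df[df["line_race"] == 0].index),
}

# Best of 3 repeats, 5 calls each; store seconds per call.
timings = {
    name: min(timeit.repeat(fn, number=5, repeat=3)) / 5
    for name, fn in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name:>12}: {seconds * 1e3:.3f} ms")
```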


Notes:

It doesn't make much difference for a simple example like this, but for complicated logic, I prefer to use drop() when deleting rows because it is more straightforward than using inverse logic. For example, delete rows where A=1 AND (B=2 OR C=3).

Here's a scalable syntax that is easy to understand and can handle complicated logic:

df.drop( df.query(" `line_race` == 0 ").index)
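As an illustration of the A=1 AND (B=2 OR C=3) case, here is a sketch with invented columns A, B and C. The condition is written exactly as stated and the matching rows are dropped, so no negation has to be worked out by hand. It assumes the index labels are unique, since drop() removes every row carrying a matching label.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 1],
    "B": [2, 0, 0, 2, 2],
    "C": [0, 3, 0, 3, 3],
})

# Select the rows to delete with the condition exactly as stated ...
to_delete = df.query("A == 1 and (B == 2 or C == 3)").index

# ... then drop them, instead of writing the negated condition by hand.
result = df.drop(to_delete)
print(result.index.tolist())  # [2, 3]
```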

You can try using this:

df.drop(df[df.line_race == 0].index, inplace = True)
