Deleting DataFrame row in Pandas based on column value
I have the following DataFrame:
daysago line_race rating rw wrating
line_date
2007-03-31 62 11 56 1.000000 56.000000
2007-03-10 83 11 67 1.000000 67.000000
2007-02-10 111 9 66 1.000000 66.000000
2007-01-13 139 10 83 0.880678 73.096278
2006-12-23 160 10 88 0.793033 69.786942
2006-11-09 204 9 52 0.636655 33.106077
2006-10-22 222 8 66 0.581946 38.408408
2006-09-29 245 9 70 0.518825 36.317752
2006-09-16 258 11 68 0.486226 33.063381
2006-08-30 275 8 72 0.446667 32.160051
2006-02-11 475 5 65 0.164591 10.698423
2006-01-13 504 0 70 0.142409 9.968634
2006-01-02 515 0 64 0.134800 8.627219
2005-12-06 542 0 70 0.117803 8.246238
2005-11-29 549 0 70 0.113758 7.963072
2005-11-22 556 0 -1 0.109852 -0.109852
2005-11-01 577 0 -1 0.098919 -0.098919
2005-10-20 589 0 -1 0.093168 -0.093168
2005-09-27 612 0 -1 0.083063 -0.083063
2005-09-07 632 0 -1 0.075171 -0.075171
2005-06-12 719 0 69 0.048690 3.359623
2005-05-29 733 0 -1 0.045404 -0.045404
2005-05-02 760 0 -1 0.039679 -0.039679
2005-04-02 790 0 -1 0.034160 -0.034160
2005-03-13 810 0 -1 0.030915 -0.030915
2004-11-09 934 0 -1 0.016647 -0.016647
I need to remove the rows where line_race is equal to 0. What's the most efficient way to do this?
If I'm understanding correctly, it should be as simple as:
df = df[df.line_race != 0]
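For anyone who wants to try this without the full data, here is a minimal sketch on a made-up five-row version of the frame (the values and the names mask and filtered are just for illustration). The comparison builds a boolean Series, and indexing with it keeps only the rows where it is True.

```python
import pandas as pd

# Hypothetical miniature of the question's DataFrame
df = pd.DataFrame(
    {"line_race": [11, 10, 0, 0, 9], "rating": [56, 83, 70, -1, 66]},
    index=pd.to_datetime(
        ["2007-03-31", "2007-01-13", "2006-01-13", "2005-11-22", "2007-02-10"]
    ),
)

mask = df["line_race"] != 0   # boolean Series, True where the row should be kept
filtered = df[mask]           # a new DataFrame; df itself is not modified

print(filtered)
```

Assigning the result back (df = df[mask]) is what actually "deletes" the rows from the variable you keep working with.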
But for any future passersby you could mention that df = df[df.line_race != 0] doesn't do anything when trying to filter for None/missing values.
Does work:
df = df[df.line_race != 0]
Doesn't do anything:
df = df[df.line_race != None]
Does work:
df = df[df.line_race.notnull()]
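A small sketch of the difference, assuming a made-up column with two missing values. Comparing against None with != comes back True for every row, so nothing is filtered, while notnull() gives the mask you actually want.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the five values are missing
df = pd.DataFrame({"line_race": [11.0, 0.0, np.nan, 9.0, np.nan]})

# Elementwise "!= None" is True everywhere, so every row survives
kept_by_ne_none = df[df["line_race"] != None]  # noqa: E711

# notnull() is True only for rows that hold a real value
kept_by_notnull = df[df["line_race"].notnull()]

print(len(df), len(kept_by_ne_none), len(kept_by_notnull))
```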
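Just to add another solution, particularly useful if you are using the new pandas accessors: the other solutions replace the original DataFrame and lose the accessors.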
df.drop(df.loc[df['line_race']==0].index, inplace=True)
If you want to delete rows based on multiple values of the column, you could use:
df[(df.line_race != 0) & (df.line_race != 10)]
To drop all rows with values 0 and 10 for line_race.
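When the list of unwanted values grows, chaining != conditions gets long. A sketch of the same filter written with isin() and ~, on made-up data, which should give an identical result:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"line_race": [11, 10, 0, 9, 0, 10, 5]})

# Chained conditions, one per unwanted value
chained = df[(df.line_race != 0) & (df.line_race != 10)]

# Same filter: keep rows whose value is NOT in the list
with_isin = df[~df.line_race.isin([0, 10])]

print(with_isin)
```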
The best way to do this is with boolean masking:
In [56]: df
Out[56]:
line_date daysago line_race rating raw wrating
0 2007-03-31 62 11 56 1.000 56.000
1 2007-03-10 83 11 67 1.000 67.000
2 2007-02-10 111 9 66 1.000 66.000
3 2007-01-13 139 10 83 0.881 73.096
4 2006-12-23 160 10 88 0.793 69.787
5 2006-11-09 204 9 52 0.637 33.106
6 2006-10-22 222 8 66 0.582 38.408
7 2006-09-29 245 9 70 0.519 36.318
8 2006-09-16 258 11 68 0.486 33.063
9 2006-08-30 275 8 72 0.447 32.160
10 2006-02-11 475 5 65 0.165 10.698
11 2006-01-13 504 0 70 0.142 9.969
12 2006-01-02 515 0 64 0.135 8.627
13 2005-12-06 542 0 70 0.118 8.246
14 2005-11-29 549 0 70 0.114 7.963
15 2005-11-22 556 0 -1 0.110 -0.110
16 2005-11-01 577 0 -1 0.099 -0.099
17 2005-10-20 589 0 -1 0.093 -0.093
18 2005-09-27 612 0 -1 0.083 -0.083
19 2005-09-07 632 0 -1 0.075 -0.075
20 2005-06-12 719 0 69 0.049 3.360
21 2005-05-29 733 0 -1 0.045 -0.045
22 2005-05-02 760 0 -1 0.040 -0.040
23 2005-04-02 790 0 -1 0.034 -0.034
24 2005-03-13 810 0 -1 0.031 -0.031
25 2004-11-09 934 0 -1 0.017 -0.017
In [57]: df[df.line_race != 0]
Out[57]:
line_date daysago line_race rating raw wrating
0 2007-03-31 62 11 56 1.000 56.000
1 2007-03-10 83 11 67 1.000 67.000
2 2007-02-10 111 9 66 1.000 66.000
3 2007-01-13 139 10 83 0.881 73.096
4 2006-12-23 160 10 88 0.793 69.787
5 2006-11-09 204 9 52 0.637 33.106
6 2006-10-22 222 8 66 0.582 38.408
7 2006-09-29 245 9 70 0.519 36.318
8 2006-09-16 258 11 68 0.486 33.063
9 2006-08-30 275 8 72 0.447 32.160
10 2006-02-11 475 5 65 0.165 10.698
UPDATE: Now that pandas 0.13 is out, another way to do this is df.query('line_race != 0').
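A short sketch of query() on made-up data. Inside the query string, a local Python variable can be referenced with an @ prefix, and conditions can be combined with and/or. The variable name unwanted is just an example.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"line_race": [11, 0, 9, 0], "rating": [56, 70, 66, -1]})

unwanted = 0  # example variable, referenced in the query string with @
result = df.query("line_race != @unwanted")

# Several conditions in one expression
combined = df.query("line_race != 0 and rating > 60")

print(result)
print(combined)
```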
Though the previous answers are almost the same as what I am going to do, using the index method does not require another indexing method, .loc(). It can be done in a similar but more concise manner as
df.drop(df.index[df['line_race'] == 0], inplace = True)
I used the following to filter out given values in a col:
def filter_rows_by_values(df, col, values):
    return df[~df[col].isin(values)]
Example:
In a DataFrame I want to remove rows which have values "b" and "c" in column "str"
df = pd.DataFrame({"str": ["a","a","a","a","b","b","c"], "other": [1,2,3,4,5,6,7]})
df
str other
0 a 1
1 a 2
2 a 3
3 a 4
4 b 5
5 b 6
6 c 7
filter_rows_by_values(df, "str", ["b","c"])
str other
0 a 1
1 a 2
2 a 3
3 a 4
The given answer is correct. Nonetheless, as someone above said, you can use df.query('line_race != 0'), which depending on your problem can be much faster. Highly recommend.
An efficient and popular way is to use the eq() method:
df[~df.line_race.eq(0)]
Another way of doing it. It may not be the most efficient way, as the code looks a bit more complex than the code in the other answers, but it is still an alternative way of doing the same thing.
df = df.drop(df[df['line_race']==0].index)
Adding one more way to do this.
df = df.query("line_race!=0")
I ran this code and it works. You can try it yourself.
data = pd.read_excel('file.xlsx')
If the column name contains a special character or a space, you can write it in '' like in the given code:
data = data[data['expire/t'].notnull()]
print(data)
If it is just a single string column name without any space or special character, you can access it directly as an attribute.
data = data[data.expire != 0]
print(data)
Just adding another way, for a DataFrame, applied across all columns:
for column in df.columns:
    df = df[df[column]!=0]
Example:
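The loop above can also be written without iterating over columns. A minimal sketch on made-up data: compare the whole frame at once, then keep the rows where every column passed.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"a": [1, 0, 3, 4], "b": [5, 6, 0, 8]})

# (df != 0) is a boolean frame; all(axis=1) is True for rows with no zero anywhere
no_zeros = df[(df != 0).all(axis=1)]

print(no_zeros)
```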
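import numpy as np

def z_score(data, count):
    # Drop every row holding a value more than `threshold` standard deviations from its column mean
    threshold = 3
    for column in data.columns:
        mean = np.mean(data[column])
        std = np.std(data[column])
        for i in data[column]:
            zscore = (i - mean) / std
            if np.abs(zscore) > threshold:
                count = count + 1
                data = data[data[column] != i]
    return data, count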
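For larger frames the nested Python loops can get slow. Below is a vectorized sketch of the same idea, not a drop-in replacement: it computes every z-score up front on the unfiltered data and counts dropped rows rather than individual outlier values. The function name drop_outliers and the sample data are made up, and it assumes all columns are numeric with non-zero spread.

```python
import pandas as pd

def drop_outliers(data, threshold=3):
    """Return (filtered frame, number of rows dropped)."""
    # ddof=0 matches np.std's default (population standard deviation)
    z = (data - data.mean()) / data.std(ddof=0)
    # keep a row only if every column is within the threshold
    keep = (z.abs() <= threshold).all(axis=1)
    return data[keep], int((~keep).sum())

# Hypothetical data: one extreme value in column "a"
data = pd.DataFrame({"a": [10] * 20 + [1000], "b": list(range(21))})
cleaned, dropped = drop_outliers(data)

print(len(cleaned), dropped)
```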
Just in case you need to delete the row but the value can be in different columns. In my case I was using percentages, so I wanted to delete the rows that have a value of 1 in any column, since that means 100%.
for x in df:
    df.drop(df.loc[df[x]==1].index, inplace=True)
This is not optimal if your df has too many columns.
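For wide frames, a possible alternative is to test all columns in one go. A minimal sketch with made-up percentage data: (df == 1).any(axis=1) flags the rows that contain a 1 anywhere, and ~ inverts the flag.

```python
import pandas as pd

# Hypothetical percentage data for illustration
df = pd.DataFrame(
    {"x": [0.2, 1.0, 0.5], "y": [0.8, 0.0, 1.0], "z": [0.1, 0.3, 0.4]}
)

has_one = (df == 1).any(axis=1)   # True for rows with a 1 in any column
result = df[~has_one]             # keep only the other rows

print(result)
```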
So many options have been provided (or maybe I didn't pay much attention, sorry if that's the case), but no one mentioned this: we can use the ~ notation in pandas, which gives us the inverse of the condition.
df = df[~(df["line_race"] == 0)]
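One thing to watch with this notation: ~ binds more tightly than ==, so the comparison needs its own parentheses before it is inverted. A small sketch on a made-up integer Series shows how the two forms differ.

```python
import pandas as pd

# Hypothetical integer data for illustration
s = pd.Series([0, -1, 5], name="line_race")

# Parsed as (~s) == 0: bitwise NOT of the integers first, then the comparison
unparenthesized = ~s == 0

# Inverts the boolean result of the comparison, which is what is wanted here
parenthesized = ~(s == 0)

print(unparenthesized.tolist())
print(parenthesized.tolist())
```

On integers, ~x is -x - 1, so the unparenthesized form ends up selecting rows equal to -1 instead of rows different from 0.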
There are various ways to achieve that. Below are several options that one can use, depending on the specifics of one's use case.
Assume that OP's dataframe is stored in the variable df.
Option 1
For OP's case, considering that the only column with values 0 is line_race, the following will do the work
df_new = df[df != 0].dropna()
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
However, as that is not always the case, I would recommend checking the following options, where one specifies the column name.
Option 2
tshauck's approach ends up being better than Option 1, because one is able to specify the column. There are, however, additional variations depending on how one wants to refer to the column:
For example, using the position in the dataframe
df_new = df[df[df.columns[2]] != 0]
Or by explicitly indicating the column as follows
df_new = df[df['line_race'] != 0]
One can also follow the same logic but using a custom lambda function, such as
df_new = df[df.apply(lambda x: x['line_race'] != 0, axis=1)]
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 3
Using pandas.Series.map and a custom lambda function
df_new = df[df['line_race'].map(lambda x: x != 0)]
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 4
Using pandas.DataFrame.drop as follows
df_new = df.drop(df[df['line_race'] == 0].index)
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 5
Using pandas.DataFrame.query as follows
df_new = df.query('line_race != 0')
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 6
Using pandas.DataFrame.drop and pandas.DataFrame.query as follows
df_new = df.drop(df.query('line_race == 0').index)
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 7
If one doesn't have strong opinions on the output, one can use a vectorized approach with numpy.select
df_new = np.select([df != 0], [df], default=np.nan)
[Out]:
[['2007-03-31' 62 11.0 56 1.0 56.0]
['2007-03-10' 83 11.0 67 1.0 67.0]
['2007-02-10' 111 9.0 66 1.0 66.0]
['2007-01-13' 139 10.0 83 0.880678 73.096278]
['2006-12-23' 160 10.0 88 0.793033 69.786942]
['2006-11-09' 204 9.0 52 0.636655 33.106077]
['2006-10-22' 222 8.0 66 0.581946 38.408408]
['2006-09-29' 245 9.0 70 0.518825 36.317752]
['2006-09-16' 258 11.0 68 0.486226 33.063381]
['2006-08-30' 275 8.0 72 0.446667 32.160051]
['2006-02-11' 475 5.0 65 0.164591 10.698423]]
This can also be converted to a dataframe with
df_new = pd.DataFrame(df_new, columns=df.columns)
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.0 56.0
1 2007-03-10 83 11.0 67 1.0 67.0
2 2007-02-10 111 9.0 66 1.0 66.0
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
With regard to the most efficient solution, that depends on how one wants to measure efficiency. Assuming that one wants to measure the time of execution, one way to go about it is with time.perf_counter().
If one measures the time of execution for all the options above, one gets the following
method time
0 Option 1 0.00000110000837594271
1 Option 2.1 0.00000139995245262980
2 Option 2.2 0.00000369996996596456
3 Option 2.3 0.00000160001218318939
4 Option 3 0.00000110000837594271
5 Option 4 0.00000120000913739204
6 Option 5 0.00000140001066029072
7 Option 6 0.00000159995397552848
8 Option 7 0.00000150001142174006
However, this might change depending on the dataframe one uses, on the requirements (such as hardware), and more.
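To reproduce a comparison like this on one's own data, a rough sketch with time.perf_counter() could look as follows. The helper best_time, the repeat count, and the synthetic frame are assumptions for illustration, and the numbers will differ from the table above.

```python
import time

import pandas as pd

# Hypothetical frame: 100,000 rows, 40% of them with line_race == 0
df = pd.DataFrame({"line_race": [11, 0, 9, 0, 10] * 20_000})

def best_time(func, repeats=5):
    """Run func several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

candidates = {
    "boolean mask": lambda: df[df["line_race"] != 0],
    "query": lambda: df.query("line_race != 0"),
    "drop": lambda: df.drop(df[df["line_race"] == 0].index),
}

results = {name: best_time(func) for name, func in candidates.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds:.6f} s")
```

Taking the minimum over a few repeats reduces noise from other processes; for anything more careful, the standard library's timeit module is probably the better tool.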
Notes:
There are various suggestions on using inplace=True. I would suggest reading this: https://stackoverflow.com/a/59242208/7109869
There are also some people with strong opinions on .apply(). I would suggest reading this: When should I (not) want to use pandas apply() in my code?
If one has missing values, one might want to consider pandas.DataFrame.dropna as well. Using Option 2, it would be something like
df = df[df['line_race'] != 0].dropna()
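Note that a bare dropna() removes a row when any column is missing. If only missing values in line_race matter, the subset parameter narrows the check. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing line_race, one missing rating
df = pd.DataFrame(
    {
        "line_race": [11.0, 0.0, np.nan, 9.0],
        "rating": [56.0, 70.0, 66.0, np.nan],
    }
)

# Drops rows with a missing value in ANY column
any_missing = df[df["line_race"] != 0].dropna()

# Drops rows only when line_race itself is missing
only_line_race = df[df["line_race"] != 0].dropna(subset=["line_race"])

print(any_missing)
print(only_line_race)
```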
There are additional ways to measure the time of execution, so I would recommend this thread: How do I get time of a Python program's execution?
It doesn't make much difference for a simple example like this, but for complicated logic I prefer to use drop() when deleting rows, because it is more straightforward than using inverse logic. For example, delete rows where A=1 AND (B=2 OR C=3).
Here's a scalable syntax that is easy to understand and can handle complicated logic:
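As a sketch of that A/B/C example (the column names and values are made up), the delete condition can be written exactly as stated and passed to drop(), while the keep-the-rest version has to be negated by hand:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame(
    {"A": [1, 1, 1, 2, 2], "B": [2, 0, 0, 2, 0], "C": [0, 3, 0, 3, 0]}
)

# State the rows to delete directly, then drop them by index label
to_delete = df.query("A == 1 and (B == 2 or C == 3)").index
result = df.drop(to_delete)

# The same filter as "rows to keep" needs the condition inverted
inverse = df[(df["A"] != 1) | ((df["B"] != 2) & (df["C"] != 3))]

print(result)
```

Both give the same rows, but the drop() version reads like the requirement. This assumes the index labels are unique, since drop() removes every row sharing a dropped label.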
df.drop( df.query(" `line_race` == 0 ").index)
You can try using this:
df.drop(df[df.line_race == 0].index, inplace = True)