
How to drop rows of Pandas DataFrame whose value in a certain column is NaN

I have this DataFrame and want only the records whose EPS column is not NaN:

>>> df
                 STK_ID  EPS  cash
STK_ID RPT_Date                   
601166 20111231  601166  NaN   NaN
600036 20111231  600036  NaN    12
600016 20111231  600016  4.3   NaN
601009 20111231  601009  NaN   NaN
601939 20111231  601939  2.5   NaN
000001 20111231  000001  NaN   NaN

...i.e. something like df.drop(....) to get this resulting DataFrame:

                  STK_ID  EPS  cash
STK_ID RPT_Date                   
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

How do I do that?

Don't drop; just take the rows where EPS is not NA:

df = df[df['EPS'].notna()]
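
As a minimal, self-contained sketch of the mask approach: the snippet below rebuilds a frame shaped like the one in the question and keeps only the rows whose EPS is present. The EPS and cash values are copied from the question; the way the frame is constructed, the omission of the duplicate STK_ID column, and the name `result` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (illustrative construction).
stk_ids = ["601166", "600036", "600016", "601009", "601939", "000001"]
index = pd.MultiIndex.from_arrays(
    [stk_ids, [20111231] * len(stk_ids)], names=["STK_ID", "RPT_Date"]
)
df = pd.DataFrame(
    {
        "EPS": [np.nan, np.nan, 4.3, np.nan, 2.5, np.nan],
        "cash": [np.nan, 12, np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)

# notna() returns a boolean Series; indexing with it keeps only the True rows.
result = df[df["EPS"].notna()]
print(result)
```

The NaN values in the cash column do not matter here, because the mask is built from EPS alone.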

This question is already resolved, but...

...also consider the solution suggested by Wouter in his original comment. The ability to handle missing data, including dropna(), is built into pandas explicitly. Aside from potentially improved performance over doing it manually, these functions also come with a variety of options which may be useful.

In [24]: df = pd.DataFrame(np.random.randn(10,3))

In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;

In [26]: df
Out[26]:
          0         1         2
0       NaN       NaN       NaN
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [27]: df.dropna()     #drop all rows that have any NaN values
Out[27]:
          0         1         2
1  2.677677 -1.466923 -0.750366
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295

In [28]: df.dropna(how='all')     #drop only if ALL columns are NaN
Out[28]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [29]: df.dropna(thresh=2)   #Drop row if it does not have at least two values that are **not** NaN
Out[29]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

In [30]: df.dropna(subset=[1])   #Drop only if NaN in specific column (as asked in the question)
Out[30]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

There are also other options (see docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html), including dropping columns instead of rows.

Pretty handy!

I know this has already been answered, but just for the sake of a purely pandas solution to this specific question, as opposed to the general description from Aman (which was wonderful), and in case anyone else happens upon this:

import pandas as pd
df = df[pd.notnull(df['EPS'])]

You can use this:

df.dropna(subset=['EPS'], how='all', inplace=True)

How to drop rows of Pandas DataFrame whose value in a certain column is NaN

This is an old question which has been beaten to death, but I do believe there is some more useful information to be surfaced on this thread. Read on if you're looking for the answer to any of the following questions:

  • Can I drop rows if any of its values have NaNs? What about if all of them are NaN?
  • Can I only look at NaNs in specific columns when dropping rows?
  • Can I drop rows with a specific count of NaN values?
  • How do I drop columns instead of rows?
  • I tried all of the options above but my DataFrame just won't update!

DataFrame.dropna: Usage, and Examples

It's already been said that df.dropna is the canonical method to drop NaNs from DataFrames, but there's nothing like a few visual cues to help along the way.

# Setup
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan, 2, 3, 4],
    'B': [np.nan, np.nan, 2, 3],
    'C': [np.nan]*3 + [3]})

df                      
     A    B    C
0  NaN  NaN  NaN
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

Below is a detail of the most important arguments and how they work, arranged in an FAQ format.


Can I drop rows if any of its values have NaNs? What about if all of them are NaN?

This is where the how=... argument comes in handy. It can be one of

  • 'any' (default) - drops rows if at least one column has NaN
  • 'all' - drops rows only if all of its columns have NaNs

# Removes all but the last row, since it is the only one with no NaNs
df.dropna()

     A    B    C
3  4.0  3.0  3.0

# Removes the first row only
df.dropna(how='all')

     A    B    C
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

Note
If you just want to see which rows are null (in other words, if you want a boolean mask of rows), use isna:

df.isna()

       A      B      C
0   True   True   True
1  False   True   True
2  False  False   True
3  False  False  False

df.isna().any(axis=1)

0     True
1     True
2     True
3    False
dtype: bool

To get the inversion of this result, use notna instead.
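
To make the link between these masks and dropna concrete, here is a small sketch (the variable names `complete_rows` and `non_empty_rows` are made up for illustration). It filters with notna masks and checks that the result matches the two `how` settings on the setup frame above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 2, 3, 4],
    "B": [np.nan, np.nan, 2, 3],
    "C": [np.nan] * 3 + [3],
})

# Rows where every column is present: the same rows df.dropna() keeps.
complete_rows = df[df.notna().all(axis=1)]

# Rows where at least one column is present: the same rows df.dropna(how="all") keeps.
non_empty_rows = df[df.notna().any(axis=1)]

print(complete_rows.equals(df.dropna()))
print(non_empty_rows.equals(df.dropna(how="all")))
```

Building the mask yourself is handy when you want to inspect or count the affected rows before actually dropping them.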


Can I only look at NaNs in specific columns when dropping rows?

This is a use case for the subset=[...] argument.

Specify a list of columns (or index labels, with axis=1) to tell pandas you only want to look at these columns (or rows, with axis=1) when dropping rows (or columns, with axis=1).

# Drop all rows with NaNs in A
df.dropna(subset=['A'])

     A    B    C
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

# Drop all rows with NaNs in A OR B
df.dropna(subset=['A', 'B'])

     A    B    C
2  3.0  2.0  NaN
3  4.0  3.0  3.0
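
subset can also be combined with how. As a sketch on the same setup frame (the name `both_missing_dropped` is hypothetical), the call below drops a row only when A and B are both NaN, rather than when either one is:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 2, 3, 4],
    "B": [np.nan, np.nan, 2, 3],
    "C": [np.nan] * 3 + [3],
})

# how="all" is evaluated only over the columns named in subset,
# so column C is ignored when deciding which rows to drop.
both_missing_dropped = df.dropna(subset=["A", "B"], how="all")
print(both_missing_dropped)
```

Row 1 survives here because A is present, even though B is NaN.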

Can I drop rows with a specific count of NaN values?

This is a use case for the thresh=... argument. Specify the minimum number of NON-NULL values as an integer.

df.dropna(thresh=1)  

     A    B    C
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

df.dropna(thresh=2)

     A    B    C
2  3.0  2.0  NaN
3  4.0  3.0  3.0

df.dropna(thresh=3)

     A    B    C
3  4.0  3.0  3.0

The thing to note here is that you need to specify how many NON-NULL values you want to keep, rather than how many NULL values you want to drop. This is a pain point for new users.

Luckily the fix is easy: if you have a count of NULL values, subtract it from the number of columns and add 1 to get the correct thresh argument for the function.

required_min_null_values_to_drop = 2 # drop rows with at least 2 NaN
df.dropna(thresh=df.shape[1] - required_min_null_values_to_drop + 1)

     A    B    C
2  3.0  2.0  NaN
3  4.0  3.0  3.0
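
The same conversion can be wrapped in a small helper. This is only a sketch: `drop_rows_with_at_least` is a hypothetical name, not part of pandas, and it is cross-checked here against an explicit NaN count per row.

```python
import numpy as np
import pandas as pd

def drop_rows_with_at_least(frame, n_nulls):
    """Drop rows that contain n_nulls or more NaN values (hypothetical helper)."""
    # A row is kept when it has at most n_nulls - 1 NaNs,
    # i.e. at least (number of columns - n_nulls + 1) non-null values.
    return frame.dropna(thresh=frame.shape[1] - n_nulls + 1)

df = pd.DataFrame({
    "A": [np.nan, 2, 3, 4],
    "B": [np.nan, np.nan, 2, 3],
    "C": [np.nan] * 3 + [3],
})

trimmed = drop_rows_with_at_least(df, 2)

# Cross-check with an explicit count of NaNs per row.
expected = df[df.isna().sum(axis=1) < 2]
print(trimmed.equals(expected))
```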

How do I drop columns instead of rows?

Use the axis=... argument. It can be axis=0 or axis=1.

It tells the function whether you want to drop rows (axis=0) or drop columns (axis=1).

df.dropna()

     A    B    C
3  4.0  3.0  3.0

# All columns have at least one NaN, so the result is empty.
df.dropna(axis=1)

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

# Here's a different example requiring the column to have all NaN rows
# to be dropped. In this case no columns satisfy the condition.
df.dropna(axis=1, how='all')

     A    B    C
0  NaN  NaN  NaN
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

# Here's a different example requiring a column to have at least 2 NON-NULL
# values. Column C has fewer than 2 NON-NULL values, so it is dropped.
df.dropna(axis=1, thresh=2)

     A    B
0  NaN  NaN
1  2.0  NaN
2  3.0  2.0
3  4.0  3.0

I tried all of the options above but my DataFrame just won't update!

dropna, like most other functions in the pandas API, returns a new DataFrame (a copy of the original with changes) as the result, so you should assign it back if you want to see changes.

df.dropna(...) # wrong
df.dropna(..., inplace=True) # right, but not recommended
df = df.dropna(...) # right
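
A quick way to see this for yourself, as a sketch on the setup frame (the names `rows_before` and `rows_after` are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 2, 3, 4],
    "B": [np.nan, np.nan, 2, 3],
    "C": [np.nan] * 3 + [3],
})

df.dropna()              # the returned frame is discarded, df is untouched
rows_before = len(df)

df = df.dropna()         # assigning the result back is what changes df
rows_after = len(df)

print(rows_before, rows_after)
```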

Reference

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

 DataFrame.dropna( self, axis=0, how='any', thresh=None, subset=None, inplace=False)


Simplest of all solutions:

filtered_df = df[df['EPS'].notnull()]

The above solution is way better than using np.isfinite()

Simple and easy way

df.dropna(subset=['EPS'],inplace=True)

source: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

You could use the notnull method, the inverse of isnull, or numpy.isnan:

In [332]: df[df.EPS.notnull()]
Out[332]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [334]: df[~df.EPS.isnull()]
Out[334]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [347]: df[~np.isnan(df.EPS)]
Out[347]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN

Yet another solution, which uses the fact that np.nan != np.nan:

In [149]: df.query("EPS == EPS")
Out[149]:
                 STK_ID  EPS  cash
STK_ID RPT_Date
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN
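
Why this works, as a small sketch on made-up data (an ordinary float column is assumed): NaN never compares equal to itself, so `EPS == EPS` is False exactly on the NaN rows, which makes it equivalent to a notna mask.

```python
import numpy as np
import pandas as pd

# Made-up sample data; EPS is a plain float64 column.
df = pd.DataFrame({"EPS": [np.nan, 4.3, 2.5], "cash": [12.0, np.nan, np.nan]})

# Element-wise self-comparison: False where the value is NaN.
self_equal = df["EPS"] == df["EPS"]

via_query = df.query("EPS == EPS")
via_mask = df[df["EPS"].notna()]

print(self_equal.tolist())
print(via_query.equals(via_mask))
```

The notna form states the intent more directly; the self-comparison trick is mostly useful inside query strings.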

You can use dropna.

Example

Drop the rows where at least one element is missing.

df=df.dropna()

Define in which columns to look for missing values.

df=df.dropna(subset=['column1', 'column2'])

See this for more examples

Note: passing a tuple or list to the axis parameter of dropna is deprecated since version 0.23.0.

Or (check for NaNs with isnull, then use ~ to invert the mask so that only rows without NaNs remain):

df=df[~df['EPS'].isnull()]

Now:

print(df)

Output:

                 STK_ID  EPS  cash
STK_ID RPT_Date
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

This answer is much simpler than all of the answers above :)

df=df[df['EPS'].notnull()]

Another version:

df[~df['EPS'].isna()]

It may be added that '&' can be used to add additional conditions, e.g.

df = df[(df.EPS > 2.0) & (df.EPS <4.0)]

Notice that each condition needs its own parentheses, because Python's & binds more tightly than comparison operators such as > and <.
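
A short sketch of combining conditions, on made-up data (the names `in_range` and `present_and_large` are hypothetical). Without the parentheses, `df.EPS > 2.0 & df.EPS < 4.0` would be parsed as `df.EPS > (2.0 & df.EPS) < 4.0`, which is not the intended comparison.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"EPS": [np.nan, 4.3, 2.5, 1.0]})

# Each comparison is wrapped in parentheses before being combined with &.
in_range = df[(df["EPS"] > 2.0) & (df["EPS"] < 4.0)]

# A notna() check can be combined with other conditions in the same way.
present_and_large = df[df["EPS"].notna() & (df["EPS"] > 2.0)]

print(in_range)
print(present_and_large)
```

Comparisons against NaN evaluate to False, so the NaN row is excluded by the range filter even without an explicit notna check.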

In datasets with a large number of columns, it is even better to see how many columns contain null values and how many don't.

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

For example, my dataframe contained 82 columns, of which 19 contained at least one null value.

Further, you can also automatically remove columns and rows depending on which has more null values.
Here is the code which does this intelligently:

df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
df = df.dropna(axis = 0).reset_index(drop=True)

Note: The above code removes all of your null values. If you want to keep null values, process them beforehand.

The following method worked for me. It may help if none of the above methods work:

df[df['column_name'].str.len() >= 1]

The basic idea is that you pick up the record only if the string length is at least 1. This is especially useful if you are dealing with string data.
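
One thing worth knowing about this approach, shown as a sketch on made-up data with a hypothetical column name: it removes empty strings as well as NaN, which may or may not be what you want.

```python
import numpy as np
import pandas as pd

# Made-up string column containing an empty string and a NaN.
df = pd.DataFrame({"column_name": ["abc", "", np.nan, "x"]})

# str.len() yields a missing value for the NaN entry, and comparing that with >= 1
# gives False, so the NaN row is dropped together with the empty string.
kept = df[df["column_name"].str.len() >= 1]
print(kept)

# For comparison: notna() keeps the empty string.
kept_notna = df[df["column_name"].notna()]
print(kept_notna)
```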

Best!

You can also use notna inside query:

In [4]: df.query('EPS.notna().values')
Out[4]: 
                 STK_ID.1  EPS  cash
STK_ID RPT_Date                     
600016 20111231    600016  4.3   NaN
601939 20111231    601939  2.5   NaN

For some reason none of the previously submitted answers worked for me. This basic solution did:

df = df[df.EPS >= 0]

Though of course that will drop rows with negative numbers, too. If you want to keep those, combine both comparisons in a single mask rather than applying them one after the other (which would leave only the rows where EPS is exactly 0):

df = df[(df.EPS >= 0) | (df.EPS <= 0)]

One of the solutions can be

df = df[df.isnull().sum(axis=1) <= Cutoff_value]

Another way can be

df= df.dropna(thresh=(df.shape[1] - Cutoff_value))
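
The two forms select the same rows. A small sketch on made-up data, where `cutoff_value` is assumed to mean the maximum number of NaNs tolerated per row:

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({
    "A": [np.nan, 2, 3, 4],
    "B": [np.nan, np.nan, 2, 3],
    "C": [np.nan] * 3 + [3],
})

cutoff_value = 1  # assumption: keep rows with at most this many NaNs

# Count NaNs per row and keep rows at or below the cutoff.
by_mask = df[df.isnull().sum(axis=1) <= cutoff_value]

# Same condition expressed as a minimum number of non-null values.
by_thresh = df.dropna(thresh=df.shape[1] - cutoff_value)

print(by_mask.equals(by_thresh))
```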

I hope these are useful.

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

The output would be

       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

The desired output

df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25

You can try:

df['EPS'].dropna()
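
Note that selecting the column first returns only that column as a Series, not the filtered DataFrame. A small sketch of the difference (the sample frame and variable names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"EPS": [np.nan, 4.3, 2.5], "cash": [12.0, np.nan, np.nan]})

eps_only = df["EPS"].dropna()          # Series holding just the non-NaN EPS values
full_rows = df.dropna(subset=["EPS"])  # DataFrame that keeps every column

print(type(eps_only).__name__, list(eps_only.index))
print(type(full_rows).__name__, list(full_rows.columns))
```

If you need the other columns too, the subset form is the one to use.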
