
Finding non-numeric rows in dataframe in pandas?

I have a large dataframe in pandas that, apart from the column used as the index, is supposed to have only numeric values:

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')

How can I find the row of the dataframe df that has a non-numeric value in it?

In this example it's the fourth row in the dataframe, which has the string 'bad' in the a column. How can this row be found programmatically?

You could use np.isreal to check the type of each element (applymap applies a function to each element in the DataFrame):

In [11]: df.applymap(np.isreal)
Out[11]:
          a     b
item
a      True  True
b      True  True
c      True  True
d     False  True
e      True  True

If all the values in a row are True, then they are all numeric:

In [12]: df.applymap(np.isreal).all(1)
Out[12]:
item
a        True
b        True
c        True
d       False
e        True
dtype: bool

So to get the sub-DataFrame of rogues (note: the negation, ~, of the above finds the rows which have at least one rogue non-numeric value):

In [13]: df[~df.applymap(np.isreal).all(1)]
Out[13]:
        a    b
item
d     bad  0.4

To find the location of the first offender you could use argmin:

In [14]: np.argmin(df.applymap(np.isreal).all(1))
Out[14]: 'd'
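Note that on more recent pandas versions, np.argmin on a Series returns a positional index rather than a label; if you want the label instead, idxmin on the boolean mask gives the same information. A small sketch, assuming the df and mask from above:

mask = df.applymap(np.isreal).all(1)   # True for fully numeric rows
mask.idxmin()                          # 'd': label of the first False entry
int(np.argmin(mask.to_numpy()))        # 3: positional index of the same row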

As @CTZhu points out, it may be slightly faster to check whether it's an instance of either int or float (there is some additional overhead with np.isreal):

df.applymap(lambda x: isinstance(x, (int, float)))
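If you want to check the speed difference yourself, here is a rough timing sketch using the standard timeit module (the exact numbers, and whether the difference matters at all, will depend on your pandas/NumPy versions and your data):

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5] * 1000,
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5] * 1000})

# time np.isreal against an isinstance check over every element
t_isreal = timeit.timeit(lambda: df.applymap(np.isreal), number=10)
t_isinstance = timeit.timeit(
    lambda: df.applymap(lambda x: isinstance(x, (int, float))), number=10)

print(f"np.isreal:  {t_isreal:.3f}s")
print(f"isinstance: {t_isinstance:.3f}s")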

There are already some great answers to this question; however, here is a snippet that I use regularly to drop rows if they have non-numeric values in some columns:

# Eliminate invalid data from dataframe (see Example below for more context)

num_df = (df.drop(data_columns, axis=1)
         .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))

num_df = num_df[num_df[data_columns].notnull().all(axis=1)]

The way this works is: we first drop all the data_columns from df, and then use a join to put them back in after passing them through pd.to_numeric (with errors='coerce', so that all non-numeric entries are converted to NaN). The result is saved to num_df.

On the second line we use a filter that keeps only rows where all values are not null.

Note that pd.to_numeric coerces to NaN everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example, '1.25' will be recognized as the numeric value 1.25.
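For instance, a quick sketch of the coercion behaviour:

pd.to_numeric(pd.Series(['1.25', 'bad', 3]), errors='coerce')
# 0    1.25
# 1     NaN
# 2    3.00
# dtype: float64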

Disclaimer: pd.to_numeric was introduced in pandas version 0.17.0.

Example:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"],
   ...:                    "a": [1,2,3,"bad",5],
   ...:                    "b":[0.1,0.2,0.3,0.4,0.5]})

In [3]: df
Out[3]: 
     a    b item
0    1  0.1    a
1    2  0.2    b
2    3  0.3    c
3  bad  0.4    d
4    5  0.5    e

In [4]: data_columns = ['a', 'b']

In [5]: num_df = (df
   ...:           .drop(data_columns, axis=1)
   ...:           .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))

In [6]: num_df
Out[6]: 
  item   a    b
0    a   1  0.1
1    b   2  0.2
2    c   3  0.3
3    d NaN  0.4
4    e   5  0.5

In [7]: num_df[num_df[data_columns].notnull().all(axis=1)]
Out[7]: 
  item  a    b
0    a  1  0.1
1    b  2  0.2
2    c  3  0.3
4    e  5  0.5

Sorry about the confusion, this should be the correct approach. Do you want to capture only 'bad', not things like 'good', or just any non-numerical values?

In[15]:
np.where(np.any(np.isnan(df.convert_objects(convert_numeric=True)), axis=1))
Out[15]:
(array([3]),)
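Note that convert_objects was deprecated and later removed from pandas. A sketch of an equivalent check with pd.to_numeric, assuming the same df and imports as in the question:

# coerce every column to numeric, then locate rows containing any NaN
np.where(df.apply(pd.to_numeric, errors='coerce').isna().any(axis=1))
# (array([3]),)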
# Original code
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')

Convert to numeric using errors='coerce', which fills bad values with NaN:

a = pd.to_numeric(df.a, errors='coerce')

Use isna to return a boolean index:

idx = a.isna()

Apply that index to the data frame:

df[idx]

Output

Returns the row with the bad data in it:

        a    b
item          
d     bad  0.4
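The three steps can also be combined into a single expression (a sketch using the same df):

df[pd.to_numeric(df.a, errors='coerce').isna()]   # returns the row labelled 'd'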

In case you are working with a column of string values, you can use the very useful function Series.str.isnumeric(), like:

a = pd.Series(['hi','hola','2.31','288','312','1312', '0,21', '0.23'])

What I do is copy that column to a new column, do a str.replace('.', '') and a str.replace(',', ''), and then select the numeric values:

a = a.str.replace('.', '', regex=False)  # regex=False: replace the literal '.'
a = a.str.replace(',', '', regex=False)  # regex=False: replace the literal ','
a.str.isnumeric()

Out[15]:
0    False
1    False
2     True
3     True
4     True
5     True
6     True
7     True
dtype: bool

Good luck all!

I'm thinking of something like this, just to give an idea: convert the column to string, since working with strings is easier. However, this does not work with strings containing numbers, like bad123. The ~ takes the complement of the selection.

df['a'] = df['a'].astype(str)
df[~df['a'].str.contains('0|1|2|3|4|5|6|7|8|9')]
df['a'] = df['a'].astype(object)

You can use '|'.join([str(i) for i in range(10)]) to generate '0|1|...|8|9'.
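A small sketch of that idea, applied to the df from the question:

digits = '|'.join([str(i) for i in range(10)])    # '0|1|2|3|4|5|6|7|8|9'
df[~df['a'].astype(str).str.contains(digits)]     # the row with 'bad' in column a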

Or use the np.isreal() function, just like in the most-voted answer:

df[~df['a'].apply(lambda x: np.isreal(x))]

Did you convert your data using .astype()?

All the great comments above should solve 99% of the cases, but if you are still in trouble, please also check whether you converted your data type.

Sometimes I force the data to float16 to save memory, using:

df[col] = df[col].astype(np.float16)

But this might silently break your code. So if you did any kind of data type transformation, double check for overflows. Disable the conversion and try again.
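For example, a minimal sketch of how the overflow shows up (float16 can only represent values up to about 65504):

import numpy as np
import pandas as pd

s = pd.Series([100000.0, 2.0, 3.0])
print(s.astype(np.float16))
# 100000.0 does not fit in float16, so it silently becomes inf:
# 0    inf
# 1    2.0
# 2    3.0
# dtype: float16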

It worked for me!
