Remove non-numeric rows in one column with pandas

There is a dataframe like the following, with one unclean column, 'id', which should be a numeric column:

id, name
1,  A
2,  B
3,  C
tt, D
4,  E
5,  F
de, G

Is there a concise way to remove the rows where id is not a numeric value (here, tt and de)

tt,D
de,G

to make the dataframe clean?

id, name
1,  A
2,  B
3,  C
4,  E
5,  F

Using pd.to_numeric:

In [1079]: df[pd.to_numeric(df['id'], errors='coerce').notnull()]
Out[1079]:
  id  name
0  1     A
1  2     B
2  3     C
4  4     E
5  5     F
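
The one-liner above can be unpacked into steps. This is a minimal sketch, assuming the id column was read as strings; the names converted and clean, and the optional final cast to int, are illustrative additions rather than part of the original answer.

```python
import pandas as pd

# Same data as in the question; every id is a string here.
df = pd.DataFrame({
    "id": ["1", "2", "3", "tt", "4", "5", "de"],
    "name": ["A", "B", "C", "D", "E", "F", "G"],
})

# Step 1: values that cannot be parsed become NaN instead of raising.
converted = pd.to_numeric(df["id"], errors="coerce")

# Step 2: keep only the rows where the conversion succeeded.
clean = df[converted.notnull()].copy()

# Optional step 3 (an addition): store the parsed numbers so the
# column holds integers rather than strings.
clean["id"] = converted[converted.notnull()].astype(int)

print(clean)
```

Note that the filter alone leaves id as strings; the last step is only needed if you want a numeric column afterwards.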

You could use the standard string method isnumeric and apply it to each value in your id column:

import pandas as pd
from io import StringIO

data = """
id,name
1,A
2,B
3,C
tt,D
4,E
5,F
de,G
"""

df = pd.read_csv(StringIO(data))

In [55]: df
Out[55]: 
   id name
0   1    A
1   2    B
2   3    C
3  tt    D
4   4    E
5   5    F
6  de    G

In [56]: df[df.id.apply(lambda x: x.isnumeric())]
Out[56]: 
  id name
0  1    A
1  2    B
2  3    C
4  4    E
5  5    F

Or, if you want to use id as the index, you could do:

In [61]: df[df.id.apply(lambda x: x.isnumeric())].set_index('id')
Out[61]: 
   name
id     
1     A
2     B
3     C
4     E
5     F

Edit: add timings

Although the pd.to_numeric approach does not use apply, it is almost twice as slow as applying str.isnumeric to a string column. I also added a variant using the pandas str.isnumeric accessor, which is less typing and still faster than pd.to_numeric. However, pd.to_numeric is more general because it works with any data type (not only strings).

In [3]: df_big = pd.concat([df]*10000)

In [4]: df_big.shape
Out[4]: (70000, 2)

In [5]: %timeit df_big[df_big.id.apply(lambda x: x.isnumeric())]
15.3 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit df_big[df_big.id.str.isnumeric()]
20.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_big[pd.to_numeric(df_big['id'], errors='coerce').notnull()]
29.9 ms ± 682 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
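
Timings like these vary with the machine and the pandas version, so it may be worth confirming first that the three variants select the same rows. A small sketch under that assumption; the variable names and ignore_index=True are additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "2", "3", "tt", "4", "5", "de"],
    "name": ["A", "B", "C", "D", "E", "F", "G"],
})
# ignore_index=True is an addition here, giving the big frame a unique index.
df_big = pd.concat([df] * 10000, ignore_index=True)

# The three filters that were timed above.
via_apply = df_big[df_big["id"].apply(lambda x: x.isnumeric())]
via_str = df_big[df_big["id"].str.isnumeric()]
via_to_numeric = df_big[pd.to_numeric(df_big["id"], errors="coerce").notnull()]

# On this data all three should keep exactly the same rows.
same_rows = via_apply.equals(via_str) and via_str.equals(via_to_numeric)
print(same_rows, len(via_apply))
```

The variants agree here only because every id is either a plain non-negative integer string or clearly non-numeric; with values such as '1.5' or '-1' they would differ.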

Given that df is your dataframe,

import numpy as np
df[df['id'].apply(lambda x: isinstance(x, (int, np.int64)))]

This passes each value in the id column to isinstance and checks whether it is an int. The result is a boolean Series, and indexing with it returns only the rows where the value is True. Note that this checks the type of the stored object, so it only helps when the column actually holds integers; strings such as '1' will not match.

If you also need to account for float values, another option is:

import numpy as np
df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]

Note that neither way works in place, so you need to assign the result back to your original df or create a new one:

df = df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
# or
new_df = df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
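
To illustrate when the type-based check applies, here is a sketch on a hypothetical object column that mixes real numbers with strings. The frame mixed and the helper is_real_number are made up for this example; the helper uses np.integer and np.floating to cover other NumPy widths, and excludes bool because bool is a subclass of int.

```python
import numpy as np
import pandas as pd

# Hypothetical column holding real Python/NumPy numbers mixed with other objects.
mixed = pd.DataFrame({
    "id": [1, np.int64(2), 3.5, "4", "tt", True],
    "name": ["A", "B", "C", "D", "E", "F"],
})

def is_real_number(x):
    # Accept Python and NumPy ints/floats; reject bool, which subclasses int.
    is_number = isinstance(x, (int, float, np.integer, np.floating))
    return is_number and not isinstance(x, bool)

numbers_only = mixed[mixed["id"].apply(is_real_number)]
print(numbers_only)
```

The string "4" is dropped here even though it looks numeric, which is why this approach suits columns of mixed objects rather than columns of strings.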

x.isnumeric() does not return True when x is a string that represents a float, such as '3.14'.

One way to keep only the values that can be converted to float:

def is_float(x):
    try:
        float(x)
    except (ValueError, TypeError):  # TypeError covers None and similar values
        return False
    return True

# Define is_float first, then use it as the filter
df[df['id'].apply(is_float)]
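
A short usage sketch of the float-conversion approach, with the helper repeated so the block runs on its own. The sample values are assumptions chosen to show what float() accepts; catching TypeError in addition to ValueError is an addition so that None does not raise.

```python
import pandas as pd

def is_float(x):
    """Return True if float(x) succeeds."""
    try:
        float(x)
    except (ValueError, TypeError):
        return False
    return True

# Hypothetical ids: integers, a decimal, a negative, scientific notation, and junk.
df = pd.DataFrame({
    "id": ["1", "2.5", "-3", "1e3", "tt", "de"],
    "name": ["A", "B", "C", "D", "E", "F"],
})

kept = df[df["id"].apply(is_float)]
print(kept["id"].tolist())

# Caveat: float() also parses these special strings.
print(is_float("nan"), is_float("inf"))
```

Because float() accepts 'nan' and 'inf', a column containing such strings may need an extra check if those should be rejected.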

How about this? The .str accessor is one of my favorites :)

import pandas as pd


df = pd.DataFrame(
    {
        'id':   {0: '1', 1: '2', 2: '3', 3: 'tt', 4: '4', 5: '5', 6: 'de'},
        'name': {0: 'A', 1: 'B', 2: 'C', 3: 'D',  4: 'E', 5: 'F', 6: 'G'}
    }
)

df_clean = df[df.id.str.isnumeric()]
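
Before relying on str.isnumeric, it may help to see which strings it accepts. The sketch below compares it with str.isdecimal on a few hand-picked sample values (assumptions for illustration): signs, decimal points, and surrounding spaces are rejected, while some Unicode numerals such as '½' pass isnumeric.

```python
import pandas as pd

# Hand-picked samples: plain integer, negative, decimal, vulgar fraction, padded.
samples = pd.Series(["12", "-1", "1.5", "½", " 7"])

report = pd.DataFrame({
    "value": samples,
    "isnumeric": samples.str.isnumeric(),
    "isdecimal": samples.str.isdecimal(),
})
print(report)
```

If the ids are expected to be plain non-negative integers, str.isdecimal is the stricter of the two checks.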

Supplement (2021-06-22)

If the id column contains troublesome values (such as float, None, or nan), you can forcibly cast them to the str data type using astype('str').

import numpy as np
import pandas as pd


df = pd.DataFrame(
    {
        'id':   {0: '1', 1: '2', 2: '3', 3: 3.14, 4: '4', 5: '5', 6: None, 7: np.nan},
        'name': {0: 'A', 1: 'B', 2: 'C', 3: 'D',  4: 'E', 5: 'F', 6: 'G',  7: 'H'}
    }
)

df_clean = df[df.id.astype('str').str.isnumeric()]

Primitive, but it works.

This is a dynamic way to do it. Note that it drops non-numeric columns rather than rows, and it only handles int64 and float64; if your dataframe has other numeric data types, make sure you add them to the if statement.

# Build a frame holding each column's dtype
col_types = df.dtypes.to_frame()
col_types.columns = ['dtype']

# Start with a list of zeros, one flag per column
drop_it = [0] * col_types.shape[0]
k = 0

# Flag a column with 1 if its dtype isn't numeric
# (if you have other numeric dtypes, add them to the if statement)
for t in col_types.dtype:
    if t != 'int64' and t != 'float64':
        drop_it[k] = 1
    k = k + 1

# Keep only the entries that were flagged for dropping
col_types['drop_it'] = drop_it
col_types = col_types.loc[col_types["drop_it"] == 1]

# Finally, drop the flagged columns from the dataframe
for col_to_drop in col_types.index.values.tolist():
    df = df.drop([col_to_drop], axis=1)
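
If the goal is indeed to keep only numeric columns, pandas offers DataFrame.select_dtypes, which may be a shorter alternative to the loop above. This is a sketch on a made-up frame; the column names are assumptions, and this is a different technique from the one in the answer.

```python
import pandas as pd

# Hypothetical frame with two numeric columns and one string column.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "score": [0.5, 1.5, 2.5],
    "name": ["A", "B", "C"],
})

# include="number" selects all numeric dtypes, not only int64 and float64.
numeric_only = df.select_dtypes(include="number")
print(numeric_only.columns.tolist())
```

This selects by column dtype, so it does not help with a single object column that mixes numeric and non-numeric values, as in the original question.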

Another alternative is to use the query method:

In [5]: df.query('id.str.isnumeric()')
Out[5]: 
  id  name
0  1     A
1  2     B
2  3     C
4  4     E
5  5     F
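
Depending on the pandas version and on whether numexpr is installed, the default query engine may not accept .str method calls. As a hedge, the same query can be run with engine='python', as in this self-contained sketch; the frame is rebuilt from the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "2", "3", "tt", "4", "5", "de"],
    "name": ["A", "B", "C", "D", "E", "F", "G"],
})

# engine="python" evaluates the expression with plain Python/pandas calls.
result = df.query("id.str.isnumeric()", engine="python")
print(result)
```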
