Remove non-numeric rows in one column with pandas
There is a dataframe like the following, with one unclean column 'id' that should be numeric:
id, name
1, A
2, B
3, C
tt, D
4, E
5, F
de, G
Is there a concise way to remove these rows, since tt and de are not numeric values,
tt,D
de,G
to make the dataframe clean?
id, name
1, A
2, B
3, C
4, E
5, F
Using pd.to_numeric:
In [1079]: df[pd.to_numeric(df['id'], errors='coerce').notnull()]
Out[1079]:
id name
0 1 A
1 2 B
2 3 C
4 4 E
5 5 F
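The one-liner above can be unpacked into a self-contained sketch. The sample frame is rebuilt here from the question's data, and the final conversion of id to int64 is an extra step that the answer does not perform; it is shown only because the filtered column would otherwise still hold strings.

```python
import pandas as pd

# Sample frame mirroring the question; 'id' holds strings, as it would after read_csv.
df = pd.DataFrame({
    "id": ["1", "2", "3", "tt", "4", "5", "de"],
    "name": ["A", "B", "C", "D", "E", "F", "G"],
})

# Values that cannot be parsed become NaN instead of raising an error.
parsed = pd.to_numeric(df["id"], errors="coerce")

# Keep only the rows that parsed successfully.
clean = df[parsed.notna()].copy()

# Optional extra step (an addition, not part of the answer): store real integers.
clean["id"] = parsed[parsed.notna()].astype("int64")

print(clean)
```

Note that the original index labels (0, 1, 2, 4, 5) are preserved; call reset_index(drop=True) if a contiguous index is needed.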
You could use the standard string method isnumeric and apply it to each value in your id column:
import pandas as pd
from io import StringIO
data = """
id,name
1,A
2,B
3,C
tt,D
4,E
5,F
de,G
"""
df = pd.read_csv(StringIO(data))
In [55]: df
Out[55]:
id name
0 1 A
1 2 B
2 3 C
3 tt D
4 4 E
5 5 F
6 de G
In [56]: df[df.id.apply(lambda x: x.isnumeric())]
Out[56]:
id name
0 1 A
1 2 B
2 3 C
4 4 E
5 5 F
Or, if you want to use id as the index, you could do:
In [61]: df[df.id.apply(lambda x: x.isnumeric())].set_index('id')
Out[61]:
name
id
1 A
2 B
3 C
4 E
5 F
Although pd.to_numeric does not use the apply method, it is almost two times slower than applying str.isnumeric to str columns (timings below). I have also added an option using the pandas str.isnumeric accessor method, which is less typing and still faster than pd.to_numeric. However, pd.to_numeric is more general because it works with any data type, not only strings.
In [3]: df_big = pd.concat([df]*10000)
In [4]: df_big.shape
Out[4]: (70000, 2)
In [5]: %timeit df_big[df_big.id.apply(lambda x: x.isnumeric())]
15.3 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit df_big[df_big.id.str.isnumeric()]
20.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit df_big[pd.to_numeric(df_big['id'], errors='coerce').notnull()]
29.9 ms ± 682 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
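Speed is not the only difference between these approaches. As a small sketch on a made-up Series (the values below are not from the question), str.isnumeric only accepts strings made up entirely of numeric characters, so signed and decimal numbers are rejected, while pd.to_numeric parses them:

```python
import pandas as pd

# Hypothetical values chosen to show where the two checks disagree.
s = pd.Series(["1", "-2", "3.5", "tt", "4"])

# True only for strings consisting entirely of numeric characters.
by_isnumeric = s.str.isnumeric()

# True for anything that parses as a number, including signs and decimals.
by_to_numeric = pd.to_numeric(s, errors="coerce").notna()

print(by_isnumeric.tolist())   # [True, False, False, False, True]
print(by_to_numeric.tolist())  # [True, True, True, False, True]
```

For a column of non-negative integer ids, as in the question, both checks select the same rows.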
Given that df is your dataframe:
import numpy as np
df[df['id'].apply(lambda x: isinstance(x, (int, np.int64)))]
This passes each value in the id column to the isinstance function and checks whether it is an int. The result is a boolean Series, and indexing df with it returns only the rows where the value is True.
If you also need to account for float values, another option is:
import numpy as np
df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
Note that neither approach works in place, so you will need to reassign the result to your original df or create a new one:
df = df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
# or
new_df = df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
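One caveat is worth illustrating: a type check only helps when the column holds real numeric objects. The sketch below uses a hypothetical mixed column, and it swaps in np.integer (the parent class of NumPy's integer types) for the answer's np.int64 so that other integer widths also pass. When every value is a string, as is typical after pd.read_csv on the question's data, the same filter keeps nothing.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed column: real integers alongside stray strings.
mixed = pd.DataFrame({"id": [1, 2, "tt", 4, "de"], "name": list("ABDEG")})

# The same data with every id stored as text, as it would arrive from a CSV file.
as_text = mixed.assign(id=mixed["id"].astype(str))

def is_int(x):
    # np.integer is an assumption swapped in for np.int64; it covers all NumPy integer widths.
    return isinstance(x, (int, np.integer))

kept_mixed = mixed[mixed["id"].apply(is_int)]
kept_text = as_text[as_text["id"].apply(is_int)]

print(kept_mixed)  # rows 0, 1 and 3
print(kept_text)   # empty: "1" is a str, not an int
```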
x.isnumeric() does not return True for float values: a string such as '1.5' fails the test, and an actual float object has no isnumeric method at all.
One way to keep only the values that can be converted to float:
def is_float(x):
    # True if x can be converted to float, False otherwise
    try:
        float(x)
    except (ValueError, TypeError):  # TypeError covers values such as None
        return False
    return True

df[df['id'].apply(is_float)]
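A short usage sketch on made-up values shows what this kind of check accepts. The helper is repeated so the snippet runs on its own, and catching TypeError as well as ValueError is an assumption added here so that non-string, non-numeric inputs do not raise. One thing to watch: float() also parses the strings 'nan' and 'inf', so those pass the test too.

```python
import pandas as pd

def is_float(x):
    # True if float(x) succeeds; TypeError is caught as a defensive assumption.
    try:
        float(x)
    except (ValueError, TypeError):
        return False
    return True

# Hypothetical ids: integers, a decimal, a negative number, text, and the string "nan".
df = pd.DataFrame({
    "id": ["1", "2.5", "tt", "-3", "nan"],
    "name": ["A", "B", "C", "D", "E"],
})

mask = df["id"].apply(is_float)
print(mask.tolist())  # [True, True, False, True, True]
print(df[mask])
```

If 'nan' should be rejected as well, pd.to_numeric(..., errors='coerce').notna() is a reasonable alternative, since the parsed NaN is then treated as missing.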
How about this? The .str accessor is one of my favorites :)
import pandas as pd
df = pd.DataFrame(
{
'id': {0: '1', 1: '2', 2: '3', 3: 'tt', 4: '4', 5: '5', 6: 'de'},
'name': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G'}
}
)
df_clean = df[df.id.str.isnumeric()]
Supplement (2021-06-22)
If the id column contains some headache-makers (such as float, None, or nan), you can forcefully cast them to the str data type using astype('str'):
import numpy as np
import pandas as pd
df = pd.DataFrame(
{
'id': {0: '1', 1: '2', 2: '3', 3: 3.14, 4: '4', 5: '5', 6: None, 7: np.nan},
'name': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H'}
}
)
df_clean = df[df.id.astype('str').str.isnumeric()]
Primitive, but it works anyway.
This is a dynamic way to do it. Note that it drops every column whose dtype is not numeric, rather than individual rows. It only works for int64 and float64; if you have other numeric data types in your dataframe, make sure you add them to the if statement.
# make a dataframe of the column data types
col_types = df.dtypes.to_frame()
col_types.columns = ['dtype']
# make a list of zeros, one per column
drop_it = [0] * col_types.shape[0]
k = 0
# set the flag to one if the column isn't numeric
# if you have other numeric types you need to add them to the if statement
for t in col_types['dtype']:
    if t != 'int64' and t != 'float64':
        drop_it[k] = 1
    k = k + 1
# keep only the entries flagged for dropping (the non-numeric columns)
col_types['drop_it'] = drop_it
col_types = col_types.loc[col_types["drop_it"] == 1]
# finally drop the columns that are in the drop list
for col_to_drop in col_types.index.values.tolist():
    df = df.drop([col_to_drop], axis=1)
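For the same goal of keeping only the numeric columns, pandas also provides DataFrame.select_dtypes, which avoids listing individual dtypes by hand. This is an alternative to the loop above rather than the answerer's method, and the sample frame is made up for illustration:

```python
import pandas as pd

# Hypothetical frame with an integer, a float and a string column.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "score": [0.5, 1.5, 2.5],
    "name": ["A", "B", "C"],
})

# "number" matches every numeric dtype (int8..int64, float32, float64, ...).
numeric_only = df.select_dtypes(include="number")

print(numeric_only.columns.tolist())  # ['id', 'score']
```

Like the loop, this works on column dtypes, so it does not help when a single column mixes numeric and non-numeric values, as in the original question.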