Converting dtypes in a messy pandas DataFrame?
I have a large DataFrame and want to convert its columns to the appropriate dtypes. The problem is that several numeric columns contain strings.
I know about convert_dtypes and to_numeric. The problem with the former is that it doesn't infer a column as int/float as soon as there are strings in it. to_numeric, on the other hand, has errors="coerce", which turns all invalid values into NaN. The problem with to_numeric is that several columns are genuinely strings, so I can't just run it on all columns.
So I am looking for a function that converts a column to a numeric dtype if a certain percentage of its values are numeric. It would be great if one could set the threshold for this.
As mentioned before, the dataset is large, so I would prefer a solution that handles all the columns automatically.
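For illustration, the trade-off described in the question can be reproduced on a toy frame. This is only a sketch: the column names `mixed` and `text` and their values are made up, not taken from the original data.

```python
import pandas as pd

# Hypothetical toy data: "mixed" is mostly numeric, "text" is genuinely textual.
df = pd.DataFrame({
    "mixed": [1, 7, "c", 4],
    "text": ["x", "y", "z", "w"],
})

# convert_dtypes does not give the mixed column a numeric dtype,
# because of the single stray string.
converted = df.convert_dtypes()
print(converted.dtypes)

# Coercing every column blindly fixes "mixed" but wipes out "text":
# every value that cannot be parsed becomes NaN.
coerced = df.apply(pd.to_numeric, errors="coerce")
print(coerced)
```

The second call shows why running to_numeric over all columns is not an option here: the text column survives only as a column of NaN.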
Use a custom function in DataFrame.apply that converts each column to numeric and returns the numeric column if the condition is met, otherwise the original column:
print(df)
   a  b  c  d  e
0  1  5  4  3  8
1  7  8  9  f  9
2  c  c  g  g  4
3  4  t  r  e  4
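import pandas as pd

def f(x, thresh):
    # Try to parse every value; non-numeric values become NaN
    y = pd.to_numeric(x, errors='coerce')
    # Keep the numeric column only if enough values parsed
    return y if y.notna().mean() > thresh else x

thresh = 0.7
df1 = df.apply(f, args=(thresh,))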
print(df1)
     a  b  c  d  e
0  1.0  5  4  3  8
1  7.0  8  9  f  9
2  NaN  c  g  g  4
3  4.0  t  r  e  4
print(df1.dtypes)
a    float64
b     object
c     object
d     object
e      int64
dtype: object
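The same idea can be wrapped into a helper that takes the whole DataFrame. The following is a minimal sketch, not part of the original answer: the name `coerce_mostly_numeric` is hypothetical, and it assumes that only object or string columns should be inspected, so existing numeric or datetime columns are left alone.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

def coerce_mostly_numeric(df, thresh=0.7):
    """Return a copy of df where mostly-numeric text columns are converted.

    Hypothetical helper: a column is converted when the share of values
    that parse as numbers is strictly greater than `thresh`.
    """
    out = df.copy()
    for col in out.columns:
        s = out[col]
        # Assumption: only object/string columns are candidates.
        if not (is_object_dtype(s) or is_string_dtype(s)):
            continue
        converted = pd.to_numeric(s, errors="coerce")
        if converted.notna().mean() > thresh:
            out[col] = converted
    return out

# Toy data mirroring three columns of the example above.
df = pd.DataFrame({
    "a": ["1", "7", "c", "4"],
    "b": ["5", "8", "c", "t"],
    "e": ["8", "9", "4", "4"],
})
result = coerce_mostly_numeric(df, thresh=0.7)
print(result.dtypes)
```

With `thresh=0.7`, column `a` (3 of 4 values numeric) is converted and its stray string becomes NaN, while `b` (2 of 4) is returned unchanged.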
Modified solution that also handles missing values, if any exist:
print(df)
   a  b    c  d  e
0  1  5    4  3  8
1  7  8  NaN  f  9
2  c  c  NaN  g  4
3  4  t    r  e  4
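def f(x, thresh):
    # Try to parse every value; non-numeric values become NaN
    y = pd.to_numeric(x, errors='coerce')
    # Values that were already missing do not count against the column
    return y if (y.notna() | x.isna()).mean() > thresh else x

thresh = 0.7
df1 = df.apply(f, args=(thresh,))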
print(df1)
     a  b    c  d  e
0  1.0  5  4.0  3  8
1  7.0  8  NaN  f  9
2  NaN  c  NaN  g  4
3  4.0  t  NaN  e  4
print(df1.dtypes)
a    float64
b     object
c    float64
d     object
e      int64
dtype: object
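Since a sensible threshold depends on the data, it may help to look at the per-column share of acceptable values before converting anything. A small sketch, assuming the same counting rule as the modified solution (already-missing values count as acceptable); the helper name `numeric_share` and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data mirroring three columns of the example with missing values.
df = pd.DataFrame({
    "a": ["1", "7", "c", "4"],
    "c": ["4", np.nan, np.nan, "r"],
    "d": ["3", "f", "g", "e"],
})

def numeric_share(x):
    """Fraction of values that are numeric or were missing to begin with."""
    y = pd.to_numeric(x, errors="coerce")
    return (y.notna() | x.isna()).mean()

# One share per column; compare these against candidate thresholds.
shares = df.apply(numeric_share)
print(shares)
```

Here `a` and `c` both come out at 0.75 and `d` at 0.25, which is why a threshold of 0.7 converts the first two and leaves `d` as it is.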