Converting dtypes in a messy pandas DataFrame?

Question

I have a large DataFrame and want to convert its columns to the appropriate dtypes. The problem is that several numeric columns also contain strings. I know about convert_dtypes and to_numeric. The problem with the former is that it does not infer a column as int/float as soon as there are strings in it. to_numeric, on the other hand, has errors='coerce', which turns all invalid values into NaN. The problem with to_numeric is that several columns really are string columns, so I can't just run it on all columns.

So I am looking for a function that converts a column to a numeric dtype if a certain percentage of its values are numeric. It would be great if one could set the threshold for this.

As mentioned before, the dataset is large, so I would prefer a solution that handles all the columns automatically.

Answer 1

Use a custom function that converts each column to numeric and returns the numeric column if it meets the condition, otherwise the original column, and pass it to DataFrame.apply:

print(df)
   a  b  c  d  e
0  1  5  4  3  8
1  7  8  9  f  9
2  c  c  g  g  4
3  4  t  r  e  4

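import pandas as pd

def f(x, thresh):
    # Coerce to numeric; values that cannot be parsed become NaN.
    y = pd.to_numeric(x, errors='coerce')
    # Keep the numeric column only if the share of parsed values exceeds thresh.
    return y if y.notna().mean() > thresh else x

thresh = 0.7
df1 = df.apply(f, args=(thresh,))
print(df1)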
     a  b  c  d  e
0  1.0  5  4  3  8
1  7.0  8  9  f  9
2  NaN  c  g  g  4
3  4.0  t  r  e  4

print(df1.dtypes)
a    float64
b     object
c     object
d     object
e      int64
dtype: object
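
The output above does not show how df was built. Here is a minimal self-contained sketch of the same approach, assuming every cell is stored as a string in an object column (the construction of the frame is an assumption, not part of the original answer):

```python
import pandas as pd

# Assumed construction of the sample frame: all cells are strings (object dtype).
df = pd.DataFrame(
    {
        "a": ["1", "7", "c", "4"],
        "b": ["5", "8", "c", "t"],
        "c": ["4", "9", "g", "r"],
        "d": ["3", "f", "g", "e"],
        "e": ["8", "9", "4", "4"],
    },
    dtype=object,
)


def f(x, thresh):
    # Unparseable values become NaN, so notna().mean() is the numeric share.
    y = pd.to_numeric(x, errors="coerce")
    return y if y.notna().mean() > thresh else x


df1 = df.apply(f, args=(0.7,))
print(df1.dtypes)
```

Column a has 3 of 4 parseable values (0.75 > 0.7), so it is converted and the stray 'c' becomes NaN. Columns b, c and d fall below the threshold and are returned unchanged.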

Modified solution for data with missing values (if any exist). Values that are already missing are counted as valid, so they do not push a column below the threshold:

print(df)
   a  b    c  d  e
0  1  5    4  3  8
1  7  8  NaN  f  9
2  c  c  NaN  g  4
3  4  t    r  e  4

def f(x, thresh):
    y = pd.to_numeric(x, errors='coerce')
    # Count a value as valid if it parsed or was already missing in the input.
    return y if (y.notna() | x.isna()).mean() > thresh else x

thresh = 0.7
df1 = df.apply(f, args=(thresh,))
print(df1)
     a  b    c  d  e
0  1.0  5  4.0  3  8
1  7.0  8  NaN  f  9
2  NaN  c  NaN  g  4
3  4.0  t  NaN  e  4

print(df1.dtypes)
a    float64
b     object
c    float64
d     object
e      int64
dtype: object
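
To choose a threshold, it can help to look at the share of convertible values per column first. Here is a sketch that uses the same counting rule as the modified function. The helper name numeric_share is hypothetical, and the frame construction is assumed:

```python
import numpy as np
import pandas as pd

# Assumed construction of the sample frame with missing values in column c.
df = pd.DataFrame(
    {
        "a": ["1", "7", "c", "4"],
        "b": ["5", "8", "c", "t"],
        "c": ["4", np.nan, np.nan, "r"],
        "d": ["3", "f", "g", "e"],
        "e": ["8", "9", "4", "4"],
    },
    dtype=object,
)


def numeric_share(x):
    # Share of values that either parse as numbers or were already missing.
    y = pd.to_numeric(x, errors="coerce")
    return (y.notna() | x.isna()).mean()


# One value per column; columns above the chosen threshold would be converted.
shares = df.apply(numeric_share)
print(shares.sort_values(ascending=False))
```

With thresh = 0.7, the columns whose share exceeds 0.7 here are a, c and e, which matches the dtypes shown above.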

Answer 1
1 2021-12-17 09:17:01
