How to remove non-ASCII characters and spaces from column names

Question

I have a dataframe. Many column names contain non-ASCII characters and special characters such as (), /, +, . (non-ASCII dots in the middle), as well as non-ASCII spaces. This did not happen while reading the CSV; it happened because of one-hot encoding (when I converted my categorical variables to numeric columns, the category values contained non-ASCII characters).

df

Col1/name   Col 2() name    Col3 + name    Col4 ^¨ name   etc...

Expected output

I want only numbers, underscores and letters in my column names (I only want to change the column names, not any values in the dataframe or its rows). This is necessary because some machine learning algorithms, such as LightGBM, don't work with non-ASCII characters or non-ASCII spaces in column names.

Expected output df:

Col1name   Col_2_name    Col3__name    Col4__name   etc...

So: replace spaces with underscores, and remove any other character in the column names that is not a letter or a digit.

Answer 1

One way, using pandas.Series.str.replace and findall:

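# regex=True is needed on pandas >= 2.0, where str.replace treats the pattern as a literal by default
df.columns = ["".join(l) for l in df.columns.str.replace(r"\s", "_", regex=True).str.findall(r"[\w\d]+")]
print(df)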

Output:

Empty DataFrame
Columns: [Col1name, Col_2_name, Col3__name, Col4__name]
Index: []
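One caveat with \w: in Python 3 it matches Unicode word characters, so accented letters such as é survive the findall step. If the goal is strictly ASCII names, a sketch along the following lines may help. Note that clean_column_name is a hypothetical helper name (not part of pandas), and the sample column names, including the extra "Café size" column with a non-breaking space, are made up for illustration.

```python
import re

import pandas as pd


def clean_column_name(name: str) -> str:
    """Hypothetical helper: whitespace -> underscore, then keep ASCII letters, digits, underscores."""
    # \s also matches Unicode whitespace such as the non-breaking space \u00a0
    name = re.sub(r"\s", "_", str(name))
    # An explicit ASCII character class, unlike \w, also drops non-ASCII letters
    return re.sub(r"[^0-9A-Za-z_]", "", name)


# Illustrative column names; the last one contains a non-ASCII letter and a non-breaking space
df = pd.DataFrame(
    columns=["Col1/name", "Col 2() name", "Col3 + name", "Col4 ^¨ name", "Café\u00a0size"]
)
df.columns = [clean_column_name(c) for c in df.columns]
print(list(df.columns))
```

Distinct original names can collapse to the same cleaned name, so it may be worth checking df.columns.is_unique afterwards.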

Answer 2

You can use the str.replace method:

df.columns.str.replace(r'\s+', '_', regex=True).str.replace(r'\W+', '', regex=True)

Output:

Index(['Col1name', 'Col_2_name', 'Col3__name', 'Col4__name'], dtype='object')

You can collapse repeated underscores with str.replace(r'_{2,}', '_', regex=True).
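Put together, the chain might look like the sketch below. It assigns the result back to df.columns (the expression on its own only returns a new Index) and passes regex=True explicitly, which pandas 2.0 and later require for pattern matching. The empty sample dataframe is an assumption used only to make the snippet runnable.

```python
import pandas as pd

# Illustrative empty dataframe with the column names from the question
df = pd.DataFrame(columns=["Col1/name", "Col 2() name", "Col3 + name", "Col4 ^¨ name"])

df.columns = (
    df.columns
    .str.replace(r"\s+", "_", regex=True)    # each run of whitespace -> one underscore
    .str.replace(r"\W+", "", regex=True)     # drop everything that is not a word character
    .str.replace(r"_{2,}", "_", regex=True)  # collapse repeated underscores
)
print(list(df.columns))
```

With the underscore-collapsing step included, the third and fourth names come out with a single underscore rather than the double underscore shown in the question's expected output.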

2 answers

Answer 1
6, accepted, 2020-03-06 08:00:30

Answer 2
1, 2020-03-06 08:23:30
