How to remove non-ASCII characters and spaces from column names

I have a dataframe. Many column names contain non-ASCII characters and special characters such as (), /, +, . (non-ASCII dots in the middle), as well as non-ASCII spaces. This did not happen while reading the CSV; it happened because of one-hot encoding (when I converted my categorical variables to numeric columns, the category values contained non-ASCII characters).
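A minimal sketch of how one-hot encoding can produce such names, assuming pandas.get_dummies was the encoder (the question does not say which one was used). The frame `raw` and its category values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: category values containing a slash, parentheses,
# a plus sign and a non-breaking space (U+00A0).
raw = pd.DataFrame({"Col": ["1/name", "2() name", "3 + name", "4\u00a0name"]})

# One-hot encoding copies every category value into a column name,
# so any special character in the data ends up in the header.
encoded = pd.get_dummies(raw, columns=["Col"])
print(list(encoded.columns))
```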

df

Col1/name   Col 2() name    Col3 + name    Col4 ^¨ name   etc...

Expected output

I want only letters, numbers and underscores in my column names (I only want to change the column names, not any values in the dataframe's rows). This is necessary because some machine learning algorithms, such as LightGBM, don't work with non-ASCII characters or non-ASCII spaces in column names.

Expected output df:

Col1name   Col_2_name    Col3__name    Col4__name   etc...

So: replace spaces with underscores and remove every other character in the column names that is not a letter or a digit.

One way using pandas.Series.str.replace and findall:

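# regex=True is required in pandas 2.0+, where str.replace matches literally by default
df.columns = ["".join(parts) for parts in df.columns.str.replace(r"\s", "_", regex=True).str.findall(r"\w+")]
print(df)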

Output:

Empty DataFrame
Columns: [Col1name, Col_2_name, Col3__name, Col4__name]
Index: []
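One caveat: in Python 3, \w is Unicode-aware for str patterns, so accented or non-Latin letters survive the findall step. If the goal is strictly ASCII names, as the question asks, an explicit character class is one option. The sketch below is only an illustration: the helper name `to_ascii_name` and the extra column "Café\u00a0prix" are not from the question.

```python
import re
import pandas as pd

# Sample frame from the question, plus one made-up column containing an
# accented letter and a non-breaking space (U+00A0).
df = pd.DataFrame(columns=["Col1/name", "Col 2() name", "Col3 + name",
                           "Col4 ^¨ name", "Café\u00a0prix"])

def to_ascii_name(name):
    """Hypothetical helper: keep only ASCII letters, digits and underscores."""
    # \s also matches Unicode whitespace such as U+00A0.
    name = re.sub(r"\s", "_", name)
    # Drop everything outside the explicit ASCII class.
    return re.sub(r"[^A-Za-z0-9_]", "", name)

df.columns = [to_ascii_name(c) for c in df.columns]
print(list(df.columns))
```

Note that this drops non-ASCII letters outright ("Café" becomes "Caf") rather than transliterating them.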

You can use the method replace:

df.columns.str.replace(r'\s+', '_', regex=True).str.replace(r'\W+', '', regex=True)

Output:

Index(['Col1name', 'Col_2_name', 'Col3__name', 'Col4__name'], dtype='object')

You can collapse repeated underscores into one with str.replace('_{2,}', '_', regex=True).
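Put together, the chain with the underscore step might look like the following sketch. It runs on a bare Index holding the question's sample names, and `regex=True` is passed explicitly because pandas 2.0+ treats the pattern as a literal string otherwise.

```python
import pandas as pd

# Column names from the question, held in a bare Index for illustration.
cols = pd.Index(["Col1/name", "Col 2() name", "Col3 + name", "Col4 ^¨ name"])

cleaned = (
    cols.str.replace(r"\s+", "_", regex=True)    # whitespace runs -> one underscore
        .str.replace(r"\W+", "", regex=True)     # drop remaining non-word characters
        .str.replace(r"_{2,}", "_", regex=True)  # collapse repeated underscores
)
print(list(cleaned))
# ['Col1name', 'Col_2_name', 'Col3_name', 'Col4_name']
```

This differs from the expected output in the question, which keeps the double underscores. Collapsing can also make two previously distinct names identical, so checking `cleaned.is_unique` before assigning the result to `df.columns` may be worthwhile.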
