How to deal with "Do not support non-ASCII characters in feature name" error when I use lightGBM?

Question

I want to do machine learning with LightGBM in Python.
I use a pandas.DataFrame with Japanese column names as the training input.
Until recently, training worked without any errors.

However, I recently reinstalled Anaconda and, at the same time, installed LightGBM using conda.
Since then, the following error has appeared.

LightGBMError: Do not support non-ASCII characters in feature name.

When I renamed the columns to sequential integers starting from 0, training worked as usual.
So the cause is probably the Japanese column names, as the error message indicates.
(The error occurs both when training with train() and when training with fit().)

I would like to know the following two things.

Why can't I use Japanese column names as before?
Is there a way to use Japanese column names as before?

The environment I am using is as follows.

OS: Windows 10 Home  
Coding environment: Jupyter Notebook  
Python version: 3.7.6  
LightGBM version: 2.3.1

If you know the answer to my question, please tell me.
Sorry for my poor English.

Answer 1

My previously working code also stopped running recently. I think I upgraded LightGBM at some point, and the error started appearing after that. I rolled back to 2.2.3 and everything works again.

Answer 2

You can strip every character other than ASCII letters, digits, and underscores from the column names with a single statement:

import re
df = df.rename(columns=lambda x: re.sub(r'[^A-Za-z0-9_]+', '', x))
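
One caveat with the regex above: a column name written entirely in Japanese is reduced to an empty string, so several columns can end up with the same (empty) name. As a possible alternative, here is a minimal sketch that replaces each column with an ASCII placeholder and keeps a dictionary for translating back afterwards. The helper name `to_ascii_columns`, the `f` prefix, and the sample column names are made up for illustration; they are not part of pandas or LightGBM.

```python
import pandas as pd


def to_ascii_columns(df, prefix="f"):
    """Return a copy of df with ASCII placeholder column names,
    plus a dict mapping each placeholder back to the original name.

    Hypothetical helper for illustration only.
    """
    # Build placeholder -> original name, e.g. {"f0": "年齢", "f1": "身長"}
    mapping = {f"{prefix}{i}": name for i, name in enumerate(df.columns)}
    renamed = df.copy()
    renamed.columns = list(mapping.keys())
    return renamed, mapping


# Sample data with Japanese column names (made up for this example)
df = pd.DataFrame({"年齢": [20, 35], "身長": [170.5, 160.2]})
ascii_df, mapping = to_ascii_columns(df)

print(list(ascii_df.columns))
print(mapping)
```

After training on `ascii_df`, the `mapping` dictionary can be used to relabel feature importances or plots with the original Japanese names.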

Answer 3

LightGBM 3.0.0 (August 2020) restored support for non-ASCII feature names.

Upgrade to at least lightgbm 3.0.0 (the newest version at the time of writing is 3.1.0).

pip install --upgrade 'lightgbm>=3.0.0'
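
To confirm which version actually ended up in the active environment, something like the following sketch may help. It reads the installed version through the standard library's `importlib.metadata` (Python 3.8+) and compares it with the 3.0.0 minimum given in this answer. The helper names `version_tuple` and `meets_answer_minimum` are made up for illustration, and the comparison only looks at the first three numeric components of the version string.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn a version string such as '3.1.0' into a tuple like (3, 1, 0).

    Only the first three numeric components are used (a simplification).
    """
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def meets_answer_minimum(version):
    """True if the version is at least 3.0.0, the minimum named in this answer."""
    return version_tuple(version) >= (3, 0, 0)


try:
    installed = metadata.version("lightgbm")
    print("lightgbm", installed, "meets minimum:", meets_answer_minimum(installed))
except metadata.PackageNotFoundError:
    print("lightgbm is not installed in this environment")
```

Run it in the same Jupyter kernel that raises the error, since a notebook can use a different environment from the one where `pip` was invoked.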

You can test with the example code below, which was originally posted in microsoft/LightGBM#2976. In the future, please include a small, reproducible code sample in your question if possible.

import lightgbm
import numpy
from matplotlib import pyplot

numpy.random.seed(42)

X = numpy.random.normal(size=(1000, 3))
y = numpy.random.random(1000)

train_lgb = lightgbm.Dataset(X, y)

feature_names = ['F_零', 'F_一', 'F_二']

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'l2',
    'num_leaves': 31,
    'verbose': 0,
}

print('Starting training...')
gbm = lightgbm.train(
    params,
    train_lgb,
    num_boost_round=10,
    feature_name=feature_names,
)

print('Plotting feature importances...')
ax = lightgbm.plot_importance(gbm, ignore_zero=False)
pyplot.show()
