Left-side CSV file has five columns .application
column has several app types delimited with ;
. Depending on the app
, device
and district
type, I want to predict the target
. But I want to first convert the file in to the right-hand side dataframe to apply machine learning.
How can I do this using python?
You need to apply multi-hot encoding for application
column and one hot encoding for other columns.
Here is my solution!
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'number': np.random.randint(0,10,size=5),
'device': np.random.choice(['a','b'],size=5),
'application': ['app2;app3','app1','app2;app4', 'app1;app2', 'app1'],
'district': np.random.choice(['aa', 'bb', 'cc'],size=5)})
>>> df
application device district number
0 app2;app3 b aa 3
1 app1 a cc 7
2 app2;app4 a aa 3
3 app1;app2 b bb 9
4 app1 a cc 4
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# Assuming appl names are separated by ;
mhv = mlb.fit_transform(df['application'].apply(lambda x: set(x.split(';'))))
df_out = pd.DataFrame(mhv,columns=mlb.classes_)
enc = OneHotEncoder(sparse=False)
ohe_vars = ['device','district'] # specify the list of columns here
ohv = enc.fit_transform(df.loc[:,ohe_vars])
ohe_col_names = ['%s_%s'%(var,cat) for var,cats in zip(ohe_vars, enc.categories_) for cat in cats]
df_out.assign(**dict(zip(ohe_col_names,ohv.T)))
df_out
You can first read in the csv the way it is and then do all of the editing. So first..
df = pd.read_csv("data.csv")
then take the data in the application column and make it into a list. Then take the list and join it into a string and split it back up at the ";" deliminator.
col = list(df.application)
col_join = ";".join(col)
col_split = col_join.split(";")
After you have that list you can loop through and delete the duplicates...
i=0
while i < len(col_split):
j=i+1
while j < len(col_split):
if col_split[i] == col_split[j]:
del(col_split[j])
else:
j += 1
i += 1
Now you have a list of all of the unique application entries that you can read into a data frame to use as columns.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.