I have a DataFrame in Python Pandas like the one below (with both numeric and object columns):
data types:
COL1 | COL2 | COL3 | ... | COLn |
---|---|---|---|---|
111 | A | Y | ... | ... |
222 | A | Y | ... | ... |
333 | B | Z | ... | ... |
444 | C | Z | ... | ... |
555 | D | P | ... | ... |
I need to apply dummy coding (pandas.get_dummies()) only to categorical variables which have:
So, for example:
As a result, I need something like the following:
COL1 | COL3_Y | COL3_Z | ... | COLn
-----|--------|--------|------|------
111 | 1 | 0 | ... | ...
222 | 1 | 0 | ... | ...
333 | 0 | 1 | ... | ...
444 | 0 | 1 | ... | ...
555 | 0 | 0 | ... | ...
With the following toy dataframe:
import pandas as pd

df = pd.DataFrame(
    {
        "COL1": [111, 222, 333, 444, 555],
        "COL2": ["A", "A", "B", "C", "D"],
        "COL3": ["Y", "Y", "Z", "Z", "P"],
        "COL4": ["U", "U", "W", "V", "V"],
    }
)
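Before filtering, it can help to look at the two quantities the approach relies on: the number of distinct values per column and the share of rows taken by each value. A minimal sketch, rebuilding the same toy dataframe so it runs on its own (the names n_unique and col3_freq are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "COL1": [111, 222, 333, 444, 555],
        "COL2": ["A", "A", "B", "C", "D"],
        "COL3": ["Y", "Y", "Z", "Z", "P"],
        "COL4": ["U", "U", "W", "V", "V"],
    }
)

# Number of distinct values in each column
n_unique = df.nunique()
print(n_unique)

# Share of rows taken by each value of a single column
col3_freq = df["COL3"].value_counts(normalize=True)
print(col3_freq)
```

Here COL2 has four distinct values while COL3 and COL4 have three each, and in COL3 the values Y and Z each cover 40% of the rows while P covers 20%.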
Here is one way to do it:
# Setup: start from the identifier column and count distinct values per column
new_df = df["COL1"]
s = df.nunique()

# Keep only columns with at most 3 distinct values
tmp = df.loc[:, s[s <= 3].index]

# For each remaining column, keep only values found in at least 40% of rows,
# then get dummies and concat the new columns
for col in tmp.columns:
    frq = tmp[col].value_counts() / tmp.shape[0]
    other_tmp = tmp[col]
    other_tmp = other_tmp[other_tmp.isin(frq[frq >= 0.4].index)]
    other_tmp = pd.get_dummies(other_tmp)
    new_df = pd.concat([new_df, other_tmp], axis=1)

# Cleanup: rows dropped by the frequency filter show up as NaN after concat
new_df = new_df.fillna(0).astype(int)
Then:
print(new_df)
# Output
   COL1  Y  Z  U  V
0   111  1  0  1  0
1   222  1  0  1  0
2   333  0  1  0  0
3   444  0  1  0  1
4   555  0  0  0  1
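The output above names the new columns after the values only (Y, Z, U, V), whereas the question asks for names such as COL3_Y. A possible variant, sketched under assumptions, wraps the same logic in a function and passes prefix=col to pd.get_dummies. The function name filtered_dummies and its parameters id_col, max_categories and min_share are hypothetical, and the defaults 3 and 0.4 are simply the thresholds used in the code above:

```python
import pandas as pd


def filtered_dummies(df, id_col="COL1", max_categories=3, min_share=0.4):
    """Hypothetical helper: dummy-code frequent values of low-cardinality columns."""
    parts = [df[[id_col]]]
    # Only consider non-numeric columns (assumed here to be the categorical ones)
    for col in df.select_dtypes(exclude="number").columns:
        if df[col].nunique() > max_categories:
            continue  # too many distinct values, skip the whole column
        share = df[col].value_counts(normalize=True)
        keep = share[share >= min_share].index
        # Values outside `keep` become missing; get_dummies encodes them as all zeros
        masked = df[col].where(df[col].isin(keep))
        parts.append(pd.get_dummies(masked, prefix=col, dtype=int))
    return pd.concat(parts, axis=1)


df = pd.DataFrame(
    {
        "COL1": [111, 222, 333, 444, 555],
        "COL2": ["A", "A", "B", "C", "D"],
        "COL3": ["Y", "Y", "Z", "Z", "P"],
        "COL4": ["U", "U", "W", "V", "V"],
    }
)

result = filtered_dummies(df)
print(result)
```

Masking rare values with where() instead of dropping rows keeps the original index intact, so no fillna() step is needed after the concat.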