I have a DataFrame containing strings. I would like to create another DataFrame that indicates whether the string contains a specific month through one-hot encoding.
Using the below as an example:
import pandas as pd

data = {
    'User': ['1', '2', '3', '4'],
    'Months': ['January; February', 'March; August', 'October; January', 'August, December']}
df = pd.DataFrame(data, columns=['User', 'Months'])
I am looking to produce the following sort of DataFrame:
| User | January | August |
| 1    | 1       | 0      |
| 2    | 0       | 1      |
| 3    | 1       | 0      |
| 4    | 0       | 1      |
I have tried the following but I get a value error and it also would not produce the desired DataFrame:
if df[df['Months'].str.contains('January')]:
    print("1")
else:
    print("0")
Thanks in advance!
You can use Series.str.extract first to extract the specific substrings, pass the result to get_dummies, then join back:
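l = ['January', 'August']
# dtype=int keeps the indicator columns as 0/1 on pandas versions
# where get_dummies returns booleans by default
out = df[['User']].join(
    pd.get_dummies(df['Months'].str.extract(f"({'|'.join(l)})", expand=False), dtype=int))
print(out)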
   User  August  January
0     1       0        1
1     2       1        0
2     3       0        1
3     4       1        0
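One caveat: str.extract keeps only the first match in each string, so a row such as 'January; August' would be flagged for January alone. The ValueError in the question comes from using a DataFrame directly in an `if`, whose truth value is ambiguous. A minimal sketch that checks each month separately with str.contains avoids both problems. The names `months` and `out` are only illustrative, and the sample frame is rebuilt here so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'User': ['1', '2', '3', '4'],
    'Months': ['January; February', 'March; August',
               'October; January', 'August, December'],
})

months = ['January', 'August']  # illustrative: the months to flag

out = df[['User']].copy()
for month in months:
    # str.contains returns a boolean Series (one value per row);
    # astype(int) converts it to a 0/1 indicator column
    out[month] = df['Months'].str.contains(month, regex=False).astype(int)

print(out)
```

Because every month is tested independently, a row that mentions several of the listed months gets a 1 in each matching column.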
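Alternatively, split each string on either separator, explode the resulting lists into one row per month, and cross-tabulate. Matching the whitespace around the separator keeps " August" and "August" from ending up in separate columns:
df = pd.concat(
    [df["User"], df["Months"].str.split(r"\s*[,;]\s*")], axis=1
).explode("Months")
print(pd.crosstab(df["User"], df["Months"]))
Prints:
Months  August  December  February  January  March  October
User
1            0         0         1        1      0        0
2            1         0         0        0      1        0
3            0         0         0        1      0        1
4            1         1         0        0      0        0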
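A related sketch, not part of the answer above, uses Series.str.get_dummies and then keeps only the months the question asks about. It assumes the separators are limited to ',' and ';' as in the sample data. The names `normalised`, `dummies` and `out` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'User': ['1', '2', '3', '4'],
    'Months': ['January; February', 'March; August',
               'October; January', 'August, December'],
})

# Normalise both separators (and any surrounding spaces) to '|',
# which is the default separator of Series.str.get_dummies
normalised = df['Months'].str.replace(r'\s*[,;]\s*', '|', regex=True)

# One indicator column per distinct month found in the data
dummies = normalised.str.get_dummies(sep='|')

# Keep only the months of interest and attach them to the User column
out = df[['User']].join(dummies[['January', 'August']])
print(out)
```

Selecting columns from `dummies` raises a KeyError if a requested month never occurs in the data, so `dummies.reindex(columns=[...], fill_value=0)` may be a safer choice in that case.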