Python DataFrame: One-Hot Encode Rows Containing a Specific Substring

Question

I have a DataFrame containing strings. I would like to create another DataFrame that indicates whether the string contains a specific month through one-hot encoding.

Using the below as an example:

import pandas as pd

data = {
    'User': ['1', '2', '3', '4'],
    'Months': ['January; February', 'March; August', 'October; January', 'August, December']}

df = pd.DataFrame(data, columns=['User', 'Months'])

I am looking to produce the following sort of DataFrame:

         | January | August |
User | 1 |    1    |    0   |
     | 2 |    0    |    1   |
     | 3 |    1    |    0   |
     | 4 |    0    |    1   |

I have tried the following, but it raises a ValueError and would not produce the desired DataFrame anyway:

if df[df['Months'].str.contains('January')]:
    print("1")
else:
    print("0")

Thanks in advance!

Answer 1

You can use Series.str.extract to pull out the specific substrings, pass the result to pd.get_dummies, and join it back to the User column:

l = ['January', 'August']
# dtype=int keeps the indicators as 0/1 (newer pandas versions default to booleans)
out = df[['User']].join(
    pd.get_dummies(df['Months'].str.extract(f"({'|'.join(l)})", expand=False), dtype=int))

print(out)

  User  August  January
0    1       0        1
1    2       1        0
2    3       0        1
3    4       1        0
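
One caveat with the approach above: str.extract keeps only the first match per row, so a row that mentions both January and August would be flagged for just one of them. A minimal sketch of an alternative, assuming plain substring matching is good enough, builds one indicator column per month with str.contains. The names months and flags are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "User": ["1", "2", "3", "4"],
    "Months": ["January; February", "March; August",
               "October; January", "August, December"],
})

months = ["January", "August"]  # months of interest (illustrative name)

# One indicator column per month: str.contains gives True/False,
# which is then cast to 0/1. regex=False treats the month as a literal substring.
flags = pd.DataFrame(
    {m: df["Months"].str.contains(m, regex=False).astype(int) for m in months},
    index=df.index,
)

out = df[["User"]].join(flags)
print(out)
```

Because each month is tested independently, a row containing several of the listed months gets a 1 in every matching column.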

Answer 2

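Split each string on either separator (consuming the surrounding whitespace, so that a month with a leading space does not become its own column), explode to one row per month, then cross-tabulate:

df = pd.concat([df["User"], df.Months.str.split(r"\s*[,;]\s*")], axis=1).explode(
    "Months"
)
print(pd.crosstab(df["User"], df["Months"]))

Prints:

Months  August  December  February  January  March  October
User                                                       
1            0         0         1        1      0        0
2            1         0         0        0      1        0
3            0         0         0        1      0        1
4            1         1         0        0      0        0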
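
A similar table can probably be built without explode and crosstab by using Series.str.get_dummies. The sketch below assumes the only separators are ";" and "," (with optional spaces) and first rewrites them to a single "|" so that get_dummies can split on it. The names normalised, dummies and wanted are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "User": ["1", "2", "3", "4"],
    "Months": ["January; February", "March; August",
               "October; January", "August, December"],
})

# Use User as the index so it labels the rows of the result.
months = df.set_index("User")["Months"]

# Collapse "; " and ", " (with any surrounding whitespace) into one separator.
normalised = months.str.replace(r"\s*[,;]\s*", "|", regex=True)

# One 0/1 column per distinct month found in the strings.
dummies = normalised.str.get_dummies(sep="|")

# Keep only the months asked about in the question.
wanted = dummies[["January", "August"]]
print(wanted)
```

Selecting columns at the end keeps the full month table available in dummies if other months are needed later.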


2 answers

solution1
2 2021-04-26 16:48:09

solution2
1 ACCEPTED 2021-04-26 16:45:11
