
How does pandas get_dummies convert values

I have this column

df["Pclass"].tail()

Pclass

2
1
3
1
3

I created dummies of the column

dummies = pd.get_dummies(df["Pclass"],prefix="Pclass")
df = pd.concat([df,dummies],axis=1)

result

df[["Pclass_1", "Pclass_2", "Pclass_3"]].tail()


    Pclass_1    Pclass_2    Pclass_3
886   0             1         0
887   1             0         0
888   0             0         1
889   1             0         0
890   0             0         1

I don't quite get it. By what rules are the numbers in the column transformed into 1s and 0s?

pd.get_dummies

It basically pivots each unique value of the categorical column into its own column and uses a flag (1 or 0) to mark which categorical value was present in that row.

Let's look at a less abstract example:

df = pd.DataFrame({'sex':['male', 'female', 'unknown', 'female']})

       sex
0     male
1   female
2  unknown
3   female

df.join(pd.get_dummies(df['sex'], prefix='sex'))

       sex  sex_female  sex_male  sex_unknown
0     male           0         1            0
1   female           1         0            0
2  unknown           0         0            1
3   female           1         0            0

As you can see, the first row in our original column is male, and in the dummy column sex_male we see the flag 1.

       sex  sex_female  sex_male  sex_unknown
0     male           0         1            0

Then on the second row, the value in our original column is female, and the dummy column sex_female has the flag 1:

       sex  sex_female  sex_male  sex_unknown
1   female           1         0            0

And so on.

What's also important to remember is that when you apply pd.get_dummies:

number of new dummy columns = number of unique values in the original categorical column
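A quick way to check this, as a minimal sketch: the small frame below just mirrors the example above, and `dtype=int` is passed because recent pandas versions return True/False rather than 1/0 by default. The variable names are only for illustration.

```python
import pandas as pd

# Illustrative data, mirroring the example above
df = pd.DataFrame({'sex': ['male', 'female', 'unknown', 'female']})

# dtype=int keeps the 0/1 display; pandas 2.x defaults to bool
dummies = pd.get_dummies(df['sex'], prefix='sex', dtype=int)

n_unique = df['sex'].nunique()    # distinct non-missing values
n_dummy_cols = dummies.shape[1]   # one dummy column per distinct value
print(n_unique, n_dummy_cols)

# Every row has exactly one flag set
row_sums = dummies.sum(axis=1).tolist()
```

Note that missing values (NaN) do not get their own column unless you pass dummy_na=True.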


In machine learning terms, we call this one-hot encoding.

With scikit-learn it would look as follows:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
encoder.fit_transform(df['sex'].to_numpy().reshape(-1,1)).toarray()

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

Predictive models that depend on numeric inputs cannot directly handle open text fields or categorical attributes. Instead, these information-rich data need to be processed prior to presenting the information to a model. Tree-based and Naive Bayes models are exceptions; most models require that the predictors take numeric form.

Creating dummy variables is one approach for transforming unordered categorical attributes into numeric form. @Erfan has answered what dummy variables do. An unordered predictor with C categories can be represented by C−1 binary dummy variables or by a hashed version of binary dummy variables. These methods effectively present the categorical information to the models.
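As a rough sketch of the C−1 representation in pandas, get_dummies accepts drop_first=True. The series below is made up for illustration, and `dtype=int` is only there to get 1/0 instead of True/False.

```python
import pandas as pd

# Illustrative column with C = 3 categories
s = pd.Series(['male', 'female', 'unknown', 'female'], name='sex')

full = pd.get_dummies(s, prefix='sex', dtype=int)                      # C columns
reduced = pd.get_dummies(s, prefix='sex', drop_first=True, dtype=int)  # C-1 columns

# The dropped category (the first in sorted order, here 'female') is
# implied by a row in which all remaining dummies are 0.
full_cols = list(full.columns)
reduced_cols = list(reduced.columns)
print(reduced)
```

Dropping one level avoids perfectly collinear columns, which matters for models such as linear regression with an intercept.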

But now suppose that the C categories have a relative ordering. For example, consider a predictor that has the categories of “low”, “medium”, and “high.” Creating dummy attributes as done for Unordered Data would miss the information contained in the relative ordering.

Options for encoding ordered data:

  • Polynomial Contrast: A contrast is a linear combination of variables (parameters or statistics) whose coefficients add up to zero, allowing comparison of different treatments.
  • Treat the predictors as unordered factors. If the true underlying pattern is linear or quadratic, unordered dummy variables may not effectively uncover this trend.
  • Translate the ordered categories into a single set of numeric scores based on context-specific information.
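The last option can be sketched in pandas in two ways. The scores 1, 2, 3 are an assumption chosen for illustration, and in practice they should come from domain knowledge.

```python
import pandas as pd

# Illustrative ordered predictor
s = pd.Series(['low', 'high', 'medium', 'low'])

# Option A: explicit numeric scores (assumed values)
scores = {'low': 1, 'medium': 2, 'high': 3}
as_scores = s.map(scores)

# Option B: an ordered categorical dtype; .cat.codes gives 0-based ranks
levels = pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True)
as_codes = s.astype(levels).cat.codes

print(as_scores.tolist(), as_codes.tolist())
```

Either way the result is a single numeric column that preserves the ordering, unlike one dummy column per level.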

It makes a dummy column for each value that appeared in the original column, and then for each row puts a 1 if that row had the value corresponding to the dummy column and a 0 otherwise.

Row 886 had a 2 in column Pclass, so that is converted to a 1 in column Pclass_2 and a 0 in all other dummy columns.

Row 887 had a 1 in column Pclass, so that is converted to a 1 in column Pclass_1 and a 0 in all other dummy columns.
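To make the rule concrete, here is a small sketch that rebuilds the dummy columns by hand with an equality comparison per distinct value. It uses only the five rows shown in the question; the `manual` frame is a hypothetical name, and `dtype=int` is passed so newer pandas versions show 1/0 rather than True/False.

```python
import pandas as pd

# Last five Pclass values from the question, with their index labels
df = pd.DataFrame({'Pclass': [2, 1, 3, 1, 3]}, index=range(886, 891))

dummies = pd.get_dummies(df['Pclass'], prefix='Pclass', dtype=int)

# The same columns built by hand: compare the column with each distinct value
manual = pd.DataFrame(
    {f'Pclass_{v}': (df['Pclass'] == v).astype(int)
     for v in sorted(df['Pclass'].unique())}
)

# Same column names and same 0/1 values
same = (list(dummies.columns) == list(manual.columns)
        and bool((dummies.to_numpy() == manual.to_numpy()).all()))
print(dummies)
```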
