python pandas normalize column with keras, then splitting to groups

I have the following data frame (the actual data frame contains multiple string and numeric columns):

  col1  col2
0    A    10
1    A    10
2    B     5
3    B     5

I want to normalize the numeric columns so the result looks like this:

  col1      col2
0    A  0.632456
1    A  0.632456
2    B  0.316228
3    B  0.316228

And then split it into groups to get:

  col1      col2
0    A  0.632456
1    A  0.632456

  col1      col2
0    B  0.316228
1    B  0.316228

Splitting into groups is easy; however, I'm struggling with the normalization. I've tried using the following code:

import pandas as pd
from keras.utils import normalize

df = pd.DataFrame({"col1":["A","A","B","B"],"col2":[10,10,5,5]})
normalize(df, axis=0)  # fails: col1 contains strings

But it fails because the data frame contains strings. It would work if the values A and B were numeric.

Q: How can I normalize the numeric values column by column without dropping the string columns, so I can group by them later?

When dealing with categorical data, you should be looking at encoding methods such as OneHotEncoder. It doesn't make sense to try to normalize these columns directly. In this case, you could use a scaler such as MinMaxScaler (or keras' normalize) for the numerical columns, and then one-hot encode the categorical columns:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

sc = MinMaxScaler()
oh = OneHotEncoder()

# Both transformers expect 2D input, hence the [:, None]
col2_norm = sc.fit_transform(df.col2.to_numpy()[:, None])
col1_one_hot = oh.fit_transform(df.col1.to_numpy()[:, None]).toarray()

# Columns: one-hot A, one-hot B, scaled col2
np.concatenate([col1_one_hot, col2_norm], axis=1)

array([[1., 0., 1.],
       [1., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.]])
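
If you would rather keep labelled columns than a bare array, a pandas-only variant might look like the sketch below. It is not the code above: it assumes pd.get_dummies in place of OneHotEncoder and a hand-written min-max formula in place of MinMaxScaler, and the names `scaled`, `encoded` and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "A", "B", "B"], "col2": [10, 10, 5, 5]})

# Min-max scale the numeric column by hand: (x - min) / (max - min).
# Assumption: the column is not constant, so max != min.
col2 = df["col2"].astype(float)
scaled = (col2 - col2.min()) / (col2.max() - col2.min())

# One-hot encode the string column; dtype=float gives 0.0/1.0 values.
encoded = pd.get_dummies(df["col1"], prefix="col1", dtype=float)

# Put both pieces side by side, keeping column names and the index.
result = pd.concat([encoded, scaled], axis=1)
print(result)
```

The resulting frame has the columns col1_A, col1_B and col2, in the same order as the array above.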

If you just want to normalize the numerical column, you can feed only that column to the scaler, rather than the entire DataFrame:

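sc = MinMaxScaler()
# ravel() flattens the (n, 1) result back to one value per row
df['col2'] = sc.fit_transform(df.col2.to_numpy()[:,None]).ravel()

print(df)

  col1  col2
0    A   1.0
1    A   1.0
2    B   0.0
3    B   0.0

Or similarly with keras' normalize, applied to the original (unscaled) col2. Note that it divides by the column's L2 norm instead of rescaling to [0, 1], so it gives 0.632456 for the A rows and 0.316228 for the B rows, which are the values asked for in the question:

df['col2'] = normalize(df.col2.to_numpy()).squeeze()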
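
To tie this back to the full question (keep the string column, normalize the numeric ones, then split by group), here is a minimal sketch that uses only pandas and NumPy. It assumes L2 normalization per column, computed with np.linalg.norm rather than keras, and that no numeric column is all zeros; `num_cols`, `normalized` and `groups` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["A", "A", "B", "B"], "col2": [10, 10, 5, 5]})

# Select the numeric columns only; string columns are left untouched.
num_cols = df.select_dtypes(include="number").columns

# Divide each numeric column by its L2 norm.
# Assumption: no numeric column is all zeros (the norm would be 0).
normalized = df.copy()
for col in num_cols:
    normalized[col] = df[col] / np.linalg.norm(df[col].to_numpy())

# Split into one DataFrame per value of col1, resetting each group's index.
groups = {key: g.reset_index(drop=True) for key, g in normalized.groupby("col1")}

print(groups["A"])
print(groups["B"])
```

Because col1 is never passed to the normalization step, it is still available for groupby afterwards.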
