
How to one-hot encode with multiple labels in Python?

I have a table in CSV that looks something like this:

id   attribute
1    Canada
1    United States
2    Germany
3    Canada
4    Germany
4    United States

I want to turn the table above into:

id   attribute.Canada   attribute.UnitedStates   attribute.Germany
1    1.0                1.0                      0.0
2    0.0                0.0                      1.0
3    1.0                0.0                      0.0
4    0.0                1.0                      1.0

I wish to accomplish three things:

  1. Each row has a unique ID.
  2. The values under the "attribute" label become column names that are one-hot encoded.
  3. The new table is exported back to CSV.

I will only give you a head start. Take the unique values of the attribute column and append them to a list that was initialized with 'id'. Then take the unique values of 'id' (i.e. 1, 2, 3, 4) and use them as the index, with the list from the first step as the columns, to initialize the DataFrame.
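The setup step might be sketched as follows. This is only a minimal illustration, assuming the file is comma-separated with an `id,attribute` header (the sample is inlined through `io.StringIO` so the snippet runs on its own). It keeps 'id' as the name of the index rather than as an entry in the column list, and it strips spaces from the labels to match the desired `attribute.UnitedStates` header.

```python
import io

import pandas as pd

# Assumed sample data: the question's table written as comma-separated text.
raw = io.StringIO(
    "id,attribute\n"
    "1,Canada\n"
    "1,United States\n"
    "2,Germany\n"
    "3,Canada\n"
    "4,Germany\n"
    "4,United States\n"
)
df = pd.read_csv(raw)

# Unique attribute values, in order of first appearance, become the columns.
columns = ["attribute." + a.replace(" ", "") for a in df["attribute"].unique()]

# Unique ids become the index; every cell starts out as NaN.
encoded = pd.DataFrame(
    index=pd.Index(df["id"].unique(), name="id"),
    columns=columns,
    dtype=float,
)
```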

Iterate through the unique values of 'id'. For each one, use a regex to read the lines that start with that 'id' value. Extract the attribute values, build a dictionary that maps each of them to 1.0, and append it to the DataFrame. Afterwards, replace the NaN values with 0.0.
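Put together, the whole approach could look roughly like the sketch below. It makes a few substitutions that are assumptions rather than part of the answer: a boolean mask on the parsed 'id' column stands in for the regex over raw lines, `.loc` assignment stands in for appending dictionaries (`DataFrame.append` was removed in pandas 2.0), and `column_name` is a hypothetical helper. The input is assumed to be comma-separated with an `id,attribute` header.

```python
import io

import pandas as pd

# Assumed sample data: the question's table written as comma-separated text.
raw = io.StringIO(
    "id,attribute\n"
    "1,Canada\n"
    "1,United States\n"
    "2,Germany\n"
    "3,Canada\n"
    "4,Germany\n"
    "4,United States\n"
)
df = pd.read_csv(raw)


def column_name(value):
    # Hypothetical helper: "United States" -> "attribute.UnitedStates"
    return "attribute." + value.replace(" ", "")


# Empty frame: unique ids as the index, unique attributes as the columns.
encoded = pd.DataFrame(
    index=pd.Index(df["id"].unique(), name="id"),
    columns=[column_name(v) for v in df["attribute"].unique()],
    dtype=float,
)

for row_id in encoded.index:
    # Select the attributes that belong to this id (mask instead of regex).
    attributes = df.loc[df["id"] == row_id, "attribute"]
    for value in attributes:
        encoded.loc[row_id, column_name(value)] = 1.0

# Cells that were never set are still NaN; turn them into 0.0.
encoded = encoded.fillna(0.0)

# With no path, to_csv returns the CSV text; pass a file name to write a file.
csv_text = encoded.to_csv()
```

If the explicit loop is not required, `pd.crosstab(df["id"], df["attribute"])` builds a similar table in one call, as integer counts with the columns sorted alphabetically.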
