one-hot encoding, access list elements
I have a .csv file, and I want to convert some of its columns to one-hot encoding. The problem occurs in the second-to-last line, where the one-hot index (e.g. the 1st feature) gets set in all rows instead of only the row I am currently processing. It seems to be a problem with how I access the 2D list. Any suggestions? Thank you.
def one_hot_encode(data_list, column):
    one_hot_list = [[]]
    different_elements = []
    for row in data_list[1:]:  # count different elements
        if row[column] not in different_elements:
            different_elements.append(row[column])
    for i in range(len(different_elements)):  # set variable names
        one_hot_list[0].append(different_elements[i])
    vector = []  # create list shape with zeroes
    for i in range(len(different_elements)):
        vector.append(0)
    for i in range(1460):
        one_hot_list.append(vector)
    ind_row = 1  # encode 1 for each sample
    for row in data_list[1:]:
        index = different_elements.index(row[column])
        one_hot_list[ind_row][index] = 1  # mistake!! sets all rows to 1
        ind_row += 1
Your problem stems from the vector object you're creating to do the one-hot encoding: you've created one object, and then built a one_hot_list that contains 1460 references to that same object. When you change one of the rows, the change is reflected in all of the rows.
A quick solution is to create a separate copy of vector for each row (see "How to clone or copy a list?"):
one_hot_list.append(vector[:])
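To see the aliasing concretely, here is a minimal sketch that contrasts appending a copy of a list with appending the same list object. The names zeros, copied and shared are illustrative and not part of the original code:

```python
zeros = [0, 0, 0]

# Appending a shallow copy each time: every row is an independent list.
copied = []
for _ in range(3):
    copied.append(zeros[:])
copied[0][0] = 1

# Appending the same object each time: every row is the very same list.
shared = []
for _ in range(3):
    shared.append(zeros)
shared[0][0] = 1

print(copied)  # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
print(shared)  # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
print(shared[0] is shared[2])  # True
```

The copy made by zeros[:] is shallow, which is enough here because each row only holds integers.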
Some of the other things you're doing in your function are a bit slow or roundabout. I'd suggest a few changes:
def one_hot_encode(data_list, column):
    one_hot_list = [[]]
    # count different elements
    different_elements = set(row[column] for row in data_list[1:])
    # convert different_elements to a list with a canonical order,
    # store in the first element of one_hot_list
    one_hot_list[0] = sorted(different_elements)
    vector = [0] * len(different_elements)  # create list shape with zeroes
    one_hot_list.extend([vector[:] for _ in range(1460)])
    # build a mapping of different_element values to indices into
    # one_hot_list[0]
    index_lookup = dict((e, i) for (i, e) in enumerate(one_hot_list[0]))
    # encode 1 for each sample
    for rindex, row in enumerate(data_list[1:], 1):
        cindex = index_lookup[row[column]]
        one_hot_list[rindex][cindex] = 1
This builds different_elements in linear time by using the set data type, then uses sorted(), list multiplication, and a list comprehension to produce, respectively, one_hot_list[0] (the list of element values being one-hot encoded), the zero vector, and one_hot_list[1:] (the actual one-hot-encoded matrix). There's also a dict called index_lookup that lets you quickly map element values onto their integer index, instead of searching for them over and over again. Finally, your row index into the one_hot_list matrix can be managed for you by enumerate.
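Putting these pieces together, a self-contained sketch might look like the following. Two things here are assumptions rather than part of the code above: the number of output rows is taken from the data instead of the hard-coded 1460, and the function returns one_hot_list so the result can be inspected. The sample data is made up for illustration.

```python
def one_hot_encode(data_list, column):
    """One-hot encode one column of a list of rows whose first row is a header."""
    rows = data_list[1:]  # skip the header row
    # Distinct values in a canonical (sorted) order; this becomes the header.
    labels = sorted(set(row[column] for row in rows))
    # Map each value to its column position in the encoded matrix.
    index_lookup = {value: i for i, value in enumerate(labels)}
    # Assumption: one output row per data row, instead of a fixed 1460.
    one_hot_list = [labels] + [[0] * len(labels) for _ in rows]
    for rindex, row in enumerate(rows, 1):
        one_hot_list[rindex][index_lookup[row[column]]] = 1
    return one_hot_list


# Hypothetical sample data: a header row followed by three samples.
data = [
    ["id", "colour"],
    ["1", "red"],
    ["2", "blue"],
    ["3", "red"],
]
encoded = one_hot_encode(data, 1)
print(encoded)
# [['blue', 'red'], [0, 1], [1, 0], [0, 1]]
```

Sizing the matrix from the data avoids an IndexError when the file has more than 1460 rows and avoids trailing all-zero rows when it has fewer.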
I'm not 100% sure what you are trying to do, but the problem you are seeing is in these lines:
for i in range(1460):
    one_hot_list.append(vector)
These lines build one_hot_list as 1460 references to the same vector of zeros, whereas I think you want a new vector each time. A direct fix is simply to copy it each time:
for i in range(1460):
    one_hot_list.append(vector[:])
But a more Pythonic approach would be to create the list with a comprehension. Perhaps something like this:
vector_size = len(different_elements)
one_hot_list = [[0] * vector_size for i in range(1460)]
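The comprehension matters because it evaluates [0] * vector_size once per row. A tempting shortcut, multiplying the outer list, reproduces the original bug. A small sketch, with a row count of 3 chosen only for illustration:

```python
vector_size = 2

# The comprehension builds a fresh inner list on every iteration.
independent = [[0] * vector_size for _ in range(3)]
independent[0][0] = 1

# Multiplying the outer list repeats a reference to one inner list.
aliased = [[0] * vector_size] * 3
aliased[0][0] = 1

print(independent)  # [[1, 0], [0, 0], [0, 0]]
print(aliased)      # [[1, 0], [1, 0], [1, 0]]
```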
You can use set() to find the unique items in a list:
different_elements = list(set(data[1:]))
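Note that set() needs hashable items, so if data is a list of rows (each row itself a list), the values of one column have to be pulled out first. A hedged sketch, assuming a header row at index 0 and a made-up column index:

```python
# Hypothetical data: header row first, then samples.
data = [["id", "colour"], ["1", "red"], ["2", "blue"], ["3", "red"]]
column = 1  # assumed index of the column to encode

# Collect the column values, then deduplicate; sorted() gives a stable order.
different_elements = sorted(set(row[column] for row in data[1:]))
print(different_elements)  # ['blue', 'red']
```

Sorting is optional, but a plain set has no guaranteed order, so without it the column order of the encoding could differ between runs.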
I suggest you save yourself the hassle of re-implementing this in plain Python. You can use pandas.get_dummies for this:
First, some test data (test.csv):
A
Foo
Bar
Baz
Then in Python:
import pandas as pd
df = pd.read_csv('test.csv')
# convert column 'A' to one-hot encoding
pd.get_dummies(df['A'])
You can retrieve the underlying numpy array using:
pd.get_dummies(df['A']).values
Which results in:
array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]], dtype=uint8)
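For a version that runs without a file on disk, here is a sketch that feeds the same test data through io.StringIO. One caveat worth checking against your installed version: pandas 2.0 and later return boolean columns from get_dummies by default, so dtype=int is passed here to get 0/1 values like those in the output above.

```python
import io

import pandas as pd

# Same contents as test.csv, held in memory instead of on disk.
csv_text = "A\nFoo\nBar\nBaz\n"
df = pd.read_csv(io.StringIO(csv_text))

# dtype=int requests integer 0/1 columns regardless of the pandas default.
dummies = pd.get_dummies(df["A"], dtype=int)

print(list(dummies.columns))        # ['Bar', 'Baz', 'Foo']
print(dummies.to_numpy().tolist())  # [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

The columns come out in sorted order of the distinct values, which is why Foo, the first row of the data, is encoded in the last column.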