one-hot encoding, access list elements

I have a .csv file with data, and I want to transform some of its columns to one-hot. The problem occurs in the second-to-last line, where the one-hot index (e.g. the 1st feature) gets set in all rows instead of just the row I am currently in. It seems to be a problem with how I access the 2D list. Any suggestions? Thank you.

def one_hot_encode(data_list, column):
    one_hot_list = [[]]
    different_elements = []

    for row in data_list[1:]:                  # count different elements
        if row[column] not in different_elements:
            different_elements.append(row[column])

    for i in range(len(different_elements)):   # set variable names
        one_hot_list[0].append(different_elements[i])

    vector = []                              # create list shape with zeroes
    for i in range(len(different_elements)):
        vector.append(0)
    for i in range(1460):
        one_hot_list.append(vector)

    ind_row = 1                                # encode 1 for each sample
    for row in data_list[1:]:
        index = different_elements.index(row[column])
        one_hot_list[ind_row][index] = 1     # mistake!! sets all rows to 1
        ind_row += 1

Your problem stems from the vector object you're creating to do the one-hot encoding: you've created one object, and then built a one_hot_list that contains 1460 references to that same object. When you change one of the rows, the change shows up in all of the rows.
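The aliasing is easy to reproduce in isolation. Here is a minimal sketch; the names shared and rows are made up for illustration and do not come from the question:

```python
# One list object, appended three times: every row is the same object.
shared = [0, 0, 0]
rows = []
for _ in range(3):
    rows.append(shared)

rows[0][1] = 1  # intended to change only the first row

# All three entries point at the same list, so all of them show the change.
all_same_object = all(r is shared for r in rows)
print(rows)             # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
print(all_same_object)  # True
```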

A quick solution would be to create a separate copy of the vector for each row (see "How to clone or copy a list?"):

one_hot_list.append(vector[:])

Some of the other things you're doing in your function are a bit slow or roundabout. I'd suggest a few changes:

def one_hot_encode(data_list, column):
    one_hot_list = [[]]

    # count different elements
    different_elements = set(row[column] for row in data_list[1:])

    # convert different_elements to a list with a canonical order,
    # store in the first element of one_hot_list
    one_hot_list[0] = sorted(different_elements)

    vector = [0] * len(different_elements)   # create list shape with zeroes
    one_hot_list.extend([vector[:] for _ in range(1460)])

    # build a mapping of different_element values to indices into
    # one_hot_list[0]
    index_lookup = dict((e,i) for (i,e) in enumerate(one_hot_list[0]))
    # encode 1 for each sample
    for rindex, row in enumerate(data_list[1:], 1):
        cindex = index_lookup[row[column]]
        one_hot_list[rindex][cindex] = 1

This builds different_elements in linear time by using the set data type. It uses sorted to produce one_hot_list[0] (the list of element values being one-hot encoded), list multiplication for the zero vector, and a list comprehension for one_hot_list[1:] (the actual one-hot-encoded matrix). There's also a dict called index_lookup that lets you quickly map element values onto their integer index, instead of searching for them over and over again. Finally, your row index into the one_hot_list matrix can be managed for you by enumerate.
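For a version that can be run end to end, here is a sketch under a few assumptions that go beyond the question: the number of rows is taken from the data instead of the hard-coded 1460, the function returns its result (the code in the question never does), and the sample data is invented for the demonstration.

```python
def one_hot_encode(data_list, column):
    """Return [header_row, *encoded_rows] for one column of data_list.

    Assumes data_list[0] is a header row and every later row is a list.
    """
    samples = data_list[1:]
    # Unique values in a canonical (sorted) order become the header.
    categories = sorted(set(row[column] for row in samples))
    # Map each value to its column position in the encoded matrix.
    index_lookup = {value: i for i, value in enumerate(categories)}

    one_hot_list = [categories]
    for row in samples:
        vector = [0] * len(categories)   # a fresh list for every sample
        vector[index_lookup[row[column]]] = 1
        one_hot_list.append(vector)
    return one_hot_list


# Hypothetical sample data, not taken from the question's .csv file.
data = [["Id", "Zone"], [1, "RL"], [2, "RM"], [3, "RL"]]
encoded = one_hot_encode(data, 1)
print(encoded)  # [['RL', 'RM'], [1, 0], [0, 1], [1, 0]]
```

Because each vector is created inside the loop, no two rows can share the same list object.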

I'm not 100% sure what you are trying to do, but the problem you are seeing is in these lines:

for i in range(1460):
    one_hot_list.append(vector)

These build one_hot_list out of 1460 references to the same vector of zeros, whereas I think you want a new vector each time. A direct fix would be to copy it each time:

for i in range(1460):
    one_hot_list.append(vector[:])

But a more Pythonic approach would be to create the list with a comprehension. Perhaps something like this:

vector_size = len(different_elements)
one_hot_list = [ [0] * vector_size for i in range(1460)]
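The comprehension matters for the outer list only. Multiplying the outer list instead would bring the shared-reference problem straight back, as this small sketch suggests (num_rows is a stand-in for the 1460 rows in the question, and the other names are illustrative):

```python
vector_size = 3
num_rows = 2  # stand-in for the 1460 rows in the question

# Comprehension: the inner [0] * vector_size is evaluated once per row.
independent = [[0] * vector_size for _ in range(num_rows)]
independent[0][0] = 1

# Outer multiplication: one inner list, referenced num_rows times.
aliased = [[0] * vector_size] * num_rows
aliased[0][0] = 1

print(independent)  # [[1, 0, 0], [0, 0, 0]]
print(aliased)      # [[1, 0, 0], [1, 0, 0]]
```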

You can use set() to collect the unique items in a list:

different_elements = list(set(row[column] for row in data_list[1:]))
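One caveat with set() is that it makes no promise about the order of the values, so the one-hot columns could come out in a different order from run to run. Two common ways to get a stable order are sketched below, using invented sample rows: sorting the set, or using dict.fromkeys to keep the order of first appearance.

```python
# Hypothetical rows; the first row is the header, as in the question.
data_list = [["Id", "Zone"], [1, "RM"], [2, "RL"], [3, "RM"]]
column = 1

values = [row[column] for row in data_list[1:]]

# set() removes duplicates; sorting gives a canonical order.
unique_sorted = sorted(set(values))

# dict.fromkeys() removes duplicates and keeps first-seen order
# (dict insertion order is guaranteed from Python 3.7).
unique_in_order = list(dict.fromkeys(values))

print(unique_sorted)    # ['RL', 'RM']
print(unique_in_order)  # ['RM', 'RL']
```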

I suggest you save yourself the hassle of re-implementing this in plain Python. You can use pandas.get_dummies for this:

First, some test data (test.csv):

A
Foo
Bar
Baz

Then in Python:

import pandas as pd

df = pd.read_csv('test.csv')
# convert column 'A' to one-hot encoding
pd.get_dummies(df['A'])

You can retrieve the underlying numpy array using:

pd.get_dummies(df['A']).values

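Which results in the following (this is the output from pandas versions before 2.0; from 2.0 on, get_dummies produces boolean columns unless you pass dtype):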

array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]], dtype=uint8)
