Substituting values in NumPy array using a dictionary is giving ambiguous results, why is that?

So, I have an array with some words and I'm trying to perform one-hot encoding.

Let's say the input is: AI DSA DSA AI ML ML AI DS DS AI C AI ML ML C

This is my code:

import numpy as np

def apply_one_hot_encoding(X):
    dic = {}
    k = sorted(list(set(X)))
    for i in range(len(k)):
        arr = ['0' for i in range(len(k))]
        arr[i] = '1'
        dic[k[i]] = ''.join(arr)
    
    for i in range(len(X)):
        t = dic[X[i]]
        X[i] = t
         
    return X

if __name__ == "__main__":
    X = np.array(list(input().split()))
    
    one_hot_encoded_array = apply_one_hot_encoding(X)
    for i in one_hot_encoded_array:
        print(*i)

Now, I would expect the output to look like:

1 0 0 0 0 
0 0 0 1 0 
0 0 1 0 0 

But what I'm getting is:

1 0 0
0 0 1
1 0 0

If I append the t values to another list and return that, I get the right results.

Why is the assigned value being trimmed to just 3 characters in the case of direct substitution?

The problem is caused by the dtype (data type) of the NumPy array.

When you check the data type of the NumPy array in the above program using print(X.dtype), it shows <U3. That is a fixed-width Unicode string type that can hold at most three characters per element of X, so any longer string assigned to an element is silently truncated to three characters.
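
To see the truncation in isolation, here is a minimal sketch. The three-word sample array is an assumption chosen for illustration, not the asker's full input.

```python
import numpy as np

# The longest word has three characters, so NumPy picks a 3-character string dtype.
X = np.array(["AI", "DSA", "ML"])
dtype_before = X.dtype

# Assigning a longer string raises no error; the value is cut to fit the dtype.
X[0] = "10000"
truncated = str(X[0])

print(dtype_before, truncated)
```

The dtype of X does not change after the assignment, which is why the extra characters have nowhere to go.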

Since the input contains five categories, each one-hot string is five characters long. Creating the array with X = np.array(list(input().split()), dtype='<U5') sets the dtype to <U5, which can hold up to five characters per element of X.

The corrected code is:

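import numpy as np

def apply_one_hot_encoding(X):
    # Map each distinct word to a one-hot string, e.g. 'AI' -> '10000'.
    dic = {}
    k = sorted(set(X))
    for i in range(len(k)):
        arr = ['0' for _ in range(len(k))]
        arr[i] = '1'
        dic[k[i]] = ''.join(arr)

    # Substitute in place; X must be wide enough to hold each one-hot string.
    for i in range(len(X)):
        X[i] = dic[X[i]]

    return X

if __name__ == "__main__":
    # '<U5' reserves five characters per element, enough for five categories.
    X = np.array(input().split(), dtype='<U5')

    one_hot_encoded_array = apply_one_hot_encoding(X)
    for i in one_hot_encoded_array:
        print(*i)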
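
Hard-coding '<U5' only works for exactly five categories. As a hedged sketch of one way to generalize, the width can be computed from the number of categories before substituting. The function name one_hot_widened is hypothetical, and the input is hard-coded here instead of being read with input().

```python
import numpy as np

def one_hot_widened(X):
    """Hypothetical variant: widen the string dtype, then substitute."""
    words = X.tolist()
    categories = sorted(set(words))
    n = len(categories)
    # Build one n-character one-hot string per category.
    codes = {
        c: "".join("1" if j == i else "0" for j in range(n))
        for i, c in enumerate(categories)
    }
    # The array must hold n characters, or the longest word if that is longer.
    width = max(n, max(len(w) for w in words))
    out = X.astype(f"U{width}")  # astype returns a widened copy of X
    for i, w in enumerate(words):
        out[i] = codes[w]
    return out

X = np.array("AI DSA DSA AI ML ML AI DS DS AI C AI ML ML C".split())
encoded = one_hot_widened(X)
for row in encoded[:3]:
    print(*row)
```

Because astype returns a copy, the caller's original array is left unchanged in this sketch.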

The above method is not needed when you store the values in a separate list and build a new NumPy array from it, since NumPy then chooses the string width automatically from the longest string.
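
A minimal sketch of that separate-array approach, assuming the same sample input as the question (hard-coded here instead of read with input()); the variable names are illustrative:

```python
import numpy as np

X = np.array("AI DSA DSA AI ML ML AI DS DS AI C AI ML ML C".split())

categories = sorted(set(X.tolist()))
n = len(categories)
codes = {
    c: "".join("1" if j == i else "0" for j in range(n))
    for i, c in enumerate(categories)
}

# Collect the encoded strings in a plain list, then build a new array.
# NumPy infers the string width from the longest element in the list.
encoded = np.array([codes[w] for w in X.tolist()])

print(X.dtype, encoded.dtype)
```

Here X keeps its original three-character dtype, while the new array gets a width that fits the five-character codes.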
