
When I use pd.crosstab it keeps showing AssertionError

When I use pd.crosstab to build confusion matrices, it keeps raising:

AssertionError: arrays and names must have the same length

import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
import random

df = pd.read_csv('C:\\Users\\liukevin\\Desktop\\winequality-red.csv',sep=';', usecols=['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol','quality'])

Q=[]

for i in range(len(df)):
    if df['quality'][i]<=5:
        Q.append('Low')
    else:
        Q.append('High')

del df['quality']
test_number=sorted(random.sample(xrange(len(df)), int(len(df)*0.25)))
train_number=[]
temp=[]
for i in range(len(df)):
    temp.append(i)
train_number=list(set(temp)-set(test_number))

distance_all=[]
for i in range(len(test_number)):
    distance_sep=[]
    for j in range(len(train_number)):
        distance=pow(df['fixed acidity'][test_number[i]]-df['fixed acidity'][train_number[j]],2)+\
        pow(df['volatile acidity'][test_number[i]]-df['volatile acidity'][train_number[j]],2)+\
        pow(df['citric acid'][test_number[i]]-df['citric acid'][train_number[j]],2)+\
        pow(df['residual sugar'][test_number[i]]-df['residual sugar'][train_number[j]],2)+\
        pow(df['chlorides'][test_number[i]]-df['chlorides'][train_number[j]],2)+\
        pow(df['free sulfur dioxide'][test_number[i]]-df['free sulfur dioxide'][train_number[j]],2)+\
        pow(df['total sulfur dioxide'][test_number[i]]-df['total sulfur dioxide'][train_number[j]],2)+\
        pow(df['density'][test_number[i]]-df['density'][train_number[j]],2)+\
        pow(df['pH'][test_number[i]]-df['pH'][train_number[j]],2)+\
        pow(df['sulphates'][test_number[i]]-df['sulphates'][train_number[j]],2)+\
        pow(df['alcohol'][test_number[i]]-df['alcohol'][train_number[j]],2)
        distance_sep.append(distance)
    distance_all.append(distance_sep)

for round in range(5):
    K=2*round+1

    select_neighbor_all=[]
    for i in range(len(test_number)):
        select_neighbor_sep=np.argsort(distance_all[i])[:K]
        select_neighbor_all.append(select_neighbor_sep)

    prediction=[]
    Q_test=[]
    for i in range(len(test_number)):
        Q_test.append(Q[test_number[i]])
        #original data
        Low_count=0
        for j in range(K):
            if Q[train_number[select_neighbor_all[i][j]]]=='Low':
                Low_count+=1
        if Low_count>(K/2):
            prediction.append('Low')
        else:
            prediction.append('High')

    print pd.crosstab(Q_test, prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)

But aren't Q_test and prediction the same length? I guess the problem might be the "names must have the same length" part, because I am not really sure what it means. (The Q_test and prediction lists contain only the two labels 'Low' and 'High'.) select_neighbor_all holds the K nearest neighbors I selected for the i-th test sample.

It appears that you may not be providing all the data that pd.crosstab needs to perform the necessary calculations.

Take a look at this example. Here we provide an index, two column categories, and both rownames and colnames:

>>> index = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
...                   "bar", "bar", "foo", "foo", "foo"], dtype=object)
>>> col_category_1 = np.array(["one", "one", "one", "two", "one", "one",
...                            "one", "two", "two", "two", "one"], dtype=object)
>>> col_category_2 = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
...                            "shiny", "dull", "shiny", "shiny", "shiny"],
...                            dtype=object)


# Notice the index AND the columns provided as a list
>>> pd.crosstab(index, [col_category_1, col_category_2],
...             rownames=['a'], colnames=['b', 'c'])
b    one        two
c   dull shiny dull shiny
a
bar    1     2    1     0
foo    2     2    1     2

For more details, see the pandas documentation for pd.crosstab:

index : array-like, Series, or list of arrays/Series. Values to group by in the rows.

columns : array-like, Series, or list of arrays/Series. Values to group by in the columns.

rownames : sequence, default None. If passed, must match number of row arrays passed.

colnames : sequence, default None. If passed, must match number of column arrays passed.
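The rownames/colnames rule quoted above can be illustrated with a small sketch. The labels below are made up for illustration (they are not the wine data from the question), and the inputs are NumPy arrays so that each one counts as a single array. Passing one name per array works; passing two row names for a single row array should reproduce the error message from the question.

```python
import numpy as np
import pandas as pd

# Made-up labels for illustration only (not the data from the question).
actual = np.array(["Low", "High", "Low", "High"], dtype=object)
predicted = np.array(["Low", "Low", "High", "High"], dtype=object)

# One row array and one column array, so one name each: this works.
table = pd.crosstab(actual, predicted,
                    rownames=["Actual"], colnames=["Predicted"])
print(table)

# Two row names for a single row array: the counts no longer match,
# so pandas is expected to raise the AssertionError seen in the question.
try:
    pd.crosstab(actual, predicted,
                rownames=["Actual", "Extra"], colnames=["Predicted"])
    error_message = None
except AssertionError as exc:
    error_message = str(exc)
print(error_message)
```

In other words, the "length" in the message refers to the number of arrays and the number of names, not to the number of elements inside each array.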

If you edit the following line and include the correct inputs, it should solve your problem:

# You will need to provide an index and columns...
# Here, 'Q_test' is being interpreted as your index
# 'prediction' is being used as a column... 
pd.crosstab(Q_test, prediction, 
            rownames=['Actual'], 
            colnames=['Predicted'],
            margins=True)

I just spent some time resolving this. In my case, the problem was that pandas crosstab does not seem to work with plain lists.

If you convert your lists to NumPy arrays, it should work fine.

So in your case it would be:

pd.crosstab(np.array(Q_test), np.array(prediction), rownames=['Actual'],
            colnames=['Predicted'], margins=True)

An example:

>>> import pandas as pd
>>> import numpy as np
>>> classifications = ['foo', 'bar', 'foo', 'bar']
>>> predictions = ['foo', 'foo', 'bar', 'bar']
>>> pd.crosstab(classifications, predictions, rownames=['Actual'], colnames=['Predicted'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bastian/miniconda3/envs/machine_learning/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 563, in crosstab
    rownames = _get_names(index, rownames, prefix="row")
  File "/home/bastian/miniconda3/envs/machine_learning/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 703, in _get_names
    raise AssertionError("arrays and names must have the same length")
AssertionError: arrays and names must have the same length
>>> pd.crosstab(np.array(classifications), np.array(predictions), rownames=['Actual'], colnames=['Predicted'])
Predicted  bar  foo
Actual             
bar          1    1
foo          1    1

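I think this happens because crosstab in this pandas version treats a plain Python list as a list of separate arrays (one per element), whereas a NumPy array is treated as a single array. With a list, the number of arrays therefore no longer matches the single name passed in rownames.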
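Another way to avoid passing plain lists, sketched here as an alternative that is not part of the original answer, is to wrap each list in a named pd.Series. The Series names then become the axis labels, so rownames and colnames are not needed at all. The variable names actual_labels and predicted_labels and their contents are made up for illustration; in the question they would correspond to Q_test and prediction.

```python
import pandas as pd

# Made-up labels standing in for Q_test and prediction from the question.
actual_labels = ["Low", "High", "Low", "High", "Low"]
predicted_labels = ["Low", "Low", "Low", "High", "High"]

# Wrapping each list in a named Series makes it a single array-like input,
# and the Series name is used as the axis label in the resulting table.
actual = pd.Series(actual_labels, name="Actual")
predicted = pd.Series(predicted_labels, name="Predicted")

# margins=True adds the "All" row and column with the totals.
table = pd.crosstab(actual, predicted, margins=True)
print(table)
```

This keeps the call short and should behave the same whether or not the installed pandas version accepts plain lists.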
