Reading CSV & Columns - KeyError: “None of [Int64Index([0, 1, 2, 3], dtype='int64')] are in the [columns]”

Question

I am having trouble running a collinearity analysis on a simple DataFrame (see below). Every time I run the function, I get the following error message:

KeyError: "None of [Int64Index([0, 1, 2, 3], dtype='int64')] are in the [columns]"

Below is the code I am using:

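import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

read_training_set = pd.read_csv('C:\\Users\\rapha\\Desktop\\New test\\Classeur1.csv', sep=";")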
training_set = pd.DataFrame(read_training_set)

print(training_set)

def calculate_vif_(X):
    thresh = 5.0
    variables = range(X.shape[1])

    for i in np.arange(0, len(variables)):
        vif = [variance_inflation_factor(X[variables].values, ix) for ix in range(X[variables].shape[1])]
        print(vif)

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            del variables[maxloc]

    print('Remaining variables:')
    print(X.columns[variables])
    return X

X = training_set
X2 = calculate_vif_(X)

The DF on which I am trying to run my function looks like this.

   Year  Age  Weight  Size
0  2020   10     100   170
1  2021   11     101   171
2  2022   12     102   172
3  2023   13     103   173
4  2024   14     104   174
5  2025   15     105   175
6  2026   16     106   176
7  2027   17     107   177
8  2028   18     108   178

I have two guesses, but I am not sure how to fix either:

-Guess 1: np.arange is causing some sort of conflict with the header and columns, which prevents the rest of the function from iterating through each column.

-Guess 2: The problem comes from blank separators, which prevent the function from moving from one column to the next properly. My CSV file already has ";" separators (I do not know exactly why, to be honest, as I created the file manually and saved it as a regular CSV with "," separators).

I am not sure how to fix the problem at this point. Does anyone have any insights?

Best

Answer 1

The error is caused by this snippet: X[variables].values. Convert variables, which is a range, to a list.
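One reason the list conversion matters here is the later del variables[maxloc] line: a range is immutable, so deleting an item from it fails, while a list supports it. A minimal sketch using only built-ins (the flag name range_supports_del is made up for illustration):

```python
# A range cannot have items deleted from it.
variables = range(4)
try:
    del variables[1]
    range_supports_del = True
except TypeError:
    range_supports_del = False

# A list built from the same range can.
variables = list(range(4))
del variables[1]

print(range_supports_del)
print(variables)
```

Note that the conversion on its own does not change how X[...] interprets the integers; it only makes the variables object mutable.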

As an aside, the code is very confusing. Why are you calling np.arange when variables is already a range? Why are you using a range of the number of columns to index the DataFrame?

It looks from the comments above like you think you are indexing columns by column number, but X[variables] looks those numbers up as column labels, which is why the KeyError says none of them are in the [columns]. Some of this confusion would be cleared up if you used loc or iloc to be explicit about what you are trying to index.
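To make the difference between label-based and position-based selection concrete, here is a small sketch on the first rows of the DataFrame from the question. The variable names (positions, label_lookup_failed, by_position, by_label) are only for illustration.

```python
import pandas as pd

# First three rows of the DataFrame shown in the question.
X = pd.DataFrame({
    "Year": [2020, 2021, 2022],
    "Age": [10, 11, 12],
    "Weight": [100, 101, 102],
    "Size": [170, 171, 172],
})

positions = list(range(X.shape[1]))  # [0, 1, 2, 3]

# Plain brackets treat the integers as column labels; no column is
# named 0, 1, 2 or 3, so pandas raises a KeyError.
try:
    X[positions]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# iloc selects by integer position, loc selects by label.
by_position = X.iloc[:, positions]
by_label = X.loc[:, ["Year", "Age"]]

print(label_lookup_failed)
print(list(by_position.columns))
print(list(by_label.columns))
```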

Answer 2

Got it, I revised the whole thing and it seems to be working. See below how it looks.

Thanks a lot for the help

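from statsmodels.stats.outliers_influence import variance_inflation_factor


def calculate_vif_(X):
    thresh = 5.0
    # Track the remaining columns by integer position (a list, so entries can be deleted).
    variables = list(range(X.shape[1]))

    # Drop at most one column per pass; stop once every VIF is within the threshold.
    # A while loop avoids iterating over the list that is being modified.
    while len(variables) > 1:
        vif = [variance_inflation_factor(X.iloc[:, variables].values, ix)
               for ix in range(X.iloc[:, variables].shape[1])]

        maxloc = vif.index(max(vif))
        if max(vif) <= thresh:
            break
        print('dropping \'' + X.iloc[:, variables].columns[maxloc] +
              '\' at index: ' + str(maxloc))
        del variables[maxloc]

    print('Remaining variables:')
    print(X.columns[variables])
    return X.iloc[:, variables]


X = training_set
X2 = calculate_vif_(X)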
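For anyone wondering what the numbers in the vif list mean: the VIF of a column is 1 / (1 - R^2), where R^2 comes from regressing that column on the other columns. Below is a rough NumPy-only sketch of that formula. The helper name vif_by_position is hypothetical, and the sketch adds an intercept itself, which is an assumption and may not match what variance_inflation_factor does with the raw matrix you pass in.

```python
import numpy as np
import pandas as pd


def vif_by_position(X, pos):
    """Hypothetical helper: VIF of the column at integer position `pos`."""
    y = X.iloc[:, pos].to_numpy(dtype=float)
    others = X.drop(columns=X.columns[pos]).to_numpy(dtype=float)

    # Assumption: regress on the other columns plus an intercept.
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    resid = y - A @ coef
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot

    # An R^2 within np.isclose's default tolerance of 1 is reported as infinite VIF.
    return np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)


# Two uncorrelated columns: the VIF should be about 1.
demo = pd.DataFrame({"a": [-2, -1, 0, 1, 2], "b": [2, -1, -2, -1, 2]})
vif_uncorrelated = vif_by_position(demo, 0)

# Year and Age from the question move in lockstep, so the VIF blows up.
collinear = pd.DataFrame({"Year": [2020, 2021, 2022, 2023],
                          "Age": [10, 11, 12, 13]})
vif_collinear = vif_by_position(collinear, 1)

print(vif_uncorrelated, vif_collinear)
```

In the sample data from the question every column increases by exactly 1 per row, so very large or infinite VIFs are to be expected there.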

2 answers

solution1
1 2020-04-19 13:52:38

solution2
0 ACCEPTED 2020-04-19 14:34:06
