
python csv file reading: turning the first row into column headers, next(reader) returns unwanted characters

I'm currently writing some code to read csv files with pandas, and I need the first row of the file as a list so I can use it for some descriptives (see code Part1). I can use the pandas.read_csv parameter header=0, which reads the column headers automatically, but as far as I know it does not return a list. In the commented-out line inside print(), names is the list I used to pass column headers to pandas.read_csv manually, but I'd like that to be automatic (so when I add or delete columns I don't have to edit the list of names by hand).

So, to work around this, I came up with the idea to just separately read in the first row using csv.reader and get a list with column names that I can use in pandas.read_csv that way (see code Part2).

Part1 pandas csv reading and printing descriptives of the data

import pandas as pd
filename = 'test.csv'
dataheadsize = 10
data = pd.read_csv(filename, sep=";", header=0, decimal=",") 

# (I used to pass a list of names here instead of header=0)

print('Descriptives:\n', data.describe(), '\n\n',
'Datasheet (', dataheadsize, 'rows shown):\n', data.head(dataheadsize),
#'Count per class:\n',data.groupby(names[0]).size(),'\n\n',
)

Part2 trying to get the first row of the csv to be read into a list

import csv
file = open(filename, 'r')
reader = csv.reader(file, delimiter=';')
names = next(reader)
print(names)

This gives me the list that I need, but for some reason it reads in some additional unwanted characters at index [0]. This is what is returned by print():

['ï»¿VAR00001', 'VAR00002', 'VAR00003']

As you can see, I don't want those 'ï»¿' characters to be returned, and I wonder what the best way to avoid them is. I'd like it to be as automatic as possible for future use, which is why I don't want to just remove the characters by slicing: I don't know whether those characters change depending on the csv file, whether the number of them changes, etc.

For reference, these are the first 5 rows of the .csv file:

VAR00001;VAR00002;VAR00003
1;2;4
1;2;4
0;5;4
0;1;4

As you can probably tell by now, I'm not the most experienced coder, so if there's a way to skip the whole 'separately reading in the csv just to get the column names into a list' part, please do let me know, because I couldn't figure that out!

I don't know why those characters are being added, but why not try:

list(data.keys())

If all else fails you can manually remove it.

def FixHeader(headerArr):
    # Return a copy of headerArr with the leading character(s) of the
    # first header removed; all other headers are copied unchanged.
    newHeaderArr = []
    for i, header in enumerate(headerArr):
        if i == 0:
            # 1 being how many chars you want to remove
            newHeaderArr.append(header[1:])
        else:
            newHeaderArr.append(header)
    return newHeaderArr
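
Slicing off a fixed number of characters assumes the junk always has the same length. Below is a minimal sketch of a more defensive variant, assuming the stray characters are a UTF-8 byte order mark (BOM). It removes the prefix only when the first header actually starts with one, either as the decoded character '\ufeff' or as the three characters you get when the BOM bytes are misread as latin-1/cp1252. The name strip_bom is made up for this example.

```python
# Hypothetical helper; assumes the unwanted prefix is a UTF-8 BOM.
# First form: the BOM decoded correctly. Second form: the BOM bytes
# EF BB BF misread one byte at a time as latin-1/cp1252.
BOM_FORMS = ("\ufeff", "\u00ef\u00bb\u00bf")


def strip_bom(names):
    """Return a copy of names with a leading BOM removed from the first entry."""
    if not names:
        return []
    first = names[0]
    for bom in BOM_FORMS:
        if first.startswith(bom):
            first = first[len(bom):]
            break
    return [first] + list(names[1:])


print(strip_bom(["\ufeffVAR00001", "VAR00002", "VAR00003"]))
```

Headers without a BOM pass through unchanged, so the helper is safe to call on every file.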

You can use the nrows argument to pd.read_csv to read in column labels separately:

# read in column labels as list
cols = pd.read_csv('file.csv', nrows=0).columns.tolist()

# read in data; use default pd.RangeIndex, i.e. 0, 1, 2, etc., as columns
data = pd.read_csv('file.csv', header=None, skiprows=[0])

If you need to specify an encoding, you can do so via the encoding argument, e.g. encoding='latin-1'.
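
For the specific case of a UTF-8 file that starts with a byte order mark, Python's utf-8-sig codec decodes the file and drops the BOM if one is present. The sketch below assumes that is what is happening here. It writes a small stand-in file (its contents are made up for the example) and reads the header row with csv.reader. The same codec name can also be passed as the encoding argument to pd.read_csv.

```python
import csv
import os
import tempfile

# Build a stand-in CSV that starts with a UTF-8 BOM (bytes EF BB BF).
sample = b"\xef\xbb\xbfVAR00001;VAR00002;VAR00003\n1;2;4\n0;5;4\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # 'utf-8-sig' strips a leading BOM if there is one and otherwise
    # behaves like plain 'utf-8'.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        reader = csv.reader(f, delimiter=";")
        names = next(reader)
finally:
    os.remove(path)  # clean up the stand-in file

print(names)
```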

Thanks for the rapid replies guys!

Just fyi, when I change the encoding to utf-8 I get this list

['\ufeffVAR00001', 'VAR00002', 'VAR00003']

and when I use latin-1 it doesn't change anything compared to the list I originally posted. I'm sure this would work, though, provided I figure out the correct encoding.
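
One explanation that fits this kind of result is a UTF-8 byte order mark at the start of the file: three bytes that become a single invisible character under utf-8 and three visible characters under latin-1. The sketch below assumes that is the case and decodes a made-up header line three ways.

```python
# Assumed raw bytes of the header line: UTF-8 BOM (EF BB BF) plus the text.
raw = b"\xef\xbb\xbfVAR00001;VAR00002;VAR00003"

as_utf8 = raw.decode("utf-8")       # BOM becomes the single character U+FEFF
as_latin1 = raw.decode("latin-1")   # each BOM byte becomes its own character
as_sig = raw.decode("utf-8-sig")    # BOM is recognised and dropped

# ascii() makes the otherwise invisible or non-ASCII characters visible.
print(ascii(as_utf8.split(";")[0]))
print(ascii(as_latin1.split(";")[0]))
print(ascii(as_sig.split(";")[0]))
```

If the first two lines show extra characters and the third does not, utf-8-sig is the encoding to pass when opening the file.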

However, I'm using list(data.keys()) as it was suggested and that works like a charm while also completely removing the need to read in anything separately. Thanks a bunch to everyone who responded!


 