
Python read csv file columns into lists, ignoring headers

I have a file 'data.csv' that looks something like

ColA, ColB, ColC
1,2,3
4,5,6
7,8,9

I want to open the file and read its columns into lists, omitting the first entry (the header) of each list, e.g.

dataA = [1,4,7]
dataB = [2,5,8]
dataC = [3,6,9]

In reality there are more than 3 columns and the lists are very long; this is just an example of the format. I've tried:

csv_file = open('data.csv','rb')
csv_array = []

for row in csv.reader(csv_file, delimiter=','):
    csv_array.append(row)

I would then assign each index of csv_array to a list, e.g.

dataA = [int(i) for i in csv_array[0]]

But I'm getting errors:

_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

Also, it feels like a very long-winded way of just saving data to a few lists...

Thanks!

edit:

Here is how I solved it:

import pandas as pd

df = pd.read_csv('data.csv', names=['ColA', 'ColB', 'ColC'])

# Because names= is passed, row 0 holds the header text, so skip it
dataA = list(map(int, df.ColA.tolist()[1:]))

and repeat for the rest of the columns.

Just to spell this out for people trying to solve a similar problem, perhaps without Pandas, here's a simple refactoring with comments.

import csv

# Open the file in 'r' mode, not 'rb'
csv_file = open('data.csv','r')
dataA = []
dataB = []
dataC = []

# Read off and discard first line, to skip headers
csv_file.readline()

# Split columns while reading
for a, b, c in csv.reader(csv_file, delimiter=','):
    # Append each variable to a separate list
    dataA.append(a)
    dataB.append(b)
    dataC.append(c)

This does nothing to convert the individual fields to numbers (use append(int(a)) etc if you want that) but should hopefully be explicit and flexible enough to show you how to adapt this to new requirements.
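Since the question mentions more than three columns, the same loop can be generalized so that the number of lists follows the header. The following is only a sketch: the function name read_columns is made up for illustration, and it assumes every data field is an integer and that the header may contain spaces after the commas.

```python
import csv
import os
import tempfile


def read_columns(path):
    """Return (header, columns), where columns holds one list per CSV column."""
    with open(path, "r", newline="") as csv_file:
        # skipinitialspace drops the blank after each comma in "ColA, ColB, ColC"
        reader = csv.reader(csv_file, skipinitialspace=True)
        header = next(reader)            # read and set aside the header row
        columns = [[] for _ in header]   # one empty list per column
        for row in reader:
            for column, field in zip(columns, row):
                column.append(int(field))  # assumption: all fields are integers
    return header, columns


# Demo on a throwaway copy of the sample data from the question.
sample = "ColA, ColB, ColC\n1,2,3\n4,5,6\n7,8,9\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "data.csv")
    with open(sample_path, "w", newline="") as f:
        f.write(sample)
    header, (dataA, dataB, dataC) = read_columns(sample_path)

print(header)
print(dataA, dataB, dataC)
```

With many columns you would probably keep the list of lists (or build a dict keyed by header name) rather than unpack into individual variables.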

Use Pandas:

import pandas as pd

df = pd.read_csv(path)  # DataFrame.from_csv was removed in pandas 1.0
rows = df.apply(lambda x: x.tolist(), axis=1)

To skip the header, create your reader on a separate line and call next() on it once. Then, to convert from a list of rows to a list of columns, use zip():

import csv

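# Python 3: open in text mode with newline='' (Python 2 used 'rb')
with open('data.csv', 'r', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    # zip(*rows) transposes rows into columns; list() makes it indexable
    data = list(zip(*[map(int, row) for row in csv_input]))

print(data)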

Giving you:

[(1, 4, 7), (2, 5, 8), (3, 6, 9)]

So if needed:

dataA = data[0]

It seems your CSV file has Mac-style line endings (bare carriage returns, as some Mac spreadsheet exports produce). Try saving the file in the "Windows Comma Separated (.csv)" format.
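If re-saving the file is not convenient, the line endings can also be handled on the Python side. The error message comes from Python 2, where opening the file with mode 'rU' (universal newlines) is the usual workaround. In Python 3, opening the file in text mode with newline='' lets the csv module cope with bare carriage returns. The sketch below assumes Python 3 and uses a made-up temporary file, data_cr.csv, to imitate such a file.

```python
import csv
import os
import tempfile

# Hypothetical sample: the question's data, but with bare "\r" line endings.
cr_bytes = b"ColA, ColB, ColC\r1,2,3\r4,5,6\r7,8,9\r"

with tempfile.TemporaryDirectory() as tmp_dir:
    cr_path = os.path.join(tmp_dir, "data_cr.csv")
    with open(cr_path, "wb") as f:
        f.write(cr_bytes)

    # newline="" is what the csv module documentation recommends in Python 3.
    with open(cr_path, "r", newline="") as csv_file:
        reader = csv.reader(csv_file)
        next(reader)                  # skip the header row
        rows = [row for row in reader]

print(rows)
```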

There are also easier ways to do what you're doing with the csv reader:

csv_array = []
with open('data.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    # remove headers
    next(reader)
    # loop over rows in the file, append them to your array. each row is already formatted as a list.
    for row in reader:
        csv_array.append(row)

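You can then index into csv_array. Note that csv_array[0] is the first data row, ['1', '2', '3'], not column A; for the first column use dataA = [row[0] for row in csv_array].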

First, if you read the CSV file with csv.reader(csv_file, delimiter=','), you will still read the header.

csv_array[0] will be the header row -> ['ColA', ' ColB', ' ColC']

Also, if you're on a Mac, this issue has already been discussed here: CSV new-line character seen in unquoted field error

I would also recommend using pandas and numpy if you plan to do more analysis on the data. pandas.read_csv reads the CSV file into a pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Use csv.DictReader() to select specific columns by name:

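import csv

dataA = []
dataB = []
with open('data.csv', 'r') as csv_file:
    # skipinitialspace=True strips the blank after each comma, so the
    # header "ColA, ColB, ColC" yields the keys 'ColA', 'ColB', 'ColC'
    csv_reader = csv.DictReader(csv_file, delimiter=',', skipinitialspace=True)
    for row in csv_reader:
        dataA.append(row['ColA'])
        dataB.append(row['ColB'])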
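When there are many columns, the same DictReader idea can collect every column into a dict of lists instead of naming each list by hand. This is a minimal sketch rather than a definitive implementation: it reads from an in-memory io.StringIO stand-in for data.csv, and it assumes every field is an integer.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for data.csv so the sketch runs without a file on disk.
sample = io.StringIO("ColA, ColB, ColC\n1,2,3\n4,5,6\n7,8,9\n")

columns = defaultdict(list)  # maps column name -> list of values
reader = csv.DictReader(sample, skipinitialspace=True)
for row in reader:
    for name, value in row.items():
        columns[name].append(int(value))  # assumption: all fields are integers

dataA = columns["ColA"]
print(dict(columns))
```

For a real file, replace the StringIO object with the file handle from open('data.csv', 'r', newline='').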
