Access a CSV file with a metadata header only once using pd.read_csv() in Python

I have to work with a CSV file on a remote server, so accessing it takes a very long time.

The first 8 rows of my CSV file contain a kind of header formatted as key : value. Then, on the ninth line, come the column names, formatted as in a usual CSV file.

Since accessing the file is slow, I want to open it only once, but I don't know how to do it. From what I understand, pd.read_csv() only takes a file as input, not just its content. This is where I am at the moment:

import pandas as pd

with open(r'myFile.csv', "r", encoding = "utf-8") as file:

    header = file.readlines()[:8]

    metaData = [value.split(':') for value in header]
    metaData = {value[0].strip() : value[1].strip() for value in metaData}

    data = pd.read_csv(file, sep=';', header = 8)

And the associated error message :

EmptyDataError: No columns to parse from file

Edit with a sample input csv file :

key1:value1
key2:value2
key3:value3
key4:value4
key5:value5
key6:value6
key7:value7
key8:value8
column1;column2;column3
values;values;values
values;values;values
values;values;values
values;values;values

Currently, your code reads the entire file when you retrieve the header. After that, the file pointer is at the end of the file, so pandas won't get anything more from the file. The trick is to only read the first 8 lines when you want the header, and then pass the partially-read file pointer to pd.read_csv, which will read the rest of it. Here's a simple change to your code to do that:

import pandas as pd

with open(r'myFile.csv', "r", encoding = "utf-8") as file:

    header = [file.readline() for x in range(8)]

    metaData = [value.split(':') for value in header]
    metaData = {value[0].strip() : value[1].strip() for value in metaData}

    data = pd.read_csv(file, sep=';')
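
The file-pointer behaviour described above can be checked without touching the remote file. The following is only a minimal sketch: it uses io.StringIO as a stand-in for the opened file (an assumption made for the demo), with text that mirrors the sample input from the question. It shows that readlines() exhausts the stream, while eight readline() calls leave the rest for pd.read_csv.

```python
import io

import pandas as pd

# Stand-in for the remote file: same layout as the sample in the question.
SAMPLE = (
    "".join(f"key{i}:value{i}\n" for i in range(1, 9))
    + "column1;column2;column3\n"
    + "values;values;values\n" * 4
)

# readlines() consumes the whole stream, so nothing is left afterwards.
exhausted = io.StringIO(SAMPLE)
exhausted.readlines()
leftover = exhausted.read()  # empty string: the pointer is at the end

# Eight readline() calls stop right before the column row.
stream = io.StringIO(SAMPLE)
header = [stream.readline() for _ in range(8)]

meta_data = {}
for line in header:
    # split(":", 1) keeps a value intact even if it contains a colon
    key, value = line.split(":", 1)
    meta_data[key.strip()] = value.strip()

# pandas starts reading at the current position of the stream.
data = pd.read_csv(stream, sep=";")
```

With a real file, the same pattern applies inside the with open(...) block; only the StringIO object is replaced by the file handle.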

You can build a nested list and pass it to the DataFrame constructor:

import pandas as pd

with open(r'myFile.csv', "r", encoding = "utf-8") as file:

    # read all lines into a list (the file is read only once)
    data = file.readlines()

    # the first 8 lines hold the metadata
    metaData = [value.split(':') for value in data[:8]]
    metaData = {value[0].strip() : value[1].strip() for value in metaData}

    # the remaining lines are the column names followed by the rows
    datas = [value.strip().split(';') for value in data[8:]]
    print (datas)
    # [['column1', 'column2', 'column3'],
    #  ['values', 'values', 'values'],
    #  ['values', 'values', 'values'],
    #  ['values', 'values', 'values'],
    #  ['values', 'values', 'values']]

    df = pd.DataFrame(datas[1:], columns=datas[0])
    print (df)
    #   column1 column2 column3
    # 0  values  values  values
    # 1  values  values  values
    # 2  values  values  values
    # 3  values  values  values
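
One limitation of splitting by hand is that every cell ends up as a string, whereas pd.read_csv infers column types. A possible variation, sketched below under the same assumptions about the file layout, reads the content once and hands the tabular part to pd.read_csv through io.StringIO. The helper name split_header_and_table and the sample content are made up for illustration.

```python
import io

import pandas as pd


def split_header_and_table(text, n_header=8, sep=";"):
    """Parse n_header 'key:value' lines, then the rest as a CSV table."""
    lines = text.splitlines(keepends=True)

    meta_data = {}
    for line in lines[:n_header]:
        key, value = line.split(":", 1)
        meta_data[key.strip()] = value.strip()

    # Wrapping the remaining text in StringIO makes it look like a file.
    table = pd.read_csv(io.StringIO("".join(lines[n_header:])), sep=sep)
    return meta_data, table


# Hypothetical content; in practice it would come from one file.read() call.
content = "key1:a\nkey2:b\ncol1;col2\n1;2.5\n3;4.5\n"
meta_data, table = split_header_and_table(content, n_header=2)
```

With the real file, content would be the result of file.read() inside the with open(...) block, and n_header would stay at its default of 8.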

I must be missing something in the question. Could you not use the following?

import pandas as pd

df = pd.read_csv('maxime.csv', sep=';', skiprows=7, header=1)
print (df)

Result is:

  column1 column2 column3
0  values  values  values
1  values  values  values
2  values  values  values
3  values  values  values
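
For reference, skiprows=7 with header=1 drops seven lines, then treats the second remaining line as the header and discards the one before it (key8:value8). A quick check suggests that skiprows=8 on its own gives the same table. The sketch uses an in-memory copy of the sample input as a stand-in for the file, which is an assumption made for the demo. Note that neither call recovers the metadata.

```python
import io

import pandas as pd

# In-memory copy of the sample input from the question.
SAMPLE = (
    "".join(f"key{i}:value{i}\n" for i in range(1, 9))
    + "column1;column2;column3\n"
    + "values;values;values\n" * 4
)

# Skip 7 lines, then use the second remaining line as the header.
df_a = pd.read_csv(io.StringIO(SAMPLE), sep=";", skiprows=7, header=1)

# Skip all 8 metadata lines; the next line is the header by default.
df_b = pd.read_csv(io.StringIO(SAMPLE), sep=";", skiprows=8)

same = df_a.equals(df_b)
```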

Another method I just found, because I needed to check that the columns exist on line 9:

import pandas as pd

with open(r'myFile.csv', "r", encoding = "utf-8") as file:

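    init = file.tell()
    header = file.readlines()[:9]
    file.seek(init)  # rewind so pd.read_csv sees the file from the start

    # only the first 8 lines are key:value pairs; header[8] is the column row
    metaData = [value.split(':') for value in header[:8]]
    metaData = {value[0].strip() : value[1].strip() for value in metaData}

    data = pd.read_csv(file, sep=';', header = 8)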
