Pandas read_csv throws ValueError while reading gzip file

I am trying to read a gzip file using pandas.read_csv like so:

import pandas as pd
df = pd.read_csv("data.ZIP.gz", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True)

But it throws this error:

ValueError: Passed header names mismatches usecols

However, if I manually extract the zip file from the gz file, then read_csv is able to read the data without errors:

df = pd.read_csv("data.ZIP", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True)

Since I have to read a lot of these files, I don't want to extract them manually. So, how can I fix this error?

You have two levels of compression, gzip and zip, but pandas knows how to work with only one level of compression.
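To confirm that a file really is compressed twice, one option is to look at the leading "magic" bytes of each layer: gzip data starts with 1f 8b and a zip archive with at least one entry starts with PK\x03\x04. The following is only a minimal sketch; the names compression_layers and make_sample are hypothetical, and the sample archive is built in memory purely for illustration.

```python
import gzip
import io
import zipfile

GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of any gzip stream
ZIP_MAGIC = b"PK\x03\x04"     # local file header of a non-empty zip archive


def compression_layers(raw: bytes) -> list:
    """Return the compression layers found in raw, from the outside in."""
    layers = []
    while True:
        if raw.startswith(GZIP_MAGIC):
            layers.append("gzip")
            raw = gzip.decompress(raw)
        elif raw.startswith(ZIP_MAGIC):
            layers.append("zip")
            with zipfile.ZipFile(io.BytesIO(raw)) as zf:
                # Only the first member is inspected (an assumption of this sketch).
                raw = zf.read(zf.namelist()[0])
        else:
            return layers


def make_sample() -> bytes:
    """Build a tiny CSV that is zipped and then gzipped, all in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("data.csv", "a,b\n1,2\n")
    return gzip.compress(buf.getvalue())


print(compression_layers(make_sample()))  # ['gzip', 'zip']
```

For a real file, the bytes could be read with open(path, "rb").read() and passed to the same function.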

You can use the gzip and zipfile modules with io.BytesIO to extract it to a file-like object in memory.


Here is minimal working code.

It can be useful if the zip contains many files and you want to select which one to extract.

import pandas as pd
import gzip
import zipfile
import io

with gzip.open('data.csv.zip.gz') as f1:
    data = f1.read()

file_like_object_1 = io.BytesIO(data)

with zipfile.ZipFile(file_like_object_1) as f2:
    #print([x.filename for x in f2.filelist])  # list all filenames
    #data = f2.read('data.csv')                # extract selected filename
    #data = f2.read(f2.filelist[0])            # extract first file
    data = f2.read(f2.filelist[0].filename)    # extract first file

file_like_object_2 = io.BytesIO(data)

df = pd.read_csv(file_like_object_2)

print(df)

But if the zip has only one file, then you can let read_csv extract it. You need to add the option compression='zip' because a file-like object has no filename, so read_csv can't use the filename's extension to recognize the compression.

import pandas as pd
import gzip
import io

with gzip.open('data.csv.zip.gz') as f1:
    data = f1.read()

file_like_object_1 = io.BytesIO(data)

df = pd.read_csv(file_like_object_1, compression='zip')

print(df)
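Since the question involves many such files, the steps above could be wrapped in a small helper. This is only a sketch: the name open_gz_zip_member and its member parameter are hypothetical, and it assumes each archive fits in memory. It returns a file-like object that can be passed straight to read_csv.

```python
import gzip
import io
import zipfile


def open_gz_zip_member(path, member=None):
    """Return an in-memory file object for one file inside a .zip.gz archive.

    If member is None, the first file listed in the zip is used.
    """
    with gzip.open(path, "rb") as gz:
        zip_bytes = gz.read()                      # strip the gzip layer
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        name = member if member is not None else zf.namelist()[0]
        return io.BytesIO(zf.read(name))           # strip the zip layer


# Possible usage with pandas, reusing the options from the question
# (the glob pattern is an assumption about how the files are named):
#
# import glob
# import pandas as pd
# frames = [
#     pd.read_csv(open_gz_zip_member(p), usecols=[*range(0, 39)],
#                 encoding="latin1", skipinitialspace=True)
#     for p in sorted(glob.glob("*.ZIP.gz"))
# ]
```

Because the helper already returns uncompressed CSV bytes, no compression option is needed in the read_csv call.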

Use the gzip module to decompress all your files, something like this. It removes only the gzip layer; the resulting zip files can then be read with read_csv, as in the question.

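import gzip

# list_file_names is assumed to be a list of paths ending in ".gz"
for file in list_file_names:
    file_name = file[:-3]  # strip only the trailing ".gz"
    with gzip.open(file, 'rb') as f:
        file_content = f.read()
    with open(file_name, "wb") as r:
        r.write(file_content)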
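The loop above reads each file completely into memory before writing it out. For large archives, a streaming variant may be preferable. The sketch below uses shutil.copyfileobj for that; the function name gunzip_to_file is hypothetical.

```python
import gzip
import shutil


def gunzip_to_file(src_path, dst_path):
    """Decompress the gzip file src_path into dst_path in chunks."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj copies in fixed-size chunks instead of one big read()
        shutil.copyfileobj(src, dst)
```

For example, gunzip_to_file("data.ZIP.gz", "data.ZIP") would produce the zip file that read_csv was able to read in the question.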

You can use the zipfile module, for example:

import zipfile
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)
