
Download zipped csv file from server link and read into pandas

I have been trying to download a zipped CSV from a server host URL using the requests library.

When I download a smaller file that does not require compression from the same server, the CSV reads in without a problem, but with this one I get encoding errors.

I have tried multiple types of encoding, reading it in as a pandas CSV, and reading it in as a zip file and opening it (at which point I get an error that the file is not a zip file).

I have additionally tried using the zipfile library as suggested here: Reading csv zipped files in python

and have also tried setting both encoding and compression in read_csv.

The code that works for the non-zipped server file is below:

response = requests.get(url, auth=HTTPBasicAuth(un, pw), stream=True, verify = False)
dfs = pd.read_csv(response.raw)

but it returns 'utf-8' codec can't decode byte 0xfd in position 0: invalid start byte when used for this file.

I have also tried:

request = get(url, auth=HTTPBasicAuth(un, pw), stream=True, verify=False)
zip_file = ZipFile(BytesIO(request.content))
files = zip_file.namelist()
with gzip.open(files[0], 'rb') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print(row)

which returns a seek attribute error.
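One thing worth checking, given the "not a zip file" error and the 0xfd byte at position 0, is what format the server actually sends. The sketch below guesses the compression format from the first bytes of a payload. The name `sniff_compression` and the `MAGIC_NUMBERS` table are illustrative and not part of any library, and the demo uses in-memory data rather than the server in question. A leading 0xfd byte is consistent with the xz magic number, though that is only a guess about this particular file.

```python
import bz2
import gzip
import io
import lzma
import zipfile

# Leading "magic" bytes of some common compressed formats.
MAGIC_NUMBERS = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
    b"BZh": "bz2",
    b"\xfd7zXZ\x00": "xz",
}


def sniff_compression(payload):
    """Guess the compression format of a bytes payload from its first bytes."""
    for magic, name in MAGIC_NUMBERS.items():
        if payload.startswith(magic):
            return name
    return "unknown"


# Demo on in-memory data, so no server or credentials are needed.
csv_bytes = b"a,b\n1,2\n"

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("data.csv", csv_bytes)

print(sniff_compression(zip_buffer.getvalue()))     # zip
print(sniff_compression(gzip.compress(csv_bytes)))  # gzip
print(sniff_compression(bz2.compress(csv_bytes)))   # bz2
print(sniff_compression(lzma.compress(csv_bytes)))  # xz
print(sniff_compression(csv_bytes))                 # unknown
```

With a real response, `sniff_compression(response.content)` would indicate which reader to try. If it reports xz, the standard library's lzma module or the compression='xz' option of read_csv may be a better fit than ZipFile.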

Here is one way to do it:

import pandas as pd
import requests
from requests.auth import HTTPBasicAuth
from zipfile import ZipFile
import io

# Example dataset
url = 'https://www.stats.govt.nz/assets/Uploads/Retail-trade-survey/Retail-trade-survey-September-2020-quarter/Download-data/retail-trade-survey-september-2020-quarter-csv.zip'

# un and pw are the username and password from the question (not defined here)
response = requests.get(url, auth=HTTPBasicAuth(un, pw), stream=True, verify=False)
with ZipFile(io.BytesIO(response.content)) as myzip:
    with myzip.open(myzip.namelist()[0]) as myfile:
        df = pd.read_csv(myfile)

print(df)

If you want to read a specific csv in a multiple-csv zip file, replace myzip.namelist()[0] with the name of the file you want to read. If you don't know its name, you can list the zip file contents with print(ZipFile(io.BytesIO(response.content)).namelist()).
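As a minimal sketch of picking one member out of a multiple-csv archive, the example below builds a small zip in memory, lists its CSV members, and reads one of them by name. The helper names `list_csv_members` and `read_csv_member` and the file names are made up for illustration. It uses the standard library's csv module so that it runs without a server, but the opened member could be passed to pd.read_csv instead, as in the code above.

```python
import csv
import io
import zipfile


def list_csv_members(zip_bytes):
    """Return the names of all .csv members in a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [name for name in archive.namelist() if name.lower().endswith(".csv")]


def read_csv_member(zip_bytes, member):
    """Read one named member of the archive and return its rows as lists of strings."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            # Members open in binary mode; wrap in text mode for csv.reader.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.reader(text))


# Build a small archive in memory to stand in for response.content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sales.csv", "region,total\nnorth,10\nsouth,20\n")
    archive.writestr("notes.txt", "not a csv")
    archive.writestr("stores.csv", "store,city\n1,Auckland\n")
zip_bytes = buffer.getvalue()

print(list_csv_members(zip_bytes))  # ['sales.csv', 'stores.csv']
print(read_csv_member(zip_bytes, "stores.csv"))  # [['store', 'city'], ['1', 'Auckland']]
```

Filtering on the .csv suffix is only a convenience here; an archive may also contain folders or files whose names differ in case, so checking the full namelist() output first is the safer habit.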
