
Read a specific CSV file from a zip using pandas

Here is the data I am interested in:

http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip

It consists of 3 files: Production_Crops_E_All_Data.csv, Production_Crops_E_All_Data_NOFLAG.csv, and Production_Crops_E_Flags.csv.

I want to download the zip with pandas and create a DataFrame from one file, Production_Crops_E_All_Data.csv:

import pandas as pd
url="http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip"
df=pd.read_csv(url)

Pandas can download files, it can work with zips, and of course it can work with CSV files. But how can I work with one specific file in an archive that contains many files?

Now I get an error:

ValueError: ('Multiple files found in compressed zip file %s)

This post doesn't answer my question because I have multiple files in one zip: Read a zipped file as a pandas DataFrame

You could use Python's datatable, which is a reimplementation of R's data.table in Python.

Read in data:

from datatable import fread

# The exact file to extract is known, so append it to the zip name:
url = "Production_Crops_E_All_Data.zip/Production_Crops_E_All_Data.csv"

df = fread(url)

# Convert to pandas
df = df.to_pandas()

You can equally work within datatable. Note, however, that it is not as feature-rich as pandas, but it is a powerful and very fast tool.

Update: You can use the zipfile module as well:

from zipfile import ZipFile
from io import BytesIO

import pandas as pd

# Open the downloaded archive by its file name (not the datatable-style path above)
with ZipFile("Production_Crops_E_All_Data.zip") as myzip:
    with myzip.open("Production_Crops_E_All_Data.csv") as myfile:
        data = myfile.read()

# Read data into pandas.
# Had to toy a bit with the encoding;
# thankfully it is a known issue on SO:
# https://stackoverflow.com/a/51843284/7175713
df = pd.read_csv(BytesIO(data), encoding="iso-8859-1", low_memory=False)
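If the member names are not known up front, ZipFile.namelist() lists them, and the handle returned by ZipFile.open() can be passed straight to pd.read_csv. Here is a minimal sketch of that idea. It builds a tiny archive in memory so it runs without a download, and the member names (crops.csv, flags.csv) and their contents are made up for illustration.

```python
import io
import zipfile

import pandas as pd

# Build a small in-memory archive with two members (hypothetical names and data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("crops.csv", "Area,Value\nFrance,10\nSpain,20\n")
    archive.writestr("flags.csv", "Flag,Meaning\nA,Official\n")
buffer.seek(0)

with zipfile.ZipFile(buffer) as archive:
    # Names of every file stored in the archive, in the order they were added.
    members = archive.namelist()
    # Open only the member we care about and hand the file object to pandas.
    with archive.open("crops.csv") as member:
        df = pd.read_csv(member)

print(members)
print(df)
```

With the real archive, the same pattern should apply by opening the downloaded zip file instead of the in-memory buffer and adding the encoding argument shown above.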

From this link

EDIT: updated for Python 3, StringIO to io.StringIO.

EDIT: updated the import of urllib and changed the usage of StringIO to BytesIO. Also, your CSV files are not UTF-8 encoded; I tried latin1 and that worked.

Try this:

from zipfile import ZipFile
import io
from urllib.request import urlopen
import pandas as pd

r = urlopen("http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip").read()
file = ZipFile(io.BytesIO(r))
data_df = pd.read_csv(file.open("Production_Crops_E_All_Data.csv"), encoding='latin1')
data_df_noflags = pd.read_csv(file.open("Production_Crops_E_All_Data_NOFLAG.csv"), encoding='latin1')
data_df_flags = pd.read_csv(file.open("Production_Crops_E_Flags.csv"), encoding='latin1')
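On the encoding point, the difference between utf-8 and latin1 can be checked in isolation. The sketch below uses a made-up sample row, not data taken from the FAO files: it encodes the row as Latin-1 and shows that decoding the bytes as UTF-8 fails while decoding as latin1 succeeds. In Python, latin1 and iso-8859-1 are aliases for the same codec.

```python
# A hypothetical CSV row containing a non-ASCII character,
# stored the way a Latin-1 encoded file would store it.
raw = "Côte d'Ivoire,10\n".encode("latin1")

# Decoding as UTF-8 fails because the byte for "ô" is not valid UTF-8 here.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as Latin-1 maps every byte to a character, so it cannot fail.
text = raw.decode("latin1")

print(utf8_ok)
print(text)
```

Because latin1 never raises, it will also silently accept files that are in some other encoding, so it is worth eyeballing a few non-ASCII values after loading.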

Hope this helps!
