
Reading only .csv file within a .zip from URL with Pandas?

A .csv file is contained within a .zip file at a URL, and I am trying to read it into a Pandas DataFrame. I don't want to download the .zip file to disk but rather read the data directly from the URL. I realize that pandas.read_csv() can only do this if the .csv file is the only file contained in the .zip. However, when I run this:

import pandas as pd

# specify zipped comma-separated values url
zip_csv_url = 'http://www12.statcan.gc.ca/census-recensement/2016/geo/ref/gaf/files-fichiers/2016_92-151_XBB_csv.zip'
df1 = pd.read_csv(zip_csv_url)

I get this:

ValueError: Multiple files found in compressed zip file ['2016_92-151_XBB.csv', '92-151-g2016001-eng.pdf', '92-151-g2016001-fra.pdf']

The contents of the .zip appear to be arranged as a list. I'm wondering how I can read the only .csv file in the .zip into the new DataFrame (df1); the .zip file from the URL I will be using will only ever have one .csv file in it. Thanks!

NB

The corresponding .zip file from a separate URL, which contains shapefiles, reads without problems with geopandas.read_file() when I run this code:

import geopandas as gpd

# specify zipped shapefile url
zip_shp_url = 'http://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/2016/ldb_000b16a_e.zip'
gdf1 = gpd.read_file(zip_shp_url)

This works even though the .zip also contains a .pdf file.

It would appear that geopandas.read_file() is able to read only the shapefiles required to create the GeoDataFrame while ignoring unnecessary data files. Since it is based on Pandas, shouldn't Pandas also have functionality to read only a .csv within a .zip that contains multiple other file types? Any thoughts?

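import os
import zipfile
from io import BytesIO
from urllib.request import urlopen

import pandas as pd

# Placeholders: replace these three values with your own
zip_url = 'YOUR_ZIP_LINK'
directory_to_extract_to = 'YOUR_DESTINATION_FOLDER'
file = 'YOUR_csv_FILE_NAME'

# Download the archive into memory and open it as a zip file
resp = urlopen(zip_url)
with zipfile.ZipFile(BytesIO(resp.read())) as zip_ref:
    # zip_ref.namelist() lists the files contained in the archive
    zip_ref.extract(file, directory_to_extract_to)

# Join folder and file name so a missing trailing separator does not break the path
df1 = pd.read_csv(os.path.join(directory_to_extract_to, file))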
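
The answer above writes the extracted file to a folder. Since the question asks to avoid touching the disk, here is a minimal sketch of an in-memory variant, assuming the archive always holds exactly one .csv member. The helper names read_single_csv_from_zip and read_single_csv_from_zip_url are hypothetical and not part of pandas; the code relies only on zipfile, urllib and pandas.read_csv accepting a file-like object.

```python
import zipfile
from io import BytesIO
from urllib.request import urlopen

import pandas as pd


def read_single_csv_from_zip(zip_bytes, **read_csv_kwargs):
    """Read the only .csv member of an in-memory zip archive into a DataFrame.

    Hypothetical helper: other members (for example .pdf files) are ignored.
    """
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        # Keep only the members whose name ends in .csv
        csv_names = [name for name in archive.namelist()
                     if name.lower().endswith('.csv')]
        if len(csv_names) != 1:
            raise ValueError(
                f'Expected exactly one .csv member, found: {csv_names}')
        # Open the member as a file-like object; nothing is written to disk
        with archive.open(csv_names[0]) as csv_file:
            return pd.read_csv(csv_file, **read_csv_kwargs)


def read_single_csv_from_zip_url(url, **read_csv_kwargs):
    """Download a zip archive into memory and read its only .csv member."""
    with urlopen(url) as resp:
        return read_single_csv_from_zip(resp.read(), **read_csv_kwargs)
```

With the URL from the question this would be called as df1 = read_single_csv_from_zip_url(zip_csv_url). Extra keyword arguments are passed through to pandas.read_csv, which may be useful if the file needs a specific encoding or separator.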
