Read a zipped file as a pandas DataFrame

I'm trying to unzip a CSV file and pass it into pandas so I can work on the file.
The code I have tried so far is:

import requests, zipfile, StringIO
r = requests.get('http://data.octo.dc.gov/feeds/crime_incidents/archive/crime_incidents_2013_CSV.zip')
z = zipfile.ZipFile(StringIO.StringIO(r.content))
crime2013 = pandas.read_csv(z.read('crime_incidents_2013_CSV.csv'))

After the last line, although Python is able to fetch the file, I get an error that ends with "does not exist".

Can someone tell me what I'm doing incorrectly?

If you want to read a zipped or a tar.gz file into a pandas DataFrame, the read_csv method handles the decompression itself.

df = pd.read_csv('filename.zip')

Or the long form:

df = pd.read_csv('filename.zip', compression='zip', header=0, sep=',', quotechar='"')

Description of the compression argument from the docs:

compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer' and filepath_or_buffer is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no decompression). If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression.

New in version 0.18.1: support for 'zip' and 'xz' compression.
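
As a minimal sketch of the inference rule quoted above (the file names and sample data are made up for illustration), the following writes the same small CSV as a .zip and as a .gz into a temporary directory and reads both back without passing compression:

```python
import gzip
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical sample data, not taken from the crime data set.
csv_text = "a,b\n1,2\n3,4\n"

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "sample.zip")
    gz_path = os.path.join(tmp, "sample.csv.gz")

    # A ZIP archive holding exactly one data file, as the docs require.
    with zipfile.ZipFile(zip_path, "w") as z:
        z.writestr("sample.csv", csv_text)

    # The same data, gzip-compressed.
    with gzip.open(gz_path, "wt", newline="") as f:
        f.write(csv_text)

    # compression defaults to 'infer', so the file extension decides.
    df_zip = pd.read_csv(zip_path)
    df_gz = pd.read_csv(gz_path)

# Both reads should yield the same frame.
print(df_zip.equals(df_gz))
```

Passing compression='zip' or compression='gzip' explicitly should behave the same way here; it only becomes necessary when the extension does not reveal the format.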

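I think you want to open the member of the ZipFile, which returns a file-like object, rather than read it. read returns the file's contents as a string, and read_csv treats a string argument as a path, hence the "does not exist" error: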

In [11]: crime2013 = pd.read_csv(z.open('crime_incidents_2013_CSV.csv'))

In [12]: crime2013
Out[12]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24567 entries, 0 to 24566
Data columns (total 15 columns):
CCN                            24567  non-null values
REPORTDATETIME                 24567  non-null values
SHIFT                          24567  non-null values
OFFENSE                        24567  non-null values
METHOD                         24567  non-null values
LASTMODIFIEDDATE               24567  non-null values
BLOCKSITEADDRESS               24567  non-null values
BLOCKXCOORD                    24567  non-null values
BLOCKYCOORD                    24567  non-null values
WARD                           24563  non-null values
ANC                            24567  non-null values
DISTRICT                       24567  non-null values
PSA                            24567  non-null values
NEIGHBORHOODCLUSTER            24263  non-null values
BUSINESSIMPROVEMENTDISTRICT    3613  non-null values
dtypes: float64(4), int64(1), object(10)
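
To make the difference between read and open concrete, here is a small Python 3 sketch that does not need the download. The archive is built in memory, and the member name sample.csv and its contents are placeholders rather than the real crime data:

```python
import io
import zipfile

import pandas as pd

# Build a small ZIP archive in memory (hypothetical data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("sample.csv", "x,y\n1,2\n3,4\n")

with zipfile.ZipFile(buf) as z:
    # read() returns the member's contents as bytes, not a file object.
    raw = z.read("sample.csv")
    print(type(raw).__name__)

    # open() returns a file-like object that read_csv can consume.
    with z.open("sample.csv") as f:
        df = pd.read_csv(f)

# If only the bytes are at hand, wrapping them in BytesIO also works.
df_from_bytes = pd.read_csv(io.BytesIO(raw))
print(df.equals(df_from_bytes))
```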

It seems you don't even have to specify the compression any more. The following snippet loads the data from filename.zip into df.

import pandas as pd
df = pd.read_csv('filename.zip')

(Of course you will need to specify separator, header, etc. if they are different from the defaults.)

For zip files, you can use the zipfile module, and your code will work with just these lines:

import zipfile
import pandas as pd
with zipfile.ZipFile("Crime_Incidents_in_2013.zip") as z:
    with z.open("Crime_Incidents_in_2013.csv") as f:
        train = pd.read_csv(f, header=0, delimiter="\t")
        print(train.head())    # print the first 5 rows

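And the result will be as follows. Note that each row lands in a single column: the output shows comma-separated fields, so delimiter="\t" does not split them, whereas the default sep=',' would.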

X,Y,CCN,REPORT_DAT,SHIFT,METHOD,OFFENSE,BLOCK,XBLOCK,YBLOCK,WARD,ANC,DISTRICT,PSA,NEIGHBORHOOD_CLUSTER,BLOCK_GROUP,CENSUS_TRACT,VOTING_PRECINCT,XCOORD,YCOORD,LATITUDE,LONGITUDE,BID,START_DATE,END_DATE,OBJECTID
0  -77.054968548763071,38.899775938598317,0925135...                                                                                                                                                               
1  -76.967309569035052,38.872119553647011,1003352...                                                                                                                                                               
2  -76.996184958456539,38.927921847721443,1101010...                                                                                                                                                               
3  -76.943077541353617,38.883686046653935,1104551...                                                                                                                                                               
4  -76.939209158039446,38.892278093281632,1125028...
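
The printed result above keeps each row as one comma-joined string, which suggests the file is comma-separated rather than tab-separated. The sketch below uses the default separator and also shows one way to handle an archive with more than one member, which read_csv's own 'zip' support does not cover. The member names and values are invented for illustration:

```python
import io
import zipfile

import pandas as pd

# Hypothetical archive with two comma-separated members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("part1.csv", "X,Y,CCN\n-77.05,38.90,1\n")
    z.writestr("part2.csv", "X,Y,CCN\n-76.97,38.87,2\n")

frames = {}
with zipfile.ZipFile(buf) as z:
    for name in z.namelist():
        if name.endswith(".csv"):
            with z.open(name) as f:
                # Default sep=',' matches the data, so the columns split.
                frames[name] = pd.read_csv(f)

# Stack the per-member frames into one DataFrame.
combined = pd.concat(list(frames.values()), ignore_index=True)
print(combined.columns.tolist())
```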

I guess what you're looking for is the following:

from io import BytesIO
import requests
import pandas as pd

result = requests.get("https://www.xxx.zzz/file.zip")
df = pd.read_csv(BytesIO(result.content),compression='zip', header=0, sep=',', quotechar='"')

Read this article to understand why: https://medium.com/dev-bits/ultimate-guide-for-working-with-io-streams-and-zip-archives-in-python-3-6f3cf96dca50
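
The same idea can be tried offline. In this sketch the downloaded bytes are replaced by a ZIP archive built locally (the content variable stands in for result.content, and the data is hypothetical), assuming a reasonably recent pandas that accepts a compressed in-memory buffer:

```python
import io
import zipfile

import pandas as pd

# Stand-in for `result.content`: ZIP bytes built locally (hypothetical data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("file.csv", 'id,name\n1,"Smith, J"\n2,"Doe, A"\n')
content = buf.getvalue()

# A buffer has no file extension to infer from, so the compression
# is stated explicitly.
df = pd.read_csv(io.BytesIO(content), compression="zip")
print(df)
```

The quoted names contain commas, and the default quotechar='"' keeps each of them in one field.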

Please follow this link: https://www.kaggle.com/jboysen/quick-gz-pandas-tutorial

import pandas as pd
traffic_station_df = pd.read_csv('C:\\Folders\\Jupiter_Feed.txt.gz', compression='gzip',
                                 header=1, sep='\t', quotechar='"')

#traffic_station_df['Address'] = 'address'

#traffic_station_df.append(traffic_station_df)
print(traffic_station_df)
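
For readers without that file, here is a hedged, self-contained variant of the same call. The feed contents are invented, and the layout (a title line followed by a header row) is only an assumption about why header=1 might be used:

```python
import gzip
import io

import pandas as pd

# Hypothetical feed: a title line, then a header row, then tab-separated data.
text = "Jupiter feed export\nstation\tcount\nA\t10\nB\t20\n"
compressed = gzip.compress(text.encode("utf-8"))

# header=1 takes the second line as column names and skips the line before it.
df = pd.read_csv(io.BytesIO(compressed), compression="gzip",
                 header=1, sep="\t")
print(df.columns.tolist())
```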
