
Open a csv.gz file in Python and print first 100 rows

I'm trying to get only the first 100 rows of a csv.gz file that has over 4 million rows in Python. I also want information on the number of columns and the header of each. How can I do this?

I looked at "python: read lines from compressed text files" to figure out how to open the file, but I'm struggling to figure out how to actually print the first 100 rows and get some metadata on the information in the columns.

I found "Read first N lines of a file in python", but I'm not sure how to marry this to opening the csv.gz file and reading it without saving an uncompressed csv file.

I have written this code:

import gzip
import csv
import json
import pandas as pd


df = pd.read_csv('google-us-data.csv.gz', compression='gzip', header=0,    sep=' ', quotechar='"', error_bad_lines=False)
for i in range (100):
    print df.next()

I'm new to Python and I don't understand the results. I'm sure my code is wrong and I've been trying to debug it, but I don't know which documentation to look at.

I get these results (and it keeps going down the console; this is an excerpt):

Skipping line 63: expected 3 fields, saw 7
Skipping line 64: expected 3 fields, saw 7
Skipping line 65: expected 3 fields, saw 7
Skipping line 66: expected 3 fields, saw 7
Skipping line 67: expected 3 fields, saw 7
Skipping line 68: expected 3 fields, saw 7
Skipping line 69: expected 3 fields, saw 7
Skipping line 70: expected 3 fields, saw 7
Skipping line 71: expected 3 fields, saw 7
Skipping line 72: expected 3 fields, saw 7

Pretty much what you've already done, except read_csv also has an nrows parameter where you can specify the number of rows you want from the data set.

Additionally, to prevent the errors you were getting, you can set error_bad_lines to False. You'll still get warnings (if that bothers you, set warn_bad_lines to False as well). These are there to indicate inconsistency in how your dataset is filled out. Note that pandas 1.3 deprecated both flags in favor of on_bad_lines, and pandas 2.0 removed them.

import pandas as pd
data = pd.read_csv('google-us-data.csv.gz', nrows=100, compression='gzip',
                   error_bad_lines=False)
print(data)
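To also answer the part of the question about column count and headers, here is a minimal sketch. It assumes a small throwaway sample file (the file name and its contents are made up for illustration) so that it runs on its own; preview_csv_gz is a hypothetical helper name, not part of pandas.

```python
import gzip
import os
import tempfile

import pandas as pd


def preview_csv_gz(path, n=100):
    """Return the first n data rows of a gzip-compressed CSV as a DataFrame."""
    # nrows limits parsing to the first n data rows after the header.
    return pd.read_csv(path, nrows=n, compression='gzip')


# Build a throwaway sample file (hypothetical data) so the sketch is runnable.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv.gz')
with gzip.open(sample_path, 'wt', newline='') as f:
    f.write('id,name,score\n')
    for i in range(500):
        f.write('{},user{},{}\n'.format(i, i, i * 2))

data = preview_csv_gz(sample_path)
print(data.shape[1])       # number of columns
print(list(data.columns))  # header names
print(data.head())         # first few of the 100 rows
```

With your own file you would pass 'google-us-data.csv.gz' instead of the sample path.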

You can easily do something similar with the built-in csv library, but it'll require a for loop to iterate over the data, as shown in other examples.
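A rough sketch of that csv-based loop, using only the standard library. It assumes a comma-delimited file whose first row is the header; the sample file and its contents are invented here purely so the snippet runs by itself.

```python
import csv
import gzip
import os
import tempfile

# Build a throwaway sample file (hypothetical data) so the sketch is runnable.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv.gz')
with gzip.open(sample_path, 'wt', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'name', 'score'])
    for i in range(500):
        writer.writerow([i, 'user{}'.format(i), i * 2])

rows = []
# 'rt' opens the compressed file in text mode; nothing is written to disk.
with gzip.open(sample_path, 'rt', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # the first row holds the column names
    for i, row in enumerate(reader):
        if i >= 100:       # stop once 100 data rows have been collected
            break
        rows.append(row)

print(len(header), header)  # column count and header names
for row in rows:
    print(row)
```

Because the loop breaks early, the remaining rows of a large file are never parsed.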

I think you could do something like this (from the gzip module examples):

import gzip
with gzip.open('/home/joe/file.txt.gz', 'rb') as f:
    header = f.readline()
    # Read lines any way you want now. 

The first answer you linked suggests using gzip.GzipFile - this gives you a file-like object that decompresses for you on the fly.

Now you just need some way to parse csv data out of a file-like object ... like csv.reader.

The first row that csv.reader yields is the header row, so you know the columns, their names, and how many there are (csv.DictReader exposes the same list as its fieldnames attribute).

Then you need to get the first 100 csv row objects, which will work exactly like in the second question you linked, and each of those 100 objects will be a list of fields.

So far this is all covered in your linked questions, apart from knowing about the existence of the csv module, which is listed in the library index.
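Put together, those steps might look like the sketch below. It swaps in csv.DictReader (for its fieldnames attribute) and itertools.islice (to take the first 100 rows), so each row comes back as a dict rather than a list. head_csv_gz is a hypothetical helper name, and the sample file is made up so the snippet runs on its own.

```python
import csv
import gzip
import itertools
import os
import tempfile


def head_csv_gz(path, n=100):
    """Return (fieldnames, first n rows as dicts) of a gzip-compressed CSV."""
    # gzip.open in 'rt' mode yields a text file-like object that
    # decompresses on the fly, so no uncompressed copy is saved.
    with gzip.open(path, 'rt', newline='') as f:
        reader = csv.DictReader(f)
        rows = list(itertools.islice(reader, n))
        return reader.fieldnames, rows


# Build a throwaway sample file (hypothetical data) so the sketch is runnable.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv.gz')
with gzip.open(sample_path, 'wt', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'name', 'score'])
    for i in range(500):
        writer.writerow([i, 'user{}'.format(i), i * 2])

fieldnames, rows = head_csv_gz(sample_path)
print(len(fieldnames), fieldnames)  # column count and names
for row in rows[:5]:
    print(row)
```

islice stops pulling from the reader after n rows, so the rest of the file is left unread.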

Your code is OK; the output you see is the per-line warning described in the pandas read_csv documentation:

warn_bad_lines : boolean, default True

 If error_bad_lines is False, and warn_bad_lines is True, a warning for each “bad line” will be output. (Only valid with C parser). 
