
How to read 10 records each time from csv file using pandas?

I want to read a csv file that has 1000 rows, so I decided to read it in chunks. But I'm running into a problem while reading it.

On the first iteration I want to read the first 10 records and convert two specific columns into a Python dictionary; on the next iteration I want to skip those first 10 records and read the next 10, and so on.

Input.csv:

time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"
1468332421672000,206,60079,60079,60079,2,"[60000,2]"
1468332421818000,206,44664,44664,44664,2,"[40000,2]"
1468332422164000,206,48500,48500,48500,2,"[40000,2]"
1468332423490000,206,39469,37894,38206,12,"[30000,12]"
1468332422538000,206,44023,44023,44023,2,"[40000,2]"
1468332423491000,206,38813,38813,38813,2,"[30000,2]"
1468332423528000,206,75970,75970,75970,2,"[70000,2]"
1468332423533000,206,42546,42470,42508,4,"[40000,4]"
1468332423536000,206,41065,40888,40976,4,"[40000,4]"
1468332423566000,206,66401,62453,64549,6,"[60000,6]"

Code:

if __name__ == '__main__':
   s = 0
   while(True):
      n = 10
      df = pandas.read_csv('Input.csv', skiprows=s, nrows=n)
      d = dict(zip(df.time, df.split_counts))
      print d
      s += n

I'm facing this error:

AttributeError: 'DataFrame' object has no attribute 'time'

I understand that on the second iteration it can't recognize the time and split_counts attributes, but what should I do about it?

The first iteration should work fine, but every subsequent iteration is problematic.

read_csv has a header kwarg whose default value is infer (essentially 0). This means the first row of the parsed csv is used as the column names of the dataframe. Once you skip past the real header with skiprows, the first data row of each chunk gets consumed as the header instead.

read_csv also has another kwarg, names.

As stated in the docs:

header : int or list of ints, default 'infer'. Row number(s) to use as the column names, and the start of the data. Default behavior is as if set to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

names : array-like, default None. List of column names to use. If file contains no header row, then you should explicitly pass header=None.

You should pass header=None and names=['time', 'line_id', 'high', 'low', 'avg', 'total', 'split_counts'] to read_csv.
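A minimal sketch of that fix (the inline sample rows stand in for the question's Input.csv; starting skiprows at 1 skips the header line exactly once, and the loop stops on an empty chunk, also catching EmptyDataError since some pandas versions raise it instead of returning an empty frame):

```python
import io
import pandas as pd
from pandas.errors import EmptyDataError

# Column names copied from the header of Input.csv
cols = ['time', 'line_id', 'high', 'low', 'avg', 'total', 'split_counts']

# A few rows from the question's file; replace io.StringIO(csv_text)
# with 'Input.csv' to read the real file.
csv_text = u'''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"'''

chunks = []
s, n = 1, 2          # start at row 1 so the header line is skipped once
while True:
    try:
        df = pd.read_csv(io.StringIO(csv_text), header=None, names=cols,
                         skiprows=s, nrows=n)
    except EmptyDataError:   # nothing left to parse
        break
    if df.empty:             # or an empty chunk came back
        break
    chunks.append(dict(zip(df.time, df.split_counts)))
    s += n

for d in chunks:
    print(d)
```

With n=2 and five data rows this yields three dictionaries; with the real file, set n=10 as in the question.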

You can use chunksize in read_csv:

import pandas as pd
import io

temp=u'''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"
1468332421672000,206,60079,60079,60079,2,"[60000,2]"
1468332421818000,206,44664,44664,44664,2,"[40000,2]"
1468332422164000,206,48500,48500,48500,2,"[40000,2]"
1468332423490000,206,39469,37894,38206,12,"[30000,12]"
1468332422538000,206,44023,44023,44023,2,"[40000,2]"
1468332423491000,206,38813,38813,38813,2,"[30000,2]"
1468332423528000,206,75970,75970,75970,2,"[70000,2]"
1468332423533000,206,42546,42470,42508,4,"[40000,4]"
1468332423536000,206,41065,40888,40976,4,"[40000,4]"
1468332423566000,206,66401,62453,64549,6,"[60000,6]"'''
# after testing, replace io.StringIO(temp) with the filename

# chunksize=2 here for testing; use chunksize=10 for the real requirement
reader = pd.read_csv(io.StringIO(temp), chunksize=2)
print (reader)
<pandas.io.parsers.TextFileReader object at 0x000000000AD1CD68>
for df in reader:
    print(dict(zip(df.time, df.split_counts)))

{1468332421098000: '[50000,2]', 1468332421195000: '[30000,2]'}
{1468332421383000: '[60000,2]', 1468332423568000: '[30000,2][40000,2]'}
{1468332423489000: '[30000,6]', 1468332421672000: '[60000,2]'}
{1468332421818000: '[40000,2]', 1468332422164000: '[40000,2]'}
{1468332423490000: '[30000,12]', 1468332422538000: '[40000,2]'}
{1468332423491000: '[30000,2]', 1468332423528000: '[70000,2]'}
{1468332423533000: '[40000,4]', 1468332423536000: '[40000,4]'}
{1468332423566000: '[60000,6]'}

See the pandas documentation.
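A related option (a sketch using the same sample data as above) is iterator=True, which returns the same TextFileReader but lets you pull chunks on demand with get_chunk instead of fixing one chunk size up front:

```python
import io
import pandas as pd

csv_text = u'''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"'''

# iterator=True returns a TextFileReader without a fixed chunk size
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)
first = reader.get_chunk(2)   # pull the first 2 records
rest = reader.get_chunk(2)    # then the next ones (only 1 row is left here)
print(dict(zip(first.time, first.split_counts)))
print(dict(zip(rest.time, rest.split_counts)))
```

get_chunk returns whatever rows remain when fewer than requested are left, so the last call simply yields a shorter frame.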
