
How to view single row from csv with pandas

I got this CSV file from https://www.kaggle.com/currie32/crimes-in-chicago

I tried to read the 2008-2011 CSV into a DataFrame using pandas and got a ParserError stating that a certain row of the CSV had 41 fields where 23 were expected.

ParserError: Error tokenizing data. C error: Expected 23 fields in line 1149094, saw 41

I used this command to read the CSV by simply skipping any bad rows:

CHIcrime_df2 = pd.read_csv(path, error_bad_lines=False)

That worked as planned, but I wanted to know what all those extra fields were, so I read the file with csv.reader:

with open('path') as data:
    reader = csv.reader(data)
    interestingrows = [row for idx, row in enumerate(reader) if idx == 1149094]

I expected there to be 41 fields, but there were 23. I also wanted to be sure that I wasn't confusing indexes, so I printed a few rows before and after; each of them had the same number of fields. Can anyone help me understand what's going on with that?
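One thing worth checking in the csv.reader attempt: enumerate counts from 0, while the line number in the pandas message appears to count from 1 with the header included, so idx == 1149094 may be looking one line too far. A minimal sketch that avoids guessing the index is to scan for every row whose width differs from the header's and report csv.reader's own line_num. The function name find_ragged_rows and the tiny sample are made up for illustration; for the real file you would pass open(path, newline='') instead.

```python
import csv
import io

def find_ragged_rows(lines, expected=None):
    """Yield (line_number, field_count, row) for rows whose width differs from `expected`.

    line_number is reader.line_num, the 1-based physical line on which the record ended.
    If `expected` is None, the width of the header row is used.
    """
    reader = csv.reader(lines)
    header = next(reader)
    if expected is None:
        expected = len(header)
    for row in reader:
        if len(row) != expected:
            yield reader.line_num, len(row), row

# Hypothetical miniature file: line 3 has too many fields.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n")
ragged = list(find_ragged_rows(sample))
print(ragged)  # [(3, 5, ['4', '5', '6', '7', '8'])]
```

Note that line_num counts physical lines, not records, so it can run ahead of the record count if quoted fields contain line breaks.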

David Makovoz has explained the issue already, so I'll just answer your actual question:

How to view single row from csv with pandas

If the error occurred at line n (1149094), you skip n-1 rows and read just 1 row:

df = pd.read_csv('Chicago_Crimes_2008_to_2011.csv', skiprows=1149093, nrows=1, header=None)

Result:结果:

>>> print(df.values)
[[2023517 7818233 'HS626859' '11/21/2010 11:00:00 PM'
  '079XX S JEFFERY BLVD' 460 'BATTERY' 'SIMPLE' 'STREET' False False 414
  4.0 8.0 46.0 '08B' 1190912.0 1852820.0 2010 '02/04/2016 06:33:39 AM'
  41.751151039 '-87.1:00:00 AM' '031XX W LEXINGTON ST' 810 'THEFT'
  'OVER $500' 'STREET' False False 1134 11.0 24.0 27.0 6 nan nan 2008
  '08/17/2015 03:03:40 PM' nan nan nan]]
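To try the same idea without downloading the full dataset, here is a minimal sketch on a made-up in-memory CSV. The helper name read_line_n and the sample text are my own inventions, not part of pandas or the Kaggle file. It assumes n is the 1-based line number with the header counted as line 1, as in the error message.

```python
import io

import pandas as pd

# Hypothetical sample: a header, two good rows, and an over-long row on line 4.
csv_text = "a,b,c\n1,2,3\n4,5,6\n7,8,9,10,11\n12,13,14\n"

def read_line_n(source, n):
    """Return line n (1-based, header counted) as a one-row DataFrame."""
    # header=None stops pandas from treating the requested line as column names.
    return pd.read_csv(source, skiprows=n - 1, nrows=1, header=None)

bad_row = read_line_n(io.StringIO(csv_text), 4)
print(bad_row.values)
```

Because only that one line is parsed, the column count is taken from the line itself, which is why the over-long row does not raise the tokenizing error here.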

I agree it is confusing. To figure out what's going on, I had to read the file without using pandas' parser:

import csv
import zipfile
import pandas as pd

archive = zipfile.ZipFile(fname, 'r')
csvfile = archive.open('Chicago_Crimes_2008_to_2011.csv', 'r')
bdata = csvfile.readlines()
# Decode each line and split it into fields (this step is needed to get the lists shown below)
data = list(csv.reader(line.decode() for line in bdata))
data_df = pd.DataFrame.from_records(data[1:])  # the first line is the header

So far, so good.

data_df.shape
>>(2688711, 41)

OK, there is a row with 41 columns.
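If memory is a concern, the same check can probably be done without building a DataFrame at all. The following is only a sketch: it streams the CSV member out of the zip and tallies how many records have each field count. The function name field_count_histogram and the tiny archive are hypothetical, and UTF-8 encoding is an assumption about the file.

```python
import csv
import io
import zipfile
from collections import Counter

def field_count_histogram(zip_source, member):
    """Return a Counter mapping field count -> number of records in `member`."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            # Wrap the binary stream so csv.reader receives text (UTF-8 assumed).
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return Counter(len(row) for row in csv.reader(text))

# Build a hypothetical in-memory archive so the sketch runs without the download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("tiny.csv", "a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n")

histogram = field_count_histogram(buffer, "tiny.csv")
print(histogram)  # Counter({3: 3, 5: 1}); the header is counted as a record
```

On the real archive you would pass fname and 'Chicago_Crimes_2008_to_2011.csv'; a single entry under 41 would correspond to the one over-long row.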

data_df.dropna()
>>1149092   2023517 7818233 HS626859    11/21/2010 11:00:00 PM  079XX S JEFFERY BLVD ...

So it's row #1149093 not counting the header, and #1149094 counting the header.
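Since the off-by-one bookkeeping is easy to get wrong, here is the arithmetic written out as a small sketch. The function name locate is made up, and it assumes the number in the error message is a 1-based line number with the header on line 1, which matches what was observed above.

```python
def locate(error_line):
    """Translate a 1-based file line number (header on line 1) into 0-based positions."""
    return {
        "skiprows": error_line - 1,          # lines to skip so read_csv starts at that line
        "lines_list_index": error_line - 1,  # index into a list of all lines, header at index 0
        "dataframe_index": error_line - 2,   # default index when the header is not a data row
    }

positions = locate(1149094)
print(positions)
# {'skiprows': 1149093, 'lines_list_index': 1149093, 'dataframe_index': 1149092}
```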

print (data[1149093])
>>['2023517', '7818233', 'HS626859', '11/21/2010 11:00:00 PM', '079XX S JEFFERY BLVD', '0460', 'BATTERY', 'SIMPLE', 'STREET', 'False', 'False', '414', '4.0', '8.0', '46.0', '08B', '1190912.0', '1852820.0', '2010', '02/04/2016 06:33:39 AM', '41.751151039', '-87.1:00:00 AM', '031XX W LEXINGTON ST', '0810', 'THEFT', 'OVER $500', 'STREET', 'False', 'False', '1134', '11.0', '24.0', '27.0', '06', '', '', '2008', '08/17/2015 03:03:40 PM', '', '', '']

So it looks like two rows were written into one, with some overlap. But the bottom line is that you are doing the right thing by ignoring that row:

CHIcrime_df2 = pd.read_csv(path, error_bad_lines=False)
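A caveat for readers on newer pandas: as far as I know, error_bad_lines was deprecated in pandas 1.3 and removed in 2.0 in favour of on_bad_lines. From pandas 1.4, on_bad_lines can also be a callable when engine="python", which lets you skip the bad rows and keep a copy of them for inspection. The sketch below assumes pandas 1.4 or later; the sample text and the names bad_lines and remember_and_skip are made up.

```python
import io

import pandas as pd

# Hypothetical sample: line 3 has more fields than the header.
csv_text = "a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n"

bad_lines = []

def remember_and_skip(bad_line):
    """Store the offending row (a list of strings) and tell pandas to drop it."""
    bad_lines.append(bad_line)
    return None  # returning None skips the line

df = pd.read_csv(
    io.StringIO(csv_text),
    engine="python",                 # callables are only supported by the python engine
    on_bad_lines=remember_and_skip,
)
print(df)
print(bad_lines)
```

If you only want to drop the rows, on_bad_lines="skip" should behave like the old error_bad_lines=False; the python engine is slower, so the callable is best reserved for diagnosis.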
