How to view a single row from a csv with pandas

Question

I got this csv file from https://www.kaggle.com/currie32/crimes-in-chicago

I tried to read the 2008-2011 csv into a dataframe using Pandas and got a ParserError stating that a certain row of the csv contains 41 fields where 23 were expected.

ParserError: Error tokenizing data. C error: Expected 23 fields in line 1149094, saw 41

I used this command to read the csv by simply skipping any bad rows:

CHIcrime_df2 = pd.read_csv(path, error_bad_lines=False)

That worked as planned, but I wanted to know what all those extra fields were, so I read the file with csv.reader:

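import csv

# path is the same csv path passed to pd.read_csv above
with open(path, newline='') as data:
    reader = csv.reader(data)
    # enumerate counts from 0, so idx 0 is the header record
    interestingrows = [row for idx, row in enumerate(reader) if idx == 1149094]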

I expected there to be 41 fields, but there were 23. I also wanted to be sure that I wasn't confusing indexes, so I printed a few rows before and after; each of them had the same number of fields. Can anyone help me understand what's going on?

Answer 1

David Makovoz has explained the issue already, so I'll just answer the question in your title:

How to view single row from csv with pandas

If the error occurred at line n (1149094), skip n-1 rows and read just 1 row:

df = pd.read_csv('Chicago_Crimes_2008_to_2011.csv', skiprows=1149093, nrows=1, header=None)

Result:

>>> print(df.values)
[[2023517 7818233 'HS626859' '11/21/2010 11:00:00 PM'
  '079XX S JEFFERY BLVD' 460 'BATTERY' 'SIMPLE' 'STREET' False False 414
  4.0 8.0 46.0 '08B' 1190912.0 1852820.0 2010 '02/04/2016 06:33:39 AM'
  41.751151039 '-87.1:00:00 AM' '031XX W LEXINGTON ST' 810 'THEFT'
  'OVER $500' 'STREET' False False 1134 11.0 24.0 27.0 6 nan nan 2008
  '08/17/2015 03:03:40 PM' nan nan nan]]
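As a cross-check that does not depend on pandas, the same "skip n-1 lines, read one" idea can be sketched with the standard library. The helper names `read_physical_line` and `split_fields` are hypothetical, and the in-memory sample stands in for the real file; on the actual csv you would pass a file object opened with `newline=""` instead.

```python
import csv
import io
from itertools import islice


def read_physical_line(lines, n):
    """Return the n-th physical line (1-based) of an iterable of lines, or None."""
    return next(islice(lines, n - 1, n), None)


def split_fields(line):
    """Split one raw csv line into fields, honouring csv quoting rules."""
    return next(csv.reader([line]))


# Tiny stand-in for the real file: line 3 carries too many fields.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n")

raw = read_physical_line(sample, 3)
fields = split_fields(raw)
print(len(fields), fields)
```

Because this counts physical lines, it lines up with the line number in the ParserError message only as long as no earlier quoted field contains an embedded newline.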

Answer 2

I agree it is confusing. To figure out what's going on I had to read the file without using the pandas csv parser:

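import zipfile
import pandas as pd

# fname is the path to the downloaded zip archive
archive = zipfile.ZipFile(fname, 'r')
csvfile = archive.open('Chicago_Crimes_2008_to_2011.csv', 'r')
bdata = csvfile.readlines()
# decode each line and split it on commas (naive split, ignores quoting)
# so that every record is a list of fields
data = [line.decode().rstrip('\r\n').split(',') for line in bdata]
data_df = pd.DataFrame.from_records(data[1:])  # the first line is the header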

So far, so good.

data_df.shape
>>(2688711, 41)

OK, there is a row with 41 columns.

data_df.dropna()
>>1149092   2023517 7818233 HS626859    11/21/2010 11:00:00 PM  079XX S JEFFERY BLVD ...

So it's row 1149093 not counting the header, and 1149094 counting the header.
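The two numberings (with and without the header) are easy to mix up, and a 0-based `enumerate` adds a third: `idx == 1149094` selects the 1149095th record, not line 1149094. A minimal sketch that avoids index guessing is to scan for records whose field count differs from the header and report `reader.line_num`, which is the 1-based count of physical lines read so far. The function name `find_odd_rows` is hypothetical, and the sample data is made up for illustration.

```python
import csv
import io


def find_odd_rows(lines, expected=None):
    """Return (line_number, field_count) for every record with an unexpected width."""
    reader = csv.reader(lines)
    header = next(reader)
    if expected is None:
        expected = len(header)  # assume the header defines the correct width
    odd = []
    for row in reader:
        if len(row) != expected:
            # line_num is the number of physical lines consumed so far (1-based)
            odd.append((reader.line_num, len(row)))
    return odd


sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n")
odd_rows = find_odd_rows(sample)
print(odd_rows)
```

On the real file this would be called with a handle opened via `open(path, newline="")`. If csv.reader reports no odd rows at all, that suggests it is grouping lines differently from the pandas C parser, for example because of quoted fields.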

print (data[1149093])
>>['2023517', '7818233', 'HS626859', '11/21/2010 11:00:00 PM', '079XX S JEFFERY BLVD', '0460', 'BATTERY', 'SIMPLE', 'STREET', 'False', 'False', '414', '4.0', '8.0', '46.0', '08B', '1190912.0', '1852820.0', '2010', '02/04/2016 06:33:39 AM', '41.751151039', '-87.1:00:00 AM', '031XX W LEXINGTON ST', '0810', 'THEFT', 'OVER $500', 'STREET', 'False', 'False', '1134', '11.0', '24.0', '27.0', '06', '', '', '2008', '08/17/2015 03:03:40 PM', '', '', '']

So it looks like two rows were written into one, with some overlap. The bottom line is that you are doing the right thing by ignoring that row with CHIcrime_df2 = pd.read_csv(path, error_bad_lines=False).
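Note that `error_bad_lines` was deprecated in pandas 1.3 and removed in 2.0 in favour of `on_bad_lines`. With newer versions, a rough equivalent that also keeps the offending rows for inspection might look like the sketch below. It assumes pandas 1.4 or later, where `on_bad_lines` accepts a callable when `engine="python"` is used; the function name `read_csv_collecting_bad_rows` and the sample data are hypothetical.

```python
import io

import pandas as pd


def read_csv_collecting_bad_rows(source):
    """Read a csv, skipping rows with too many fields but remembering them."""
    bad_rows = []

    def remember(bad_line):
        # bad_line is the offending row as a list of strings
        bad_rows.append(bad_line)
        return None  # returning None tells pandas to skip the row

    df = pd.read_csv(source, engine="python", on_bad_lines=remember)
    return df, bad_rows


sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n")
df, bad_rows = read_csv_collecting_bad_rows(sample)
print(len(df), bad_rows)
```

If only skipping is needed, `pd.read_csv(path, on_bad_lines="skip")` works with the default C engine. The Python engine is noticeably slower, which may matter for a file with millions of rows.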
