
Skip rows during csv import pandas

I'm trying to import a .csv file using pandas.read_csv(). However, I don't want to import the 2nd row of the data file (the row with index = 1 for 0-indexing).

I can't see how to avoid importing it, because the arguments used with the command seem ambiguous:

From the pandas website:

skiprows : list-like or integer

Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file.

If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?

You can try it yourself:

>>> import pandas as pd
>>> from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
   0  1
0  1  2
1  5  6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
   0  1
0  3  4
1  5  6

I don't have enough reputation to comment yet, but I want to add to alko's answer for further reference.

From the docs:

skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows.

I ran into the same issue using skiprows while reading a csv file. I was passing skip_rows=1, which does not work; the parameter is spelled skiprows, without an underscore.

A simple example shows how to use skiprows while reading a csv file.

import pandas as pd

# skiprows=1 skips the first line of the file, so parsing starts at the second line
df = pd.read_csv('my_csv_file.csv', skiprows=1)

# print the data frame
print(df)
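To illustrate the spelling point from this answer, here is a minimal sketch using a small in-memory CSV (the data and variable names are made up for illustration). It assumes a reasonably recent pandas, where read_csv rejects unknown keyword arguments instead of silently ignoring them.

```python
from io import StringIO

import pandas as pd

# Made-up sample data, used only for this illustration.
data = "a,b\n1,2\n3,4\n"

# The misspelled keyword is not a read_csv parameter, so pandas raises TypeError.
try:
    pd.read_csv(StringIO(data), skip_rows=1)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# The correct spelling skips the first line; "1,2" then becomes the header.
df = pd.read_csv(StringIO(data), skiprows=1)
print(df)
```

So a skip_rows typo should fail loudly with an error rather than quietly reading the wrong rows.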

All of these answers miss one important point: the n'th line is the n'th line in the file, and not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after that holds the labels, next comes a line that describes the data types, and last comes the data itself. I never know how many comment lines there are, but I know what the first couple of rows are. Example:

# ----------------------------- WARNING ----------------------------------
# Some of the data that you have obtained from this US Geological Survey database
# may not have received Director's approval.
...
agency_cd site_no datetime tz_cd 139719_00065 139719_00065_cd
5s 15s 20d 6s 14n 10s
USGS 08041780 2018-05-06 00:00 CDT 1.98 A

It would be nice if there was a way to automatically skip the n'th row as well as the n'th line.

As a note, I was able to fix my issue with:

import pandas as pd
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)
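An alternative to dropping the row afterwards is to work out the file-line numbers first and hand them to skiprows. The following is only a sketch under assumptions: the sample text is made up to mimic the layout described above, the column names gage_height and gage_height_cd are hypothetical, and it assumes the type-code line always comes directly after the label line.

```python
from io import StringIO

import pandas as pd

# Made-up sample in the same layout as the USGS file described above.
raw = (
    "# ------------------ WARNING ------------------\n"
    "# Some of the data may not have received approval.\n"
    "agency_cd\tsite_no\tdatetime\ttz_cd\tgage_height\tgage_height_cd\n"
    "5s\t15s\t20d\t6s\t14n\t10s\n"
    "USGS\t08041780\t2018-05-06 00:00\tCDT\t1.98\tA\n"
    "USGS\t08041780\t2018-05-06 00:15\tCDT\t1.99\tA\n"
)

lines = raw.splitlines()

# File-line indexes of every comment line.
skip = [i for i, line in enumerate(lines) if line.startswith("#")]

# The first non-comment line holds the labels; the line after it holds the type codes.
header_idx = next(i for i, line in enumerate(lines) if not line.startswith("#"))
skip.append(header_idx + 1)

ds = pd.read_csv(
    StringIO(raw),
    sep="\t",
    skiprows=skip,
    dtype={"site_no": str},  # keep the leading zero in the site number
    parse_dates=["datetime"],
)
print(ds)
```

Because every index in skip is a line number in the file, the number of comment lines does not need to be known in advance. For a file on disk, the same scan can be done by reading the file's lines once before calling read_csv.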

You have the following options to skip rows in Pandas:

import pandas as pd
from io import StringIO

csv = \
"""col1,col2
1,a
2,b
3,c
4,d
"""
pd.read_csv(StringIO(csv))

# Output:
   col1 col2  # index 0
0     1    a  # index 1
1     2    b  # index 2
2     3    c  # index 3
3     4    d  # index 4

Skip two lines at the start of the file (index 0 and 1). The column-name line (index 0) is skipped as well, so the first remaining line is used for column names. To supply your own column names, pass the names=['col1', 'col2'] parameter. Without names, the result is:

pd.read_csv(StringIO(csv), skiprows=2)

# Output:
   2  b
0  3  c
1  4  d
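For comparison, here is a small sketch of the same call with the names parameter mentioned above. It reuses the sample data from this answer (stored here as csv_text, a name chosen for this sketch). When names is given and header is not, pandas treats the first remaining line as data rather than as a header.

```python
from io import StringIO

import pandas as pd

csv_text = """col1,col2
1,a
2,b
3,c
4,d
"""

# Skip the original header line and the first data line, then supply column names.
df = pd.read_csv(StringIO(csv_text), skiprows=2, names=["col1", "col2"])
print(df)
```

The line 2,b is kept as the first data row here instead of being consumed as the header.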

Skip the second and fourth lines (index 1 and 3):

pd.read_csv(StringIO(csv), skiprows=[1, 3])

# Output:
   col1 col2
0     2    b
1     4    d

Skip the last two lines:

pd.read_csv(StringIO(csv), engine='python', skipfooter=2)

# Output:
   col1 col2
0     1    a
1     2    b

Use a lambda function to skip every second line (index 1 and 3):

pd.read_csv(StringIO(csv), skiprows=lambda x: (x % 2) != 0)

# Output:
   col1 col2
0     2    b
1     4    d

skiprows=[1] will skip the second line, not the first.

Also be sure that your file is actually a CSV file. For example, if you had an .xls file and simply changed the file extension to .csv, the file won't import and will give the error above. To check whether this is your problem, open the file in Excel, and it will likely say:

"The file format and extension of 'Filename.csv' don't match. The file could be corrupted or unsafe. Unless you trust its source, don't open it. Do you want to open it anyway?"

To fix the file: open the file in Excel, click "Save As", choose the file format to save as (use .csv), then replace the existing file.

This was my problem, and it fixed the error for me.
