pandas read_csv skip rows of unwanted descriptions and blank lines till the real data part

I have many CSV files that I want to read in. I want to skip the leading rows until the line where the real data begins. My data lines happen to begin with certain strings such as "OPQ" or "BST". The files look like:

"This is a new record.
 There are some missing data.
 The test condition is 60 degree"

OPQ  , 11  , speed , -3 , 20
BST  , 20  , speed , 4  , 10
....

The first several lines vary from file to file. I just want to skip them: there may be 3 or more lines of description, followed by several blank lines. The data begins at the first line that starts with "OPQ" or "BST". The skiprows option of pandas.read_csv only skips a predefined number of rows, which does not work in my case.

Thanks.

You should be able to do this in the following manner:

import pandas as pd

# List every column name here, because the description lines have fewer fields than the data rows
my_cols = ["A", "B", "C", "D", "E"]

df = pd.read_csv("YOUR_CSV_HERE.csv", names=my_cols, engine="python")

start_val = "OPQ"

# Strip the padding so that "OPQ  " matches "OPQ", then take the first matching row
start_index = df.index[df["A"].str.strip() == start_val][0]
df1 = df.iloc[start_index:, :].reset_index(drop=True)

df1 should contain all your data from the row that contains the value "OPQ" onwards, with its index reset.

What this snippet basically does is:

  • sets up the expected column names
  • builds a dataframe from your CSV, with NaN for missing values in the expected columns
  • searches the dataframe for the index of the row you want to start from (by finding a specific value in a specific column)
  • slices the dataframe from this index and resets the index of the new dataframe
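
The steps above can be tried on an in-memory copy of the sample file. This is only a sketch: the `sample` text, the use of io.StringIO in place of a real file, and the choice to accept either "OPQ" or "BST" as the start marker are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical sample mirroring the file layout shown in the question
sample = (
    '"This is a new record.\n'
    ' There are some missing data.\n'
    ' The test condition is 60 degree"\n'
    '\n'
    'OPQ  , 11  , speed , -3 , 20\n'
    'BST  , 20  , speed , 4  , 10\n'
)

my_cols = ["A", "B", "C", "D", "E"]
df = pd.read_csv(io.StringIO(sample), names=my_cols, engine="python")

# Strip the padding around each value in column A, then flag the marker rows
is_data = df["A"].str.strip().isin(["OPQ", "BST"])

# Index of the first flagged row (raises IndexError if no marker is found)
start_index = is_data[is_data].index[0]

df1 = df.iloc[start_index:, :].reset_index(drop=True)
print(df1)
```

Note that this approach still loads the description lines into the dataframe before discarding them, so it suits files that fit comfortably in memory.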

I would recommend using shell commands here. That way you save memory, because you do not need to load the data into memory first. The method pd.read_csv() has a skiprows parameter, which takes arguments as described below.

skiprows : list-like, int or callable, optional. Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
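
As a small illustration of the callable form quoted above, here is a sketch that applies lambda x: x in [0, 2] to a four-row sample. The `sample` text and the use of io.StringIO instead of a file on disk are assumptions for the sake of a self-contained example.

```python
import io
import pandas as pd

# Hypothetical in-memory data: four rows, no header
sample = (
    "OPQ,11,speed,-3,20\n"
    "BST,20,speed,4,10\n"
    "OPQ,11,speed,-3,20\n"
    "BST,20,speed,4,10\n"
)

# The callable receives each 0-indexed row number; returning True skips that row
df = pd.read_csv(io.StringIO(sample), header=None, skiprows=lambda x: x in [0, 2])
print(df)
```

The callable only sees the row number, not the row's content, so on its own it cannot detect where a line starting with "OPQ" is. That is why the row numbers have to be worked out beforehand.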

You can specify row numbers, but you first need to know what they are. The easiest way is to get the line numbers with the following shell command.

Process

Let's say you have a data file in .tsv format named data.tsv:

OPQ   11  speed  -3  20
BST   20  speed  4   10
OPQ   11  speed  -3  20
BST   20  speed  4   10

We want to filter out the 1st and 3rd rows, so we run:

$ cat -n data.tsv | grep OPQ | awk '{print $1}' > filter.csv

This command writes the line numbers of the lines containing OPQ to a file called filter.csv. So filter.csv looks like this:

1
3

Now we can tell pandas which rows to skip.

Important NOTE: The skiprows documentation states that line numbers are 0-indexed, but the line numbers we have are 1-indexed, so we need to subtract 1 from them in the code.

Code

import pandas as pd

# filter.csv holds one 1-indexed line number per line
filtered_rows = pd.read_csv('./filter.csv', header=None)
filtered_rows[0] = filtered_rows[0] - 1  # convert to 0-indexed line numbers
filtered_rows = filtered_rows[0].tolist()

data = pd.read_csv('./data.tsv', sep='\t', header=None,
                   skiprows=filtered_rows)

Output

     0   1      2  3   4
0  BST  20  speed  4  10
1  BST  20  speed  4  10
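
If a shell is not at hand, the same line numbers can be collected in Python while streaming the file line by line. The sketch below is not part of the original answer: it uses a hypothetical in-memory `sample` in place of data.tsv, and the names `line_numbers` and `rows_to_skip` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.tsv
sample = (
    "OPQ\t11\tspeed\t-3\t20\n"
    "BST\t20\tspeed\t4\t10\n"
    "OPQ\t11\tspeed\t-3\t20\n"
    "BST\t20\tspeed\t4\t10\n"
)

# Rough equivalent of the cat -n | grep OPQ | awk pipeline:
# 1-indexed numbers of the lines that contain OPQ
line_numbers = [n for n, line in enumerate(io.StringIO(sample), start=1) if "OPQ" in line]

# skiprows expects 0-indexed line numbers
rows_to_skip = [n - 1 for n in line_numbers]

data = pd.read_csv(io.StringIO(sample), sep="\t", header=None, skiprows=rows_to_skip)
print(data)
```

With a real file, io.StringIO(sample) would be replaced by an open file handle for the first pass and by the file path for read_csv.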

pandas will also accept an open file (or file-like) object instead of a file path. You can use Python to open the file and read past the lines you don't want until you are at the right place in the file; pandas will then process only the lines that are left.

import pandas as pd

with open("data.csv") as f:
    # Throw away lines of the file until just before the data starts.
    # In the example the last line before the actual data starts is a blank line.
    # readline() returns '' at end of file, so the loop also stops if no blank line exists.
    while f.readline().strip():
        pass

    # pandas will only process the lines from the current file position onwards
    df = pd.read_csv(f, header=None)

# The with block closes the file when it ends

# Do whatever you want with the dataframe here
print(df)

I assumed the actual data is separated from the unwanted first part of the text file by a blank line. If you need to actually check the first line of the data, it is a little trickier, because you will need to move the file position back after reading that line.
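
One way to handle that trickier case is to remember the position of each line with tell() before reading it, and seek() back once the first data line is found. The following is a sketch under assumptions rather than a definitive implementation: the helper name `read_from_first_data_line`, its `prefixes` parameter, and the in-memory `sample` are all hypothetical.

```python
import io
import pandas as pd

def read_from_first_data_line(f, prefixes=("OPQ", "BST")):
    """Hypothetical helper: parse f starting at the first line that begins with one of prefixes."""
    while True:
        pos = f.tell()        # remember where this line starts
        line = f.readline()
        if line == "":        # end of file reached without finding a data line
            raise ValueError("no line starts with any of %r" % (prefixes,))
        if line.startswith(prefixes):
            f.seek(pos)       # move back so pandas sees this line too
            break
    # skipinitialspace drops the spaces that follow each comma
    return pd.read_csv(f, header=None, skipinitialspace=True)

# Hypothetical sample mirroring the file layout shown in the question
sample = io.StringIO(
    '"This is a new record.\n'
    ' There are some missing data.\n'
    ' The test condition is 60 degree"\n'
    '\n'
    'OPQ  , 11  , speed , -3 , 20\n'
    'BST  , 20  , speed , 4  , 10\n'
)

df = read_from_first_data_line(sample)
print(df)
```

The helper uses readline() rather than a for loop over the file, because Python disables tell() on a text file while it is being iterated with for.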
