pandas read_csv: skip rows of unwanted descriptions and blank lines until the real data begins
I have many CSV files to read in, and I want to skip the leading rows until the line where the real data begins. In my files, the data lines happen to begin with certain strings such as "OPQ" or "BST". The files look like:
"This is a new record.
There are some missing data.
The test condition is 60 degree"
OPQ , 11 , speed , -3 , 20
BST , 20 , speed , 4 , 10
....
The first several lines vary from file to file. I just want to skip them: there may be three or more lines of description, followed by several blank lines. The data begins at the first line that starts with "OPQ" or "BST". Given a number, the skiprows argument of pandas.read_csv only skips a predefined number of rows, which does not work in my case.
Thanks.
You should be able to do this in the following manner:
import pandas as pd
# List every column name here, since the rows in your file are not uniform
my_cols = ["A", "B", "C", "D", "E"]
df = pd.read_csv("YOUR_CSV_HERE.csv", names=my_cols, engine='python')
start_val = "OPQ"
# Strip the spaces around the delimiter before comparing ("OPQ " != "OPQ")
start_index = df.index[df.A.astype(str).str.strip() == start_val][0]
df1 = df.iloc[start_index:, :]
df1 = df1.reset_index(drop=True)
df1 should now contain all your data from the row that contains the value "OPQ" onward, with its index reset.
What this snippet basically does is build a dataframe with NaN for missing values in the expected columns, then keep only the rows from the first data row onward.
I would recommend using shell commands here. That way you can save memory, because you do not need to load the unwanted data into memory first.
The method pd.read_csv() has a parameter, skiprows, which takes arguments as described below.
skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
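As a small illustration of the callable form, the sketch below passes the documentation's own lambda to skiprows. The in-memory sample text is made up for the example and stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical sample: rows 0 and 2 are unwanted, rows 1 and 3 are data
text = (
    "skip me\n"
    "OPQ,11,speed,-3,20\n"
    "skip me too\n"
    "BST,20,speed,4,10\n"
)

# The callable receives each 0-indexed row number and returns True to skip it
df = pd.read_csv(io.StringIO(text), header=None, skiprows=lambda i: i in [0, 2])
print(df)
```

Only the two data rows should remain. Note that the callable sees row numbers, not row contents, which is why the row numbers have to be known beforehand.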
You can specify row numbers, but you first need to know what they are. One of the easiest ways is to get the line numbers with the following shell command.
Process
Let's say you have a data file in .tsv format, named data.tsv:
OPQ 11 speed -3 20
BST 20 speed 4 10
OPQ 11 speed -3 20
BST 20 speed 4 10
We want to filter out the 1st and 3rd rows, so you would run:
$ cat -n data.tsv | grep OPQ | awk '{print $1}' > filter.csv
This command writes the line numbers where OPQ occurs to a file called filter.csv. So filter.csv looks like this:
1
3
Now we can tell pandas which rows to skip.
Important note: the documentation for the skiprows parameter says that line numbers are 0-indexed, but the line numbers we collected are 1-indexed, so we need to convert them, which is easy to do in the code.
Code
import pandas as pd
filtered_rows = pd.read_csv('./filter.csv', header=None)
filtered_rows[0] = filtered_rows[0] - 1 # assuring to be 0-indexed
filtered_rows = filtered_rows[0].tolist()
data = pd.read_csv('./data.tsv', sep='\t', header=None,
skiprows=filtered_rows)
Output
0 1 2 3 4
0 BST 20 speed 4 10
1 BST 20 speed 4 10
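If a shell is not available, the same line numbers could be collected in Python instead. This is only a sketch under assumptions: rows_containing is a hypothetical helper, not part of pandas, and the sample text stands in for data.tsv. Because enumerate counts from 0, no 1-indexed to 0-indexed conversion is needed.

```python
import io

import pandas as pd


def rows_containing(lines, needle):
    """Return the 0-indexed numbers of the lines that contain `needle`."""
    return [i for i, line in enumerate(lines) if needle in line]


# Hypothetical stand-in for data.tsv (tab-separated)
sample = (
    "OPQ\t11\tspeed\t-3\t20\n"
    "BST\t20\tspeed\t4\t10\n"
    "OPQ\t11\tspeed\t-3\t20\n"
    "BST\t20\tspeed\t4\t10\n"
)

# First pass: find the rows to skip; second pass: let pandas parse the rest
skip = rows_containing(io.StringIO(sample), "OPQ")
data = pd.read_csv(io.StringIO(sample), sep="\t", header=None, skiprows=skip)
print(data)
```

With a real file, open it once to compute skip and pass the path to pd.read_csv afterwards. The first pass streams line by line, so it does not hold the whole file in memory.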
Pandas will also accept an open file (or file-like) object instead of a filepath. You can use Python to open the file and read the lines you don't want until you are at the right place in the file; Pandas will then only process the lines that are left.
import pandas as pd
# Opening the file in a with block closes it automatically when you're done
with open("data.csv") as f:
    # Throw away lines of the file until just before the data starts
    # In the example the last line before the actual data starts is a blank line
    # ('' means end of file, so stop there instead of looping forever)
    while f.readline() not in ('\n', ''):
        pass
    # Pandas will only process the lines from the current file position onwards
    df = pd.read_csv(f, header=None)
# Do whatever you want with dataframe here
print(df)
I assumed the actual data was separated from the unwanted first part of the text file by a blank line. If you need to actually check the first line of the data, it is a little trickier, because you will need to move the file position back after reading the line.
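One possible way to handle that trickier case is to remember the file position with tell() before each readline() and seek() back once a data line is found. The sketch below is an illustration, not a tested recipe for every file: read_from_first_data_line and its prefixes argument are hypothetical names, and the sample text imitates the file from the question.

```python
import io

import pandas as pd


def read_from_first_data_line(f, prefixes=("OPQ", "BST")):
    """Advance `f` to the first line starting with a prefix, then parse it."""
    prefixes = tuple(prefixes)
    while True:
        pos = f.tell()       # position of the line about to be read
        line = f.readline()  # readline() keeps tell() usable, unlike `for line in f`
        if line == "":
            raise ValueError("no line starting with any of the prefixes was found")
        if line.startswith(prefixes):
            f.seek(pos)      # rewind so pandas sees this line as the first row
            break
    return pd.read_csv(f, header=None, skipinitialspace=True)


# Hypothetical sample shaped like the file in the question
sample = (
    '"This is a new record.\n'
    "There are some missing data.\n"
    'The test condition is 60 degree"\n'
    "\n"
    "\n"
    "OPQ , 11 , speed , -3 , 20\n"
    "BST , 20 , speed , 4 , 10\n"
)

df = read_from_first_data_line(io.StringIO(sample))
print(df)
```

For a real file, pass the object returned by open("data.csv") instead of the StringIO. The text columns may keep the spaces that precede each comma, so something like df[0].str.strip() may be needed before comparing values.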