pandas read_csv skip rows of unwanted descriptions and blank lines till the real data part

Question

I have many CSV files to read. In each one I want to skip the leading rows up to the line where the real data begins. The data lines in my files happen to begin with a fixed string such as "OPQ" or "BST". The files look like this:

"This is a new record.
 There are some missing data.
 The test condition is 60 degree"

OPQ  , 11  , speed , -3 , 20
BST  , 20  , speed , 4  , 10
....

The first several lines vary. I just want to skip them: three or more lines of description, followed by several blank lines. The data starts at the first line beginning with "OPQ" or "BST". The skiprows argument of pandas.read_csv only skips a predefined number of rows, which does not work in my case.

Thanks.

Answer 1

You should be able to do this in the following manner -

import pandas as pd

# Name every column up front, since the description rows have fewer fields than the data rows
my_cols = ["A", "B", "C", "D", "E"]

df = pd.read_csv("YOUR_CSV_HERE.csv", names=my_cols, engine='python')

start_vals = ["OPQ", "BST"]  # the data rows begin with one of these

# Strip padding before comparing: the sample rows have spaces before each comma
is_start = df.A.str.strip().isin(start_vals)
start_index = df.index[is_start].tolist()[0]
df1 = df.iloc[start_index:, :]
df1 = df1.reset_index(drop=True)

df1 should hold every row from the first row whose first column is "OPQ" or "BST" onward, with its index reset.

What this snippet basically does is -

sets up expected column names
makes a dataframe based on your csv with NaN for missing values in expected columns
goes through the dataframe to find the index of the row you want to start from (by finding a specific value in a specific column)
splits the dataframe at this index and reindexes the new dataframe

Answer 2

I would recommend using shell commands here. That way you save memory, because you do not have to load the whole file into memory before finding the rows to drop. pd.read_csv() has a skiprows parameter, documented as follows.

skiprows : list-like, int or callable, optional Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

You can specify row numbers, but you first need to know what they are. The easiest way is to get the line numbers with a shell command.

Process

Let's say you have a data file in .tsv format called data.tsv:

OPQ   11  speed  -3  20
BST   20  speed  4   10
OPQ   11  speed  -3  20
BST   20  speed  4   10

We want to filter out the 1st and 3rd rows. To do that, run:

$ cat -n data.tsv | grep OPQ | awk '{print $1}' > filter.csv

This command writes the line numbers where OPQ occurs to a file called filter.csv, which then looks like this:

1
3

Now we can tell pandas which rows to skip.

Important note: the skiprows documentation says line numbers are 0-indexed, but cat -n produces 1-indexed line numbers, so we subtract 1 in the code.

Code

import pandas as pd

filtered_rows = pd.read_csv('./filter.csv', header=None)
filtered_rows[0] = filtered_rows[0] - 1  # convert 1-indexed line numbers to 0-indexed
filtered_rows = filtered_rows[0].tolist()

data = pd.read_csv('./data.tsv', sep='\t', header=None,
                   skiprows=filtered_rows)

Output

     0   1      2  3   4
0  BST  20  speed  4  10
1  BST  20  speed  4  10
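
The line-number lookup can also be done in Python rather than the shell, which avoids the intermediate filter.csv and the conversion from 1-indexed to 0-indexed numbers. The sketch below is one possible way to do it, not the only one. The helper name read_without_lines and its parameters are made up for illustration, and it assumes a plain tab-separated file with no quoted fields that span several lines.

```python
import pandas as pd


def read_without_lines(path, needle, sep="\t"):
    """Read `path`, skipping every line that contains `needle`.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    # enumerate() counts from 0, which matches what skiprows expects
    with open(path) as f:
        rows_to_skip = [i for i, line in enumerate(f) if needle in line]
    return pd.read_csv(path, sep=sep, header=None, skiprows=rows_to_skip)
```

Calling read_without_lines("data.tsv", "OPQ") on the sample file above should leave only the two BST rows.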

Answer 3

Pandas will also accept an open file (or file-like) object instead of a filepath. You can use Python to open the file and read the lines you don't want until you are at the right place in the file, then Pandas will only process the lines that are left.

import pandas as pd

with open("data.csv") as f:
    # Throw away lines of the file until just before the data starts.
    # In the example the last line before the actual data starts is a blank line.
    # Also stop at end of file (readline() returns ''), so a file without a
    # blank line cannot loop forever.
    line = f.readline()
    while line not in ('\n', ''):
        line = f.readline()

    # Pandas will only process the lines from the current file position onwards
    df = pd.read_csv(f, header=None)

# The with block closes the file for you when it ends

# Do whatever you want with dataframe here
print(df)

I assumed the actual data is separated from the unwanted first part of the text file by a blank line. If you need to check the first line of the data itself, it is a little trickier, because you have to move the file position back after reading that line.
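
One way to move the file position back is to record it with tell() before each readline() and seek() to it once the data line is found. This is only a sketch under assumptions: the function name read_csv_from_marker and the prefixes argument are hypothetical, the markers "OPQ" and "BST" come from the question, and skipinitialspace=True is added to drop the spaces that follow each comma in the sample.

```python
import pandas as pd


def read_csv_from_marker(handle, prefixes=("OPQ", "BST")):
    """Skip lines of an open text file until one starts with a prefix,
    then let pandas parse that line and everything after it.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    while True:
        pos = handle.tell()       # remember where this line begins
        line = handle.readline()
        if not line:              # readline() returns '' at end of file
            raise ValueError("no line starting with %r found" % (prefixes,))
        if line.lstrip().startswith(prefixes):
            handle.seek(pos)      # rewind so pandas sees the data line too
            break
    return pd.read_csv(handle, header=None, skipinitialspace=True)
```

It could be called as follows: with open("data.csv") as f: df = read_csv_from_marker(f). Using readline() rather than a for loop over the file matters here, because tell() is not available while a text file is being iterated.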
