
Pandas: read_csv ignore rows after a blank line

There is a weird .csv file, something like:

header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33

That part is fine, but after these lines there is always a blank line followed by many useless lines. The whole file looks something like:


header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33

dhjsakfjkldsa
fasdfggfhjhgsdfgds
gsdgffsdgfdgsdfgs
gsdfdgsg

The number of lines at the bottom is completely random; the only marker is the empty line before them.

Pandas has a parameter skipfooter for ignoring a known number of rows at the end of the file.

Any idea how to ignore these rows without actually opening (open()...) the file and removing them?

There is no option to make read_csv stop at the first blank line. The function cannot accept or reject lines based on arbitrary conditions; it can only skip blank lines (optionally) or reject rows that break the expected shape of the data (rows with too many separators).

You can normalize the data with the approaches below (no manual file parsing, pure pandas):

  1. Knowing the number of desired/trash data rows. [Manual]

    pd.read_csv('file.csv', nrows=3) or pd.read_csv('file.csv', skipfooter=4, engine='python') (skipfooter is only supported by the python engine)

  2. Preserving the desired data by dropping the other rows from the DataFrame. [Automatic]

    df.dropna(axis=0, how='any', inplace=True)

The results will be:

  header1 header2 header3
0   val11   val12   val13
1   val21   val22   val23
2   val31   val32   val33
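As a self-contained sketch of approach 2: the sample data from the question is inlined here with io.StringIO (my assumption about the exact file contents); in practice you would pass a file path to read_csv. The junk rows have only one field, so header2 and header3 come back as NaN there, and dropna removes them.

```python
import io

import pandas as pd

# Inlined sample file; a real file path works the same way.
raw = """header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33

dhjsakfjkldsa
fasdfggfhjhgsdfgds
gsdgffsdgfdgsdfgs
gsdfdgsg
"""

df = pd.read_csv(io.StringIO(raw))
# The junk rows parse with missing fields (NaN), so dropping any row
# that contains a NaN keeps only the well-formed data rows.
df.dropna(axis=0, how='any', inplace=True)
print(df)
```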

If you're using the csv module, it's fairly trivial to detect an empty row.

import csv

with open(filename, newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        if not row:  # csv.reader yields an empty list for a blank line
            break
        # Otherwise, process the row
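The rows collected this way can then be handed to pandas. A minimal sketch (again inlining the sample data with io.StringIO, which is my assumption about the file contents):

```python
import csv
import io

import pandas as pd

# Inlined sample data; an open file object works the same way.
raw = """header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33

dhjsakfjkldsa
"""

f = io.StringIO(raw)
reader = csv.reader(f)
header = next(reader)  # first line holds the column names
rows = []
for row in reader:
    if not row:        # blank line: stop before the junk section
        break
    rows.append(row)

df = pd.DataFrame(rows, columns=header)
print(df)
```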

The best way to do this with pandas' native functions is a combination of arguments and function calls - a bit messy, but definitely possible!

First, call read_csv with skip_blank_lines=False, since the default is True.

df = pd.read_csv(<filepath>, skip_blank_lines=False)

Then, create a dataframe that contains only the blank rows, using the isnull (or isna) method. This works by locating (.loc) the indices where all values are null/blank.

blank_df = df.loc[df.isnull().all(axis=1)]

By utilizing the fact that this dataframe preserves the original indices, you can get the index of the first blank row.

Because this uses indexing, you should first check that there actually is a blank line in the csv. Finally, slice the original dataframe to remove the unwanted lines.

if len(blank_df) > 0:
    first_blank_index = blank_df.index[0]
    df = df[:first_blank_index]
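Putting the steps together, they can be wrapped in a small helper. A sketch (the function name read_until_blank is my own, not a pandas API, and the sample data is an assumption inlined with io.StringIO):

```python
import io

import pandas as pd

def read_until_blank(filepath_or_buffer):
    # Keep blank lines so the first one can serve as the cut-off marker.
    df = pd.read_csv(filepath_or_buffer, skip_blank_lines=False)
    # Rows where every value is NaN correspond to blank lines in the file.
    blank_df = df.loc[df.isnull().all(axis=1)]
    if len(blank_df) > 0:
        df = df[:blank_df.index[0]]
    return df

# Demo on the sample data from the question:
raw = """header1,header2,header3
val11,val12,val13
val21,val22,val23
val31,val32,val33

dhjsakfjkldsa
fasdfggfhjhgsdfgds
"""
df = read_until_blank(io.StringIO(raw))
print(df)
```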
