
Trying to split a tsv file into two by looking for an empty line

I'm trying to use pandas to split a tsv file that looks something like this:

x y
x y
[empty row]
x y z a b c
x y z a b c

into two separate dataframes, one containing the part before the empty line and one containing the rest of the file. I can't read the whole file into a single dataframe because the two portions have different numbers of columns.

Is there a way I can establish the empty row as a "stopping point" for the first dataframe, and read the rest of the tsv file into another dataframe?

Currently, I'm working around this by skipping lines with pd.read_csv(file_name, skiprows = 3, delimiter = '\t'), but hard-coding the number of rows to skip is not a good approach.

Thanks!

Try this:

First, read the whole file into a single df. Do not skip blank lines; with skip_blank_lines=False, each blank line is read as a row of NaN.

import pandas as pd
df = pd.read_csv(filename, delimiter='\t', skip_blank_lines=False)

Now, identify the empty row and create separate groups in the df.

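# Flag rows in which every column is NaN (the blank lines)
df['emptyrow'] = df.isnull().all(axis=1)
# Start a new group number each time the flag changes
df['group'] = (df['emptyrow'] != df['emptyrow'].shift()).cumsum()
groups = df.groupby(by='group')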
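To illustrate how the group numbers come out, here is a minimal sketch on a hand-built frame. The column names and values are made up for illustration, and the all-NaN row stands in for a blank line read with skip_blank_lines=False.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a file read with skip_blank_lines=False;
# the all-NaN row at index 2 plays the role of the blank line.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 3.0, 4.0],
    "y": [5.0, 6.0, np.nan, 7.0, 8.0],
})

# True only where every column in the row is missing.
is_empty = df.isnull().all(axis=1)

# The counter increases each time the flag flips between False and True.
group = (is_empty != is_empty.shift()).cumsum()

print(group.tolist())  # [1, 1, 2, 3, 3]
```

The blank row ends up in a group of its own (2 here), so the data before it is group 1 and the data after it is group 3.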

With the grouping in place, each group in df can be accessed using groups.get_group(key). We can also build a dict that holds one data frame per group key.

split_dfs = {}
for key, part in groups:
    # Skip the group that holds only the blank separator rows
    if part['emptyrow'].all():
        continue
    split_dfs[key] = part.drop(['emptyrow', 'group'], axis=1)

Now split_dfs is a dict of dfs, each holding the subset of the original df that belongs to one group.
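One caveat: because the two portions have different numbers of columns, reading the whole file in a single read_csv call may raise a parser error when the later rows are wider than the first one. As an alternative (not the method described above), the text can be split at the first blank line and each part parsed on its own. The following is only a sketch: the helper name read_two_tables and the sample data are hypothetical, and it assumes each part starts with its own header row (pass header=None to read_csv otherwise).

```python
import io

import pandas as pd


def read_two_tables(text, sep="\t"):
    """Split text at the first blank line and parse each part separately.

    Hypothetical helper; assumes each part begins with a header row.
    Raises StopIteration if the text contains no blank line.
    """
    lines = text.splitlines()
    # Index of the first line that is empty or contains only whitespace.
    split_at = next(i for i, line in enumerate(lines) if not line.strip())
    first = "\n".join(lines[:split_at])
    second = "\n".join(lines[split_at + 1:])
    return (
        pd.read_csv(io.StringIO(first), sep=sep),
        pd.read_csv(io.StringIO(second), sep=sep),
    )


# Made-up sample: a 2-column table, a blank line, then a 6-column table.
sample = "x\ty\n1\t2\n\nx\ty\tz\ta\tb\tc\n3\t4\t5\t6\t7\t8\n"
df1, df2 = read_two_tables(sample)
print(df1.shape, df2.shape)  # (1, 2) (1, 6)
```

For a file on disk, the same helper could be called with the file's contents, for example read_two_tables(open(file_name).read()). Since each part is parsed independently, the differing column counts never meet in one parser call.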
