I'm trying to use pandas to split a tsv file that looks something like this:
xy
xy
[empty row]
xyzabc
xyzabc
into 2 separate dataframes, one containing the part before the empty line and one containing the rest of the file. I can't read the whole file into one dataframe because the two portions have a different number of columns.
Is there a way I can establish the empty row as a "stopping point" for the first dataframe, and read the rest of the tsv file into another dataframe?
Currently, I'm working around this by skipping a fixed number of lines using pd.read_csv(file_name, skiprows=3, delimiter='\t'), but hard-coding the row count is not a very good approach.
Thanks!
Try this:
First, read the whole file into a single df. Do not skip blank lines; with skip_blank_lines=False, each blank line is read as a row of NaN.
import pandas as pd
df = pd.read_csv(filename, delimiter='\t', skip_blank_lines=False)
Now, identify the empty row and create separate groups in the df.
# True only for rows in which every column is NaN (i.e. the blank lines)
df['emptyrow'] = df.isnull().all(axis=1)
df['group'] = (df['emptyrow'] != df['emptyrow'].shift()).cumsum()
groups = df.groupby(by='group')
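To see why the shift/cumsum line produces group labels, here is a minimal sketch on a hand-made boolean Series (the name is_empty and its values are made up for illustration, standing in for the emptyrow column):

```python
import pandas as pd

# Hypothetical flags: two data rows, one blank row, two data rows
is_empty = pd.Series([False, False, True, False, False])

# A row starts a new group whenever its flag differs from the previous row's flag
changed = is_empty != is_empty.shift()

# Counting the changes so far gives a label that is constant within each run
group = changed.cumsum()
print(group.tolist())  # [1, 1, 2, 3, 3]
```

So the rows before the blank line get label 1, the blank line itself gets 2, and the rows after it get 3.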
With this, we have groups within df, which can be accessed using groups.get_group(key). We can also build a dict of data frames, one per group key.
split_dfs = {}
for grp in groups.groups.keys():
    split_dfs[grp] = groups.get_group(grp).drop(['emptyrow','group'], axis=1)
Now, split_dfs is a dict of dfs, each holding a subset of the original df based on the group we created. Note that the blank rows end up in a group of their own.
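One caveat: if the two sections really have different column counts, a single read_csv call over the whole file may raise a ParserError when it reaches the wider rows. A possible alternative (a different technique from the groupby approach, sketched here under assumptions) is to split the raw text at blank lines first and parse each chunk separately. The helper name read_sections and the sample string are hypothetical, and header=None assumes the sections have no header row; adjust that if yours do.

```python
import io
import pandas as pd

def read_sections(text, sep="\t"):
    """Split text at blank lines and parse each chunk as its own DataFrame."""
    sections, current = [], []
    for line in text.splitlines():
        if line.strip() == "":
            # Blank (or whitespace-only) line: close the current section, if any
            if current:
                sections.append(current)
                current = []
        else:
            current.append(line)
    if current:
        sections.append(current)
    # header=None is an assumption: every row is treated as data
    return [
        pd.read_csv(io.StringIO("\n".join(chunk)), sep=sep, header=None)
        for chunk in sections
    ]

# Made-up sample mirroring the layout in the question
sample = "x\ty\nx\ty\n\nx\ty\tz\ta\tb\tc\nx\ty\tz\ta\tb\tc\n"
first, second = read_sections(sample)
print(first.shape, second.shape)  # (2, 2) (2, 6)
```

For a real file you would pass in the result of open(file_name).read(). Because each chunk is parsed on its own, the differing column counts never meet in one parser call.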