
How to split a pandas data frame into multiple data frames based on ID?

I have this data in a single CSV file that contains multiple repeated heading rows (ID, Product, etc.). I want to keep only the last value (last row) of each set; all other rows are to be deleted. Can someone help me with a script to do so? The data looks like this:

ID Product Fat SNF Protein
365 PB 11.11.20 Fresh Milk Monitoring 2016 4.08 8.52 3.19
365 PB 11.11.20 Fresh Milk Monitoring 2016 4.04 8.52 3.2
365 PB 11.11.20 Fresh Milk Monitoring 2016 0.026 0.004 0.009
365 PB 11.11.20 Fresh Milk Monitoring 2016 4.06 8.52 3.2
ID Product Fat SNF Protein
465 PB 11.11.20 Fresh Milk Monitoring 2016 3.73 8.81 3.06
465 PB 11.11.20 Fresh Milk Monitoring 2016 3.72 8.8 3.08
465 PB 11.11.20 Fresh Milk Monitoring 2016 0.004 0.008 0.012
465 PB 11.11.20 Fresh Milk Monitoring 2016 3.73 8.81 3.07
ID Product Fat SNF Protein
1465 PB 11.11.20 Fresh Milk Monitoring 2016 4.08 8.52 3.15
1465 PB 11.11.20 Fresh Milk Monitoring 2016 4.04 8.52 3.16
1465 PB 11.11.20 Fresh Milk Monitoring 2016 0.026 0.004 0.006
1465 PB 11.11.20 Fresh Milk Monitoring 2016 4.06 8.52 3.15

What I want to get is this, i.e. the last row of each set:

ID                Product                     Fat   SNF   Protein
365 PB 11.11.20   Fresh Milk Monitoring 2016  4.06  8.52  3.2
465 PB 11.11.20   Fresh Milk Monitoring 2016  3.73  8.81  3.07
1465 PB 11.11.20  Fresh Milk Monitoring 2016  4.06  8.52  3.15

Can anyone help me? Thanks.

Try:

df.loc[df.eq(df.columns).all(1).shift(-1, fill_value=True)]

Output:

                  ID                     Product   Fat   SNF Protein
3    365 PB 11.11.20  Fresh Milk Monitoring 2016  4.06  8.52     3.2
8    465 PB 11.11.20  Fresh Milk Monitoring 2016  3.73  8.81    3.07
13  1465 PB 11.11.20  Fresh Milk Monitoring 2016  4.06  8.52    3.15

Explanation: the code can be broken down like this:

meta_rows = df.eq(df.columns).all(1)

checks for the meta rows, i.e. rows where every cell equals the corresponding header name. If the first row in your sample data is not the column names, you can use:

meta_rows = df.eq(df.iloc[0]).all(1)

Now, you want the rows immediately before these meta rows, so we shift the marker up by one:

marker = meta_rows.shift(-1, fill_value=True)

and then finally use boolean indexing to select those rows:

df[marker]
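The steps above can be sketched end to end on a hypothetical in-memory slice of the sample data (two sets, with the repeated header appearing as an ordinary data row, as happens when the whole CSV is read with a single header):

```python
import pandas as pd

# Small stand-in for the question's data: the repeated header shows up
# as a regular data row between the two sets.
df = pd.DataFrame(
    [
        ["365 PB 11.11.20", "Fresh Milk Monitoring 2016", "4.08", "8.52", "3.19"],
        ["365 PB 11.11.20", "Fresh Milk Monitoring 2016", "4.06", "8.52", "3.2"],
        ["ID", "Product", "Fat", "SNF", "Protein"],  # repeated header row
        ["465 PB 11.11.20", "Fresh Milk Monitoring 2016", "3.73", "8.81", "3.06"],
        ["465 PB 11.11.20", "Fresh Milk Monitoring 2016", "3.73", "8.81", "3.07"],
    ],
    columns=["ID", "Product", "Fat", "SNF", "Protein"],
)

meta_rows = df.eq(df.columns).all(1)           # True where a row repeats the header
marker = meta_rows.shift(-1, fill_value=True)  # True on the row *before* each header, plus the last row
last_rows = df.loc[marker]
print(last_rows)
```

With this input, `last_rows` contains the row just before the repeated header and the final row of the frame, i.e. the last row of each set.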

You can groupby and take the last row of each group:

df = df.groupby(['ID'], as_index=False).last()
>>> df

ID                  Product                     Fat     SNF     Protein
365 PB 11.11.20     Fresh Milk Monitoring 2016  4.06    8.52    3.2
465 PB 11.11.20     Fresh Milk Monitoring 2016  3.73    8.81    3.07
1465 PB 11.11.20    Fresh Milk Monitoring 2016  4.06    8.52    3.15

If there are unwanted rows left after these operations that contain the column names, add:

df = df[df['ID'] !='ID']

UPDATE: note that although this solution seems straightforward, its performance is around 2 times slower than @Quang Hoang's answer, so it's a tradeoff of readability vs. performance...

I would choose readability because, to me, groupby seems simpler to understand... but it depends on the size of the dataset.
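Putting the two steps of this answer together, a minimal sketch on a hypothetical stand-in frame (string values, with the repeated header row present) might look like:

```python
import pandas as pd

# Stand-in data: a repeated header row ("ID", "Fat") sits between two sets.
df = pd.DataFrame(
    {
        "ID": ["365", "365", "ID", "465", "465"],
        "Fat": ["4.08", "4.06", "Fat", "3.73", "3.73"],
    }
)

df = df[df["ID"] != "ID"]  # drop the repeated header rows first
# sort=False keeps the groups in order of first appearance
out = df.groupby("ID", as_index=False, sort=False).last()
print(out)
```

Note that `sort=False` is an addition here: by default groupby sorts the group keys, which for string IDs like "1465 PB ..." would reorder the result lexicographically.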

You can first group the DataFrame based on ID, then iterate over the groups:

df_grp = df.groupby(by=['ID'])

res = []
for _, group in df_grp:
    imm_df = group.iloc[[-1]]  # double brackets return the last row as a DataFrame
    res.append(imm_df)

final_df = pd.concat(res, axis=0)

You can further change the iloc range to fetch a range of rows if needed.
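For instance, a hypothetical sketch widening the slice from `iloc[[-1]]` to `iloc[-2:]` to fetch the last two rows of each group:

```python
import pandas as pd

# Toy frame with two groups of three rows each.
df = pd.DataFrame({"ID": ["a", "a", "a", "b", "b", "b"],
                   "val": [1, 2, 3, 4, 5, 6]})

res = []
for _, g in df.groupby("ID"):
    res.append(g.iloc[-2:])  # last two rows of this group
final_df = pd.concat(res, axis=0)
print(final_df["val"].tolist())
```

Here `final_df` holds rows with `val` 2, 3 (group "a") and 5, 6 (group "b").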

Perhaps it would be better to first split the CSV file into separate CSV files, one for each "chunk" of data, so that loading each of them with pandas becomes trivial.

This is a possible script to do the splitting, using more_itertools:

import re
import more_itertools as mitt


HEADER_PATTERN = re.compile(r"^ID,Product,Fat,SNF,Protein$")


with open("data.csv") as file:
    lines = iter(file.readline, "")
    chunks = mitt.split_before(lines, HEADER_PATTERN.match)
    for i, chunk in enumerate(chunks):
        with open(f"data{i}.csv", "w") as output:
            output.writelines(chunk)

Replace "data.csv" with the actual filename. This saves each of the chunks in files data0.csv, data1.csv, and so on.
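To see what `split_before` does here, a tiny illustration on an in-memory list of lines (a hypothetical two-set file with a simplified header):

```python
import more_itertools as mitt

# split_before starts a new chunk at every line for which the predicate
# is true -- here, every header line.
lines = ["ID,Fat\n", "365,4.08\n", "ID,Fat\n", "465,3.73\n"]
chunks = list(mitt.split_before(lines, lambda line: line.startswith("ID,")))
print(chunks)
```

Each chunk begins with its own header line, which is why the resulting files load cleanly with `pd.read_csv`.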

Once that's done, you can simply load each of the chunks separately and extract the last row of each:

import itertools
import pandas as pd

# load each chunk
chunks = []
for i in itertools.count():
    try:
        chunk = pd.read_csv(f"data{i}.csv")
        chunks.append(chunk)
    except FileNotFoundError:
        break


# extract the last row of each
last_rows = pd.concat([df.iloc[-1:] for df in chunks])

Then:

>>> last_rows
               ID                     Product   Fat   SNF  Protein
365   PB 11.11.20  Fresh Milk Monitoring 2016  4.06  8.52     3.20
465   PB 11.11.20  Fresh Milk Monitoring 2016  3.73  8.81     3.07
1465  PB 11.11.20  Fresh Milk Monitoring 2016  4.06  8.52     3.15
