
Is there a faster / less RAM-intensive way to pool the data using Python?

[figure: example of the desired pooling] https://kin-phinf.pstatic.net/20221001_267/1664597566757fY2pz_PNG/%C8%AD%B8%E9_%C4%B8%C3%B3_2022-10-01_001049.png?type=w750

I want to pool data like in the figure above, but it takes too much time and uses too much RAM. Can I make it faster and more efficient?

My code is like this:

data = (df.groupby(['Name', 'Age', 'Pet', 'Allergy'])
          .apply(lambda x: pd.Series(range(x['Amount'].squeeze())))
          .reset_index()[['Name', 'Age', 'Pet', 'Allergy']])

This is an abbreviated example, but my actual dataset is 3.5 GB, so it takes a really long time. I wonder if there's another way to do this work faster.

I'd appreciate any help! Thank you!

You could preallocate the final dataframe, then iterate over the original dataframe, reassigning rows in the final one.

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Male", "Female"],
                   "Age": [29, 43],
                   "Pet": ["Cat", "Dog"],
                   "Allergy": ["Negative", "Positive"],
                   "Amount": [2, 4]})

# Pull out the repeat counts and drop them from the frame to be expanded.
amounts = df["Amount"]
df.drop("Amount", axis=1, inplace=True)
total = amounts.sum()

# Preallocate the final dataframe with one slot per repeated row.
new_df = pd.DataFrame(columns=df.columns, index=np.arange(total))
new_index = 0

# Copy each source row into its block of `amount` consecutive slots.
for amount, (_, row) in zip(amounts, df.iterrows()):
    for i in range(new_index, new_index + amount):
        new_df.iloc[i] = row
    new_index += amount

# Free the intermediates; only the expanded frame is kept in memory.
del df, amounts, row

print(new_df)
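If the row-by-row loop is still too slow on a large dataset, a fully vectorized alternative is to repeat each row's index label `Amount` times with `Index.repeat` and select those rows in one shot. A minimal sketch using the same toy frame as above:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Male", "Female"],
                   "Age": [29, 43],
                   "Pet": ["Cat", "Dog"],
                   "Allergy": ["Negative", "Positive"],
                   "Amount": [2, 4]})

# Repeat each index label "Amount" times, select those rows,
# then drop the count column and rebuild a clean RangeIndex.
new_df = (df.loc[df.index.repeat(df["Amount"])]
            .drop(columns="Amount")
            .reset_index(drop=True))

print(new_df)
```

This pushes the expansion down into pandas/NumPy instead of a Python-level loop, which should be both faster and lighter on memory for a 3.5 GB input.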
