I want to expand my data as in the figure above (repeating each row by its Amount value), but it takes too much time and RAM. Can I make it faster and more efficient?
My code looks like this:
data = df.groupby(['Name', 'Age', 'Pet', 'Allergy']).apply(lambda x: pd.Series(range(x['Amount'].squeeze()))).reset_index()
data = df.groupby(['Name', 'Age', 'Pet', 'Allergy']).apply(lambda x: pd.Series(range(x['Amount'].squeeze()))).reset_index()[['Name', 'Age', 'Pet', 'Allergy']]
This is an abbreviated form; my actual dataset is 3.5GB, so it takes a really long time. I wonder if there is any way to do this faster.
I'd appreciate any help! Thank you!
You could preallocate the final DataFrame, then iterate over the original DataFrame, assigning each row into the preallocated one.
import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Male", "Female"],
                   "Age": [29, 43], "Pet": ["Cat", "Dog"],
                   "Allergy": ["Negative", "Positive"],
                   "Amount": [2, 4]})

# Separate the repeat counts from the data columns.
amounts = df["Amount"]
df.drop("Amount", axis=1, inplace=True)

# Preallocate the output with one row per repetition.
counts = amounts.sum()
new_df = pd.DataFrame(columns=df.columns, index=np.arange(counts))

# Copy each source row into its block of output rows.
new_index = 0
for amount, (_, row) in zip(amounts, df.iterrows()):
    for i in range(new_index, new_index + amount):
        new_df.iloc[i] = row
    new_index = new_index + amount

del df, amounts, row
print(new_df)
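For large data, a vectorized alternative may be worth trying: pandas can repeat row labels with Index.repeat and then select those rows with .loc, which avoids the Python-level loop entirely. This is a sketch under the assumption that your DataFrame has an Amount column of non-negative integers, as in the example above:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Male", "Female"],
                   "Age": [29, 43], "Pet": ["Cat", "Dog"],
                   "Allergy": ["Negative", "Positive"],
                   "Amount": [2, 4]})

# Repeat each index label Amount times, select those rows,
# then drop the Amount column and renumber the result.
expanded = (df.loc[df.index.repeat(df["Amount"])]
              .drop(columns="Amount")
              .reset_index(drop=True))
print(expanded)
```

Here the Male row appears twice and the Female row four times, matching the loop-based result, but the repetition happens inside pandas/NumPy rather than row by row in Python.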