I have a large data frame that contains more than a million records. I want to split it into four chunks based on customer ID, so I tried the following script.
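# Unique customer IDs, in order of first appearance
unique_customer = df['Customer'].unique()
n = len(unique_customer)
# Integer division: slice bounds must be ints (n / 4 is a float in Python 3)
a = n // 4
s1 = a
s2 = s1 + a
s3 = s2 + a
s4 = n
# Four contiguous blocks of customer IDs; the last block takes the remainder
l1 = unique_customer[:s1]
l2 = unique_customer[s1:s2]
l3 = unique_customer[s2:s3]
l4 = unique_customer[s3:s4]
# One full isin() scan of the frame per chunk
df1 = df[df['Customer'].isin(l1)]
df2 = df[df['Customer'].isin(l2)]
df3 = df[df['Customer'].isin(l3)]
df4 = df[df['Customer'].isin(l4)]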
It works fine for a small data set, but it takes a long time on a large one. Is there a faster alternative?
Sample data:
Input:
Customer val1
0 a1 112
1 a2 2
2 a1 11
3 a3 154
4 a4 76
5 a5 12
6 a2 6
7 a4 7
8 a6 33
9 a5 67
10 a3 121
11 a5 21
12 a5 77
13 a4 3
14 a7 21
15 a5 65
16 a6 98
17 a8 45
18 a9 12
Output:
df1
Customer val1
0 a1 112
1 a2 2
2 a1 11
6 a2 6
df2
Customer val1
3 a3 154
4 a4 76
7 a4 7
10 a3 121
13 a4 3
df3
Customer val1
5 a5 12
8 a6 33
9 a5 67
11 a5 21
12 a5 77
15 a5 65
16 a6 98
df4
Customer val1
14 a7 21
17 a8 45
18 a9 12
Since you gave no data, I'm going to try to give you an answer I haven't tested.
groups = df.groupby('Customer').ngroup() % 4
df1, df2, df3, df4 = (g for _, g in df.groupby(groups))
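To make the grouping key concrete, here is a small sketch on a made-up frame (the data and the names `group_number` and `chunks` are assumptions for illustration only). `ngroup()` numbers each customer by sorted order, and `% 4` then deals those numbers out round-robin, so customers are interleaved across the chunks rather than split into contiguous blocks.

```python
import pandas as pd

# Hypothetical miniature frame, not the asker's real data
df = pd.DataFrame({
    'Customer': ['a1', 'a2', 'a1', 'a3', 'a4', 'a5', 'a2'],
    'val1': [112, 2, 11, 154, 76, 12, 6],
})

# One integer per row: the position of that row's customer among the sorted customers
group_number = df.groupby('Customer').ngroup()

# Round-robin assignment of customers to four buckets
groups = group_number % 4

# Grouping by the bucket keeps all rows of a customer in the same chunk
chunks = [g for _, g in df.groupby(groups)]

for chunk in chunks:
    print(chunk, end='\n\n')
```

Here `a1` and `a5` land in the same chunk because their group numbers (0 and 4) are equal modulo 4.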
Though the above works, in order to match your output I had to use pd.qcut on np.arange over the number of unique customers.
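import numpy as np
import pandas as pd

# f: integer code per row (order of first appearance), u: the unique customers
f, u = pd.factorize(df['Customer'])
# Quartile edges over the customer codes 0..len(u)-1
q, b = pd.qcut(np.arange(len(u)), 4, retbins=True)
# Push the last edge up so the highest code falls inside the final left-closed bin
b[-1] += 1
groups = pd.cut(f, b, labels=False, right=False)
df1, df2, df3, df4 = (g for _, g in df.groupby(groups))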
print(df1, df2, df3, df4, sep='\n'*2)
Customer val1
0 a1 112
1 a2 2
2 a1 11
6 a2 6
Customer val1
3 a3 154
4 a4 76
7 a4 7
10 a3 121
13 a4 3
Customer val1
5 a5 12
8 a6 33
9 a5 67
11 a5 21
12 a5 77
15 a5 65
16 a6 98
Customer val1
14 a7 21
17 a8 45
18 a9 12
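If the goal is just to reproduce the question's own rule (n // 4 customers per chunk, with the remainder going to the last chunk), the same idea can be sketched with plain integer arithmetic on the factorized codes. This is an untested-on-real-data sketch, not part of the answer above: the helper name `split_by_customer` and its parameters are hypothetical, and it assumes the column has no missing values and at least as many unique customers as chunks.

```python
import numpy as np
import pandas as pd

def split_by_customer(df, column, n_chunks):
    """Hypothetical helper: split df into n_chunks frames, keeping each customer whole."""
    # codes: one integer per row, numbered by order of first appearance
    codes, uniques = pd.factorize(df[column])
    per_chunk = len(uniques) // n_chunks
    if per_chunk == 0:
        raise ValueError("fewer unique values than chunks")
    # Contiguous blocks of per_chunk customers; leftovers are folded into the last chunk
    chunk_ids = np.minimum(codes // per_chunk, n_chunks - 1)
    return [g for _, g in df.groupby(chunk_ids)]

# The sample data from the question
df = pd.DataFrame({
    'Customer': ['a1', 'a2', 'a1', 'a3', 'a4', 'a5', 'a2', 'a4', 'a6', 'a5',
                 'a3', 'a5', 'a5', 'a4', 'a7', 'a5', 'a6', 'a8', 'a9'],
    'val1': [112, 2, 11, 154, 76, 12, 6, 7, 33, 67,
             121, 21, 77, 3, 21, 65, 98, 45, 12],
})

df1, df2, df3, df4 = split_by_customer(df, 'Customer', 4)
print(df1, df2, df3, df4, sep='\n' * 2)
```

Like the pd.cut version, this makes a single pass over the frame instead of one isin() scan per chunk.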