Efficient way to split a pandas DataFrame

I have a large data frame containing more than a million records. I want to split it into four chunks based on customer ID, so I tried the following script.

unique_customer=df['Customer'].unique()
n=len(unique_customer)

# integer division: in Python 3, n/4 is a float and cannot be used as a slice index
a=n//4
s1=a
s2=s1+a
s3=s2+a
s4=n

l1= unique_customer[:s1]
l2= unique_customer[s1:s2]
l3= unique_customer[s2:s3]
l4= unique_customer[s3:s4]

df1=df[df['Customer'].isin(l1)]
df2=df[df['Customer'].isin(l2)]
df3=df[df['Customer'].isin(l3)]
df4=df[df['Customer'].isin(l4)]

This works fine for a smaller data set, but it takes a long time on a large one. Is there a faster alternative?

Sample data

Input:

   Customer  val1
0        a1   112
1        a2     2
2        a1    11
3        a3   154
4        a4    76
5        a5    12
6        a2     6
7        a4     7
8        a6    33
9        a5    67
10       a3   121
11       a5    21
12       a5    77
13       a4     3
14       a7    21
15       a5    65
16       a6    98
17       a8    45
18       a9    12

Output:

df1

  Customer  val1
0       a1   112
1       a2     2
2       a1    11
6       a2     6

df2

   Customer  val1
3        a3   154
4        a4    76
7        a4     7
10       a3   121
13       a4     3

df3

   Customer  val1
5        a5    12
8        a6    33
9        a5    67
11       a5    21
12       a5    77
15       a5    65
16       a6    98

df4

   Customer  val1
14       a7    21
17       a8    45
18       a9    12

Since you gave no data, I'm going to try to give you an answer I haven't tested.

groups = df.groupby('Customer').ngroup() % 4
df1, df2, df3, df4 = (g for _, g in df.groupby(groups))
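To see what this assignment does, here is a minimal sketch run on the sample data from the question. It assumes pandas' default behavior of numbering groups in sorted key order; the `chunks` list is just an illustrative name. Customers are dealt into chunks round-robin, so each chunk holds non-adjacent customers.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Customer": ["a1", "a2", "a1", "a3", "a4", "a5", "a2", "a4", "a6", "a5",
                 "a3", "a5", "a5", "a4", "a7", "a5", "a6", "a8", "a9"],
    "val1": [112, 2, 11, 154, 76, 12, 6, 7, 33, 67,
             121, 21, 77, 3, 21, 65, 98, 45, 12],
})

# ngroup() gives every row the integer number of its customer (0..8 here);
# taking it modulo 4 deals the customers into four chunks round-robin.
groups = df.groupby("Customer").ngroup() % 4
chunks = [g for _, g in df.groupby(groups)]

for i, chunk in enumerate(chunks):
    print(i, sorted(set(chunk["Customer"])))
```

Every customer still lands in exactly one chunk, but the first chunk holds a1, a5 and a9 rather than a1 and a2.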

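The above works, but it assigns customers to chunks round-robin rather than in contiguous blocks. To match your output, I had to use pd.qcut on np.arange over the number of unique customers to get the bin edges, then pd.cut to bin the factorized customer codes.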

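import numpy as np
import pandas as pd

f, u = pd.factorize(df['Customer'])  # f: integer code per row, u: unique customers
q, b = pd.qcut(np.arange(len(u)), 4, retbins=True)  # b: edges of 4 quantile bins over the codes
b[-1] += 1  # widen the last edge so the highest code falls inside the final half-open bin
groups = pd.cut(f, b, labels=False, right=False)
df1, df2, df3, df4 = (g for _, g in df.groupby(groups))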

print(df1, df2, df3, df4, sep='\n'*2)

  Customer  val1
0       a1   112
1       a2     2
2       a1    11
6       a2     6

   Customer  val1
3        a3   154
4        a4    76
7        a4     7
10       a3   121
13       a4     3

   Customer  val1
5        a5    12
8        a6    33
9        a5    67
11       a5    21
12       a5    77
15       a5    65
16       a6    98

   Customer  val1
14       a7    21
17       a8    45
18       a9    12
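As a possible alternative, the same contiguous split can be sketched with plain integer arithmetic on the factorized codes. This is not from the answer above: `split_by_key` is a hypothetical helper, and it assumes the key column has no missing values and at least `k` distinct values. It mirrors the question's slicing, where the first three chunks get `n // 4` customers each and the last takes the remainder.

```python
import numpy as np
import pandas as pd

def split_by_key(df, key, k=4):
    """Hypothetical helper: split df into k frames, keeping each key value whole."""
    # codes are integers assigned in order of first appearance of each key value
    codes, uniques = pd.factorize(df[key])
    size = max(len(uniques) // k, 1)             # key values per chunk, as in n // 4
    chunk_id = np.minimum(codes // size, k - 1)  # leftover key values go to the last chunk
    return [df[chunk_id == i] for i in range(k)]

# Sample data from the question.
df = pd.DataFrame({
    "Customer": ["a1", "a2", "a1", "a3", "a4", "a5", "a2", "a4", "a6", "a5",
                 "a3", "a5", "a5", "a4", "a7", "a5", "a6", "a8", "a9"],
    "val1": [112, 2, 11, 154, 76, 12, 6, 7, 33, 67,
             121, 21, 77, 3, 21, 65, 98, 45, 12],
})

df1, df2, df3, df4 = split_by_key(df, "Customer")
print(df1, df2, df3, df4, sep="\n" * 2)
```

The integer comparison `chunk_id == i` is a cheap vectorized operation, so it avoids the repeated `isin` lookups from the question. I have not benchmarked it on a million-row frame.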
