python - fastest time to populate unique values row-wise for 8 million rows dataframe
I have a dataframe named report1 with 14 columns and 8 million rows. What I would like to do is get the unique values in each row from column 3 through column 8, and put each row's result into a new dataframe named df.
report1 (source data) looks like this:
Ticket Number  Col0  Col1  Col2  Col3  Col4  Col5  Col6  Col7  Col8
100            21    30    32    3     4     6     1     5     0
101            4     9     25    3     4     6     1     5     4
102            45    33    11    3     4     6     1     5     3
…              …     …     …     …     …     …     …     …     …
8000000        12    5     28    3     4     6     1     5     11
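For experimenting, the four visible sample rows can be rebuilt as a small dataframe. This is only a sketch: the column names are taken from the table header, the elided rows are not reproduced, and the real report1 reportedly has 14 columns, so the column positions used here are an assumption that holds only for this 10-column sample.

```python
import pandas as pd

# Sample rows copied from the table; the elided middle rows are left out.
columns = ["Ticket Number", "Col0", "Col1", "Col2", "Col3",
           "Col4", "Col5", "Col6", "Col7", "Col8"]
rows = [
    [100, 21, 30, 32, 3, 4, 6, 1, 5, 0],
    [101, 4, 9, 25, 3, 4, 6, 1, 5, 4],
    [102, 45, 33, 11, 3, 4, 6, 1, 5, 3],
    [8000000, 12, 5, 28, 3, 4, 6, 1, 5, 11],
]
report1 = pd.DataFrame(rows, columns=columns)

# In this 10-column layout, Col3..Col8 sit at positions 4 through 9.
subset = report1.iloc[:, 4:10]
print(subset)
```

In this sample, rows 101 and 102 each repeat an earlier value in Col8 (4 and 3 respectively), which is why those cells are expected to become NaN.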
df (new dataframe) should look like this:
Ticket Number  Col0  Col1  Col2  Col3  Col4  Col5  Col6  Col7  Col8
100            21    30    32    3     4     6     1     5     0
101            4     9     25    3     4     6     1     5     nan
102            45    33    11    3     4     6     1     5     nan
…              …     …     …     …     …     …     …     …     …
8000000        12    5     28    3     4     6     1     5     11
So far I have been able to get what I want with the simple script below, but it takes too long to run, even when I tried running it on the pythonanywhere platform.
Does anyone know how to get this done in the shortest possible time?
The script is as follows:
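import pandas as pd

result = []
# Loop over every row (slow: one Python-level iloc call per row).
for i in range(len(report1)):
    # Unique values of the selected columns, in order of first appearance.
    g = pd.unique(report1.iloc[i, 7:13].values.ravel())
    arr_list = g.tolist()
    result.append(arr_list)
# Rows with fewer unique values are padded with NaN on the right.
df = pd.DataFrame(result)
df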
You need numpy:
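import numpy as np
import pandas as pd

# Columns to de-duplicate, as a 2D array (one row per ticket).
data = report1.iloc[:, 4:10].values
# Per-row ordering that sorts each row.
sort_idx = np.argsort(data, axis=1)
# Offset of each row's first element in the flattened array.
row_offset = data.shape[1] * np.arange(data.shape[0])[:, None]
sort_lin_idx = sort_idx[:, 1:] + row_offset
# A zero difference between sorted neighbours marks a repeated value.
dup_lin_idx = sort_lin_idx[np.diff(np.sort(data, axis=1), axis=1) == 0]
# Convert to float so the repeats can be replaced by NaN.
a = data.ravel().astype(float)
a[dup_lin_idx] = np.nan
data = a.reshape(len(data), -1)
print(pd.DataFrame(data))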
     0    1    2    3    4     5
0  3.0  4.0  6.0  1.0  5.0   0.0
1  3.0  4.0  6.0  1.0  5.0   NaN
2  3.0  4.0  6.0  1.0  5.0   NaN
3  3.0  4.0  6.0  1.0  5.0  11.0
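To see why the sort-and-diff trick works, it can help to trace it on a tiny array. The sketch below uses a made-up 2x4 array (not data from the question). It also passes kind="stable" to argsort and uses np.take_along_axis to build the sorted rows; both are my additions rather than part of the answer. With a stable sort, equal values keep their original left-to-right order, so the first occurrence in each row is the one that survives.

```python
import numpy as np

# Made-up example data: row 0 repeats 3, row 1 repeats 5 twice.
data = np.array([[3, 4, 3, 1],
                 [5, 5, 2, 5]])

# Column order that sorts each row; stable, so ties keep original order.
sort_idx = np.argsort(data, axis=1, kind="stable")
sorted_data = np.take_along_axis(data, sort_idx, axis=1)

# True where a sorted value equals its left neighbour, i.e. it is a repeat.
is_dup = np.diff(sorted_data, axis=1) == 0

# Map each repeat back to its position in the flattened array.
row_offset = data.shape[1] * np.arange(data.shape[0])[:, None]
dup_lin_idx = (sort_idx[:, 1:] + row_offset)[is_dup]

# Integers cannot hold NaN, so work on a float copy.
out = data.ravel().astype(float)
out[dup_lin_idx] = np.nan
out = out.reshape(data.shape)
print(out)
```

Row 0 keeps its first 3 and loses the second; row 1 keeps only its first 5. The whole computation is a handful of array operations, with no Python-level loop over rows.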
Timings:
In [117]: %timeit (orig(report1))
1 loop, best of 3: 7.48 s per loop
In [118]: %timeit (jez1(report1))
1 loop, best of 3: 4.82 s per loop
In [119]: %timeit (jez2(report1))
100 loops, best of 3: 9.57 ms per loop
Code for timings:
# [40000 rows x 6 columns]
report1 = pd.concat([report1] * 10000).reset_index(drop=True)

def orig(df):
    # Row-by-row loop from the question.
    result = []
    for i in range(len(df.index)):
        g = pd.unique(df.iloc[i, 4:10].values.ravel())
        arr_list = g.tolist()
        result.append(arr_list)
    return pd.DataFrame(result)

def jez1(df):
    # pandas-only version: mask values already seen earlier in the same row.
    df = df.iloc[:, 4:10]
    return df.where(~df.apply(pd.Series.duplicated, axis=1), np.nan)

def jez2(report1):
    # Vectorized NumPy version: sort each row, flag equal neighbours.
    data = report1.iloc[:, 4:10].values
    sort_idx = np.argsort(data, axis=1)
    row_offset = data.shape[1] * np.arange(data.shape[0])[:, None]
    sort_lin_idx = sort_idx[:, 1:] + row_offset
    dup_lin_idx = sort_lin_idx[np.diff(np.sort(data, axis=1), axis=1) == 0]
    a = data.ravel().astype(float)
    a[dup_lin_idx] = np.nan
    data = a.reshape(len(data), -1)
    return pd.DataFrame(data)

print(orig(report1))
print(jez1(report1))
print(jez2(report1))
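Before trusting the fast version on 8 million rows, it may be worth checking it against a slow but obvious reference on random data. The sketch below is self-contained; the helper names mask_row_duplicates and reference are hypothetical, the random test data is made up, and kind="stable" is my addition to make the "keep the first occurrence" behaviour explicit.

```python
import numpy as np

def mask_row_duplicates(values):
    """NumPy sort-and-diff approach, using a stable sort (an assumption)."""
    values = np.asarray(values)
    sort_idx = np.argsort(values, axis=1, kind="stable")
    sorted_vals = np.take_along_axis(values, sort_idx, axis=1)
    is_dup = np.diff(sorted_vals, axis=1) == 0
    row_offset = values.shape[1] * np.arange(values.shape[0])[:, None]
    flat = values.ravel().astype(float)
    flat[(sort_idx[:, 1:] + row_offset)[is_dup]] = np.nan
    return flat.reshape(values.shape)

def reference(values):
    """Plain Python loops: NaN out any value already seen in its row."""
    out = np.asarray(values, dtype=float).copy()
    for row in out:  # each row is a view, so writes go into `out`
        seen = set()
        for j, v in enumerate(row):
            if v in seen:
                row[j] = np.nan
            else:
                seen.add(v)
    return out

rng = np.random.default_rng(0)
sample = rng.integers(0, 8, size=(1000, 6))  # small range forces repeats
fast = mask_row_duplicates(sample)
slow = reference(sample)
np.testing.assert_array_equal(fast, slow)  # NaNs in equal positions match
print("results agree on", len(sample), "rows")
```

Note that the loop from the question packs the unique values to the left, so its output matches the NaN-masked versions only when the repeated value sits at the end of the row, as it does in the sample data.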