Using something other than itertuples for pandas for speed

I am iterating over a medium-sized dataset (441679 rows), and pandas does a miserable job of iterating over it. This function alone takes anywhere from almost 60 seconds up to 10 minutes:
```python
def correlate(ES, PERF, JAVA_PID_CORE):
    mean_results = pd.DataFrame(columns=['COREID','UID','TID','START','FINISH','TIMETAKEN','INST_M','BRANCH_K','L1I_ACCESS_M','L1D_RACCESS_M','L2D_ACCESS_M','DATA_MEM_RACCESS_M'])
    sum_results = pd.DataFrame(columns=['COREID','UID','TID','START','FINISH','TIMETAKEN','INST_M','BRANCH_K','L1I_ACCESS_M','L1D_RACCESS_M','L2D_ACCESS','DATA_MEM_RACCESS_M'])
    for row in ES.itertuples():
        REQUEST = row[1]
        THREAD = row[2]
        INIT = row[3]
        FIN = row[4]
        TIME_TO_COMPLETE = row[5]
        PID = JAVA_PID_CORE[THREAD][0]
        TMP_PERF = PERF.loc[PERF['pid'] == PID]
        TEST_DF = TMP_PERF[TMP_PERF['timestamp'].between(INIT, FIN, inclusive=True)]
        if not TEST_DF.empty and TIME_TO_COMPLETE < 100.0:
            mean_results.loc[len(mean_results)] = [core_check(JAVA_PID_CORE[THREAD][1]), REQUEST, THREAD, INIT, FIN, FIN-INIT, TEST_DF['INST_RETIRED'].mean()/1000000.0, TEST_DF['BRANCH_MISPRED'].mean()/1000.0, TEST_DF['L1I_CACHE_ACCESS'].mean()/1000000.0, TEST_DF['L1D_READ_ACCESS'].mean()/1000000.0, TEST_DF['L2D_CACHE_ACCESS'].mean()/1000000.0, TEST_DF['DATA_MEM_READ_ACCESS'].mean()/1000000.0]
            sum_results.loc[len(sum_results)] = [core_check(JAVA_PID_CORE[THREAD][1]), REQUEST, THREAD, INIT, FIN, FIN-INIT, TEST_DF['INST_RETIRED'].sum()/1000000.0, TEST_DF['BRANCH_MISPRED'].sum()/1000.0, TEST_DF['L1I_CACHE_ACCESS'].sum()/1000000.0, TEST_DF['L1D_READ_ACCESS'].sum()/1000000.0, TEST_DF['L2D_CACHE_ACCESS'].sum()/1000000.0, TEST_DF['DATA_MEM_READ_ACCESS'].sum()/1000000.0]
    return mean_results, sum_results
```
`core_check` is a simple if check:

```python
def core_check(ID):
    if ID == 0.0 or ID == 1.0:
        return "b"
    else:
        return "r"
```
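Since `core_check` only tests whether the core id is 0 or 1, the one-call-per-row pattern can be avoided when the ids sit in a whole column; a minimal vectorized sketch with numpy (the `core_ids` Series here is made-up example data, not from the question):

```python
import numpy as np
import pandas as pd

# made-up Series of core IDs, floats as in core_check
core_ids = pd.Series([0.0, 1.0, 4.0, 5.0])

# vectorized equivalent of core_check: cores 0 and 1 -> "b", everything else -> "r"
colors = np.where(core_ids.isin([0.0, 1.0]), "b", "r")
```

`np.where` evaluates the whole column at once instead of making one Python-level function call per row.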
Any optimizations or optimization hints would be much appreciated.

Some more info: in the ES DataFrame I will always find a unique ID (UID) with its timestamps, coming from each thread (TID). With this timing information I want to look up the corresponding column values in the PERF DataFrame and do some basic math on them (sum, mean, etc.).
`JAVA_PID_CORE`:

```python
{80: [2690, 5], 81: [2691, 4], 83: [2693, 3], 84: [2694, 2], 85: [2695, 1], 93: [3137, 0]}
```
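Since `JAVA_PID_CORE` is a plain dict keyed by TID, the per-row dict lookups can be turned into a lookup table once and mapped onto the whole `TID` column; a sketch with a toy `ES` (the `core_id` column name is my own, not from the question):

```python
import pandas as pd

JAVA_PID_CORE = {80: [2690, 5], 81: [2691, 4], 83: [2693, 3],
                 84: [2694, 2], 85: [2695, 1], 93: [3137, 0]}

ES = pd.DataFrame({'TID': [84, 83, 85, 93]})  # toy stand-in for the real ES

# one lookup table, indexed by TID
lookup = pd.DataFrame.from_dict(JAVA_PID_CORE, orient='index',
                                columns=['pid', 'core_id'])

# Series.map aligns on the lookup index, so every TID gets its pid/core id in one pass
ES['pid'] = ES['TID'].map(lookup['pid'])
ES['core_id'] = ES['TID'].map(lookup['core_id'])
```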
The ES DataFrame:
```
UID TID TSTAMP-INIT TSTAMP-FIN DIFF
0 !!KA 84 1494831924775 1494831925061 286
1 !#f) 83 1494831906419 1494831906446 27
2 !&YV 85 1494831920413 1494831920426 13
3 !)}{ 85 1494831926591 1494831926598 7
4 !*$W 93 1494831927342 1494831927347 5
5 !*3+ 93 1494833162404 1494833162447 43
6 !,{Q 85 1494831941291 1494831941293 2
7 !-ap 93 1494831946108 1494831946164 56
8 !.<H 93 1494831961861 1494831961887 26
9 !/Jk 93 1494832464581 1494832464585 4
10 !/k: 80 1494831913852 1494831913956 104
11 !1)6 80 1494832700278 1494832700284 6
12 !4o5 81 1494831926623 1494831926638 15
13 !6Wz 85 1494832936660 1494832936679 19
14 !7xl 83 1494831940012 1494831940423 411
15 !8~j 80 1494831905562 1494831905668 106
16 !:/# 83 1494831932570 1494831932670 100
17 !:Vb 84 1494831930895 1494831931047 152
18 !=FY 93 1494831964176 1494831964190 14
19 !@F} 83 1494831919131 1494831919170 39
20 !@Pr 81 1494831927099 1494831927106 7
21 !@Y& 85 1494831949397 1494831949458 61
22 !BY* 85 1494831953127 1494831953151 24
23 !D/5 85 1494831950950 1494831950956 6
24 !D>. 93 1494831954029 1494831954041 12
25 !DY@ 93 1494831933042 1494831933130 88
26 !No7 80 1494832598080 1494832598087 7
27 !O~t 93 1494831958937 1494831958964 27
28 !Pr$ 93 1494831956491 1494831956521 30
29 !UlC 85 1494831905536 1494831905539 3
```
`TEST_DF`:

```
timestamp pid INST_RETIRED BRANCH_MISPRED L1I_CACHE_ACCESS \
10244 1494831924777 2694 8451572 84144 5859557
10250 1494831924797 2694 7793034 16479 4532358
10256 1494831924817 2694 9711538 5354 5479005
10262 1494831924838 2694 9417459 6447 5322698
10268 1494831924858 2694 5827656 5117 3312970
10274 1494831924878 2694 9752178 5781 5531895
10280 1494831924899 2694 9627616 5503 5440153
10286 1494831924919 2694 9680190 5305 5487293
10292 1494831924940 2694 10195290 5477 5762275
10298 1494831924961 2694 8258304 5837 4681574
10304 1494831924981 2694 9668057 7684 5447864
10310 1494831925001 2694 9676702 7085 5426614
10316 1494831925022 2694 9784358 7122 5505523
10322 1494831925042 2694 9081244 10005 5146579
```
PERF CSV dump:

```
timestamp,pid,INST_RETIRED,BRANCH_MISPRED,L1I_CACHE_ACCESS,L1D_READ_ACCESS,L2D_CACHE_ACCESS,DATA_MEM_READ_ACCESS
1494831906349,3137,29998089,18347,8765597,8004347,372144,8003127,
1494831906350,2695,29794795,16212,8559232,8431582,425171,8430788,
1494831906350,2694,6030818,22909,3737737,0,245017,0,
1494831906350,2693,6146912,9282,3531230,0,186687,0,
1494831906350,2691,6654263,6256,3806089,0,91580,0,
1494831906350,2690,6235079,16255,3700410,0,199919,0,
1494831906370,3137,10177539,52101,3930006,2660383,563205,2657417,
1494831906370,2695,26730045,23757,7939065,7177029,430927,7175600,
1494831906370,2694,4835318,48355,3394955,0,354923,0,
1494831906370,2693,6188160,8343,3524848,0,172268,0,
1494831906370,2691,6579932,6936,3746719,0,97691,0,
1494831906370,2690,5339960,42454,3553089,0,323373,0,
1494831906390,3137,22703263,74115,8000304,6295892,926318,6300728,
1494831906391,2695,24147175,76240,8193916,6787613,849869,6789710,
1494831906391,2694,7059747,46404,4567632,0,395898,0,
1494831906391,2693,8378296,13639,4796995,0,242115,0,
1494831906391,2691,9031591,11004,5124851,0,149132,0,
1494831906391,2690,5986551,69506,4330165,0,553982,0,
1494831906411,3137,12902656,52133,4746982,3564058,570613,3559191,
1494831906411,2695,23827520,12880,6706918,6908672,357731,6908573,
```
`ES.info()`:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42532 entries, 0 to 42531
Data columns (total 5 columns):
UID 42532 non-null object
TID 42532 non-null int64
TSTAMP-INIT 42532 non-null int64
TSTAMP-FIN 42532 non-null int64
DIFF 42532 non-null int64
dtypes: int64(4), object(1)
memory usage: 1.6+ MB
None
```
`PERF.info()`:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440117 entries, 1 to 440117
Data columns (total 8 columns):
timestamp 440117 non-null int64
pid 440117 non-null int64
PERF_COUNT_SW_TASK_CLOCK 440117 non-null int64
PERF_COUNT_SW_PAGE_FAULTS 440117 non-null int64
PERF_COUNT_SW_CONTEXT_SWITCHES 440117 non-null int64
PERF_COUNT_SW_CPU_MIGRATIONS 440117 non-null int64
PERF_COUNT_SW_PAGE_FAULTS_MIN 440117 non-null int64
PERF_COUNT_SW_PAGE_FAULTS_MAJ 440117 non-null int64
dtypes: int64(8)
memory usage: 26.9 MB
None
```
`PERF['pid'].value_counts()`:

```
2694 73353
2695 73353
2693 73353
2690 73353
2691 73353
3137 73352
Name: pid, dtype: int64
```
You can work with a join. This works:

```python
ES['core'] = ES['TID'].apply(lambda x: core_check(JAVA_PID_CORE[x][1]))
ES['pid'] = ES['TID'].apply(lambda x: JAVA_PID_CORE[x][0])
```

If the ES index is unique, this is not necessary; then `ES = ES.reset_index()` works just as well as

```python
ES['es_id'] = range(len(ES))
```

or

```python
ES.insert(0, 'es_id', range(len(ES)))
```

depending on whether the position matters.
Then join `ES` and `PERF`:

```python
TMP_PERF = pd.merge(PERF, ES, on='pid')
```

which yields a 108 x 15 DataFrame on the sample data. Then select the correct rows:
```python
TEST_DF = TMP_PERF[(TMP_PERF['TSTAMP-INIT'] < TMP_PERF['timestamp'] + 100) & (TMP_PERF['timestamp'] < TMP_PERF['TSTAMP-FIN'] + 100) & (TMP_PERF['DIFF'] < 100)]
```

Here I changed from `init < timestamp < fin` to two chained comparisons. I added the `+ 100` because otherwise no rows were returned for the sample dataset.
```
timestamp pid INST_RETIRED BRANCH_MISPRED L1I_CACHE_ACCESS L1D_READ_ACCESS L2D_CACHE_ACCESS DATA_MEM_READ_ACCESS UID TID TSTAMP-INIT TSTAMP-FIN DIFF es_id
78 1494831906350 2693 6146912 9282 3531230 0 186687 0 !#f) 83 1494831906419 1494831906446 27 1
82 1494831906370 2693 6188160 8343 3524848 0 172268 0 !#f) 83 1494831906419 1494831906446 27 1
86 1494831906391 2693 8378296 13639 4796995 0 242115 0 !#f) 83 1494831906419 1494831906446 27 1
```
```python
results_df_sum = TEST_DF[['es_id', 'INST_RETIRED', 'BRANCH_MISPRED', 'L1I_CACHE_ACCESS', 'L1D_READ_ACCESS', 'L2D_CACHE_ACCESS', 'DATA_MEM_READ_ACCESS']].groupby('es_id').sum().reset_index()
results_df_mean = TEST_DF[['es_id', 'INST_RETIRED', 'BRANCH_MISPRED', 'L1I_CACHE_ACCESS', 'L1D_READ_ACCESS', 'L2D_CACHE_ACCESS', 'DATA_MEM_READ_ACCESS']].groupby('es_id').mean().reset_index()
final_result_sum = pd.merge(ES, results_df_sum, on='es_id', how='inner')
final_result_mean = pd.merge(ES, results_df_mean, on='es_id', how='inner')
```
`final_result_sum`:

```
UID TID TSTAMP-INIT TSTAMP-FIN DIFF core pid es_id INST_RETIRED BRANCH_MISPRED L1I_CACHE_ACCESS L1D_READ_ACCESS L2D_CACHE_ACCESS DATA_MEM_READ_ACCESS
0 !#f) 83 1494831906419 1494831906446 27 r 2693 1 20713368 31264 11853073 0 601070 0
```
`final_result_mean`:

```
UID TID TSTAMP-INIT TSTAMP-FIN DIFF core pid es_id INST_RETIRED BRANCH_MISPRED L1I_CACHE_ACCESS L1D_READ_ACCESS L2D_CACHE_ACCESS DATA_MEM_READ_ACCESS
0 !#f) 83 1494831906419 1494831906446 27 r 2693 1 6904456.0 10421.333333 3.951024e+06 0.0 200356.666667 0.0
```
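The two separate groupbys for sum and mean can also be collapsed into a single pass with `.agg(['sum', 'mean'])`; a sketch on toy data (only one counter column shown, values made up):

```python
import pandas as pd

# toy stand-in for TEST_DF with one counter column
TEST_DF = pd.DataFrame({
    'es_id': [1, 1, 1, 2, 2],
    'INST_RETIRED': [10, 20, 30, 40, 60],
})

# one pass over the groups yields both aggregates side by side
agg = TEST_DF.groupby('es_id')['INST_RETIRED'].agg(['sum', 'mean'])
```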
Instead of spending a lot of time and memory on a huge DataFrame and selecting afterwards, you can reduce the memory footprint by working with minimal DataFrames, doing the selection there, and joining with the initial DataFrames again afterwards.

Here I replaced the column name `initial_row_es` with `es_id` and added a `perf_id` column. If the indices of `PERF` and `ES` are unique, this is not necessary and you can use the index instead of these extra columns.
```python
ES = pd.read_csv(StringIO(ES_str), sep='\s+')  # or your way of getting this DataFrame
ES['core'] = ES['TID'].apply(lambda x: core_check(JAVA_PID_CORE[x][1]))
ES['pid'] = ES['TID'].apply(lambda x: JAVA_PID_CORE[x][0])
ES.insert(0, 'es_id', range(len(ES)))

PERF = pd.read_csv(StringIO(PERF_str)).dropna(how='all', axis=1)
PERF['perf_id'] = range(len(PERF))

es_min = ES[['es_id', 'pid', 'TSTAMP-INIT', 'TSTAMP-FIN', 'DIFF']]
perf_min = PERF[['perf_id', 'pid', 'timestamp']]
df_min = pd.merge(perf_min, es_min, on='pid')
df_min2 = df_min[(df_min['TSTAMP-INIT'] < df_min['timestamp'] + 100) & (df_min['timestamp'] < df_min['TSTAMP-FIN'] + 100) & (df_min['DIFF'] < 100)]
TEST_DF = df_min2[['perf_id', 'es_id']].pipe(pd.merge, ES, on='es_id').pipe(pd.merge, PERF, on='perf_id')
```
You can also split `PERF` into chunks per `pid`, do the selection, and then concatenate those smaller chunks:
```python
def join_in_chunks(perf, es):
    for p, chunk in perf.groupby('pid'):
        df = pd.merge(chunk, es, on='pid')
        yield df[(df['TSTAMP-INIT'] < df['timestamp'] + 100) & (df['timestamp'] < df['TSTAMP-FIN'] + 100) & (df['DIFF'] < 100)]

TEST_DF = pd.concat(join_in_chunks(PERF, ES), ignore_index=True)
```
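Another option, not in the answer above: since each pid's PERF timestamps are sorted, `np.searchsorted` can locate the slice of rows inside each `[TSTAMP-INIT, TSTAMP-FIN]` window with two binary searches, without building any merged DataFrame at all. A minimal sketch on made-up data for a single pid and a single window:

```python
import numpy as np

# made-up sorted timestamps and counter values for one pid
timestamps = np.array([100, 120, 140, 160, 180])
values = np.array([1, 2, 3, 4, 5])

init, fin = 115, 165  # one made-up ES window

# first index with timestamp >= init, first index with timestamp > fin
lo = np.searchsorted(timestamps, init, side='left')
hi = np.searchsorted(timestamps, fin, side='right')

window_sum = values[lo:hi].sum()     # rows at timestamps 120, 140, 160
window_mean = values[lo:hi].mean()
```

The two searches cost O(log n) per ES row, versus a scan of the whole pid slice per row in the original loop.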