Check whether a value in a pandas DataFrame falls between any pair of values in two columns of another DataFrame

Question

I have two DataFrames of different lengths: dfSamples (63,012,375 rows) and dfFixations (200,000 rows).

dfSamples = pd.DataFrame({'tSample':[4, 6, 8, 10, 12, 14]})  
dfFixations = pd.DataFrame({'tStart':[4,12],'tEnd':[8,14]})

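I want to check whether each value in dfSamples falls within any of the ranges given by dfFixations, and then assign a label to that value. I found this question: "Check if value in a dataframe is between two values in another dataframe", but the loop solution is very slow and I could not get any of the other solutions to work.

A working (but very slow) example: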

labels = np.empty_like(dfSamples['tSample']).astype(np.chararray)
for i, fixation in dfFixations.iterrows():
    log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd'])
    labels[log_range] = 'fixation'
labels[labels != 'fixation'] = 'no_fixation'
dfSamples['labels'] = labels

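Following this example, "Performance of Pandas apply vs np.vectorize to create new column from existing columns", I tried to vectorize it, but without success.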

def check_range(samples, tstart, tend):
    log_range = (samples > tstart) & (samples < tend)
    return log_range
fixations = list(map(check_range, dfSamples['tSample'], dfFixations['tStart'], dfFixations['tEnd']))

Any help would be greatly appreciated!

Answer 1

Use IntervalIndex.from_arrays together with IntervalIndex.get_indexer. The latter returns -1 where there is no match, so test for that and set the output with numpy.where:

i = pd.IntervalIndex.from_arrays(dfFixations['tStart'],
                                 dfFixations['tEnd'], 
                                 closed="both")
pos = i.get_indexer(dfSamples['tSample'])
dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation")

print (dfSamples)
   tSample       labels
0        4     fixation
1        6     fixation
2        8     fixation
3       10  no_fixation
4       12     fixation
5       14     fixation
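To make the -1 convention concrete, here is a minimal sketch on the same sample data that looks at the raw positions returned by get_indexer before they are turned into labels. The column name fixation_id is a hypothetical addition for illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd

dfSamples = pd.DataFrame({"tSample": [4, 6, 8, 10, 12, 14]})
dfFixations = pd.DataFrame({"tStart": [4, 12], "tEnd": [8, 14]})

# One closed interval [tStart, tEnd] per fixation row
ii = pd.IntervalIndex.from_arrays(dfFixations["tStart"],
                                  dfFixations["tEnd"],
                                  closed="both")

# Position of the matching interval for each sample, -1 when none matches
pos = ii.get_indexer(dfSamples["tSample"])
print(pos.tolist())  # [0, 0, 0, -1, 1, 1]

# Hypothetical extra column: which fixation row each sample belongs to
dfSamples["fixation_id"] = pos
dfSamples["labels"] = np.where(pos != -1, "fixation", "no_fixation")
print(dfSamples)
```

Because pos holds row positions in dfFixations, the same array can be reused to look up other per-fixation attributes, not only to build the label.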

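Performance: the timings below use ideal, nicely sorted, non-overlapping data. On real data the performance may differ, so it is best to test.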

dfSamples = pd.DataFrame({'tSample':range(10000)})  
dfFixations = pd.DataFrame({'tStart':range(0, 10000, 5),'tEnd':range(2, 10000, 5)})
    


In [165]: %%timeit
     ...: labels = np.empty_like(dfSamples['tSample']).astype(np.chararray)
     ...: for i, fixation in dfFixations.iterrows():
     ...:     log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd'])
     ...:     labels[log_range] = 'fixation'
     ...: labels[labels != 'fixation'] = 'no_fixation'
     ...: dfSamples['labels'] = labels
     ...: 
     ...: 
1.25 s ± 52.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [168]: %%timeit
     ...: ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
     ...: dfSamples["labels1"] =  np.where(dfSamples["tSample"].apply(ii.contains).apply(any), "fixation", "no_fixation")
     ...: 
315 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [170]: %%timeit
     ...: ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
     ...: contained = np.logical_or.reduce(piso.contains(ii, dfSamples["tSample"], include_index=False), axis=0)
     ...: dfSamples["labels1"] = np.where(contained, "fixation", "no_fixation")
     ...: 
82.4 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [166]: %%timeit
     ...: s = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
     ...: pos = s.get_indexer(dfSamples['tSample'])
     ...: dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation")
     ...: 
27.8 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
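The timings above come from IPython's %%timeit. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. The helper name label_with_get_indexer is hypothetical, and the numbers it prints will depend on the machine and on the data.

```python
import timeit

import numpy as np
import pandas as pd

dfSamples = pd.DataFrame({"tSample": range(10000)})
dfFixations = pd.DataFrame({"tStart": range(0, 10000, 5),
                            "tEnd": range(2, 10000, 5)})


def label_with_get_indexer(samples, fixations):
    """Label each sample using IntervalIndex.get_indexer (non-overlapping intervals)."""
    ii = pd.IntervalIndex.from_arrays(fixations["tStart"], fixations["tEnd"], closed="both")
    pos = ii.get_indexer(samples["tSample"])
    return np.where(pos != -1, "fixation", "no_fixation")


labels = label_with_get_indexer(dfSamples, dfFixations)

# Run the function 5 times per measurement and repeat the measurement 3 times
timings = timeit.repeat(lambda: label_with_get_indexer(dfSamples, dfFixations),
                        repeat=3, number=5)
print(f"best of 3: {min(timings) / 5 * 1e3:.2f} ms per call")
```

Swapping in a slice of the real data for dfSamples and dfFixations gives a more representative comparison than the synthetic ranges.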

Answer 2

Setup

dfSamples = pd.DataFrame({'tSample':[4, 6, 8, 10, 12, 14]})  
dfFixations = pd.DataFrame({'tStart':[4,12],'tEnd':[8,14]})

Solution

Create an interval index from the start and end points

ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")

ii.contains is a method that checks whether a point is contained in each interval of the interval index, for example

dfSamples["tSample"].apply(ii.contains)

gives

0     [True, False]
1     [True, False]
2     [True, False]
3    [False, False]
4     [False, True]
5     [False, True]
Name: tSample, dtype: object

We take this result and apply the any function to each element (a list) to get a pandas.Series of booleans, which we can then use with numpy.where

dfSamples["labels"] =  np.where(dfSamples["tSample"].apply(ii.contains).apply(any), "fixation", "no_fixation")

Result

   tSample       labels
0        4     fixation
1        6     fixation
2        8     fixation
3       10  no_fixation
4       12     fixation
5       14     fixation
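Whether the boundary values 4, 8, 12 and 14 count as inside a fixation depends on the closed argument passed to IntervalIndex.from_arrays. The following sketch, which uses a hypothetical results dictionary for illustration, runs the same samples through all four settings.

```python
import pandas as pd

samples = pd.Series([4, 6, 8, 10, 12, 14], name="tSample")
starts, ends = [4, 12], [8, 14]

results = {}
for closed in ("both", "left", "right", "neither"):
    ii = pd.IntervalIndex.from_arrays(starts, ends, closed=closed)
    # True where the sample falls inside some interval under this setting
    results[closed] = (ii.get_indexer(samples) != -1).tolist()
    print(f"{closed:>7}: {results[closed]}")
```

With closed="both" the endpoints 8 and 14 are labelled as fixations. With closed="left" they are not, while 4 and 12 still are.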

Edit: a faster version

Using piso v0.6.0

import numpy as np
import pandas as pd
import piso

ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
# One row of booleans per interval; reduce with OR across the intervals
contained = np.logical_or.reduce(piso.contains(ii, dfSamples["tSample"], include_index=False), axis=0)
dfSamples["labels"] = np.where(contained, "fixation", "no_fixation")

This runs in a time similar to @jezrael's solution, but it can also handle the case where intervals overlap, for example

dfFixations = pd.DataFrame({'tStart':[4,5,12],'tEnd':[8,9,14]})

Note: I am the creator of piso. If you have any feedback or questions, please feel free to get in touch.
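For readers who prefer to avoid an extra dependency, the overlapping case can also be sketched with NumPy alone. This is an alternative technique, not the method used in either answer. It assumes closed intervals with tStart <= tEnd in every row. A sample lies in at least one interval exactly when more intervals have started at or before it than have ended strictly before it.

```python
import numpy as np
import pandas as pd

dfSamples = pd.DataFrame({"tSample": [4, 6, 8, 10, 12, 14]})
# Overlapping fixations: [4, 8] and [5, 9] share the range [5, 8]
dfFixations = pd.DataFrame({"tStart": [4, 5, 12], "tEnd": [8, 9, 14]})

# Sort starts and ends independently; only the counts matter, not the pairing
starts = np.sort(dfFixations["tStart"].to_numpy())
ends = np.sort(dfFixations["tEnd"].to_numpy())
x = dfSamples["tSample"].to_numpy()

started = np.searchsorted(starts, x, side="right")  # intervals with tStart <= x
finished = np.searchsorted(ends, x, side="left")    # intervals with tEnd < x

# At least one interval is still open at x
contained = started > finished
dfSamples["labels"] = np.where(contained, "fixation", "no_fixation")
print(dfSamples)
```

The two binary searches keep the cost at roughly O((n + m) log m) for n samples and m fixations, which matters at the row counts mentioned in the question.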

Answer 1
3 · accepted · 2021-11-04 12:05:32

Answer 2
1 · 2021-11-03 13:27:58
