Row-wise comparison of two pandas DataFrames

Question

I have two pandas DataFrames:

flows:
------
sourceIPAddress     destinationIPAddress    flowStartMicroseconds       flowEndMicroseconds 
163.193.204.92      40.8.121.226            2021-05-01 07:00:00.113     2021-05-01 07:00:00.113962
104.247.103.181     163.193.124.92          2021-05-01 07:00:00.074     2021-05-01 07:00:00.101026
17.254.170.53       163.193.124.133         2021-05-01 07:00:00.077     2021-05-01 07:00:00.083874
18.179.96.152       203.179.250.96          2021-05-01 07:00:00.112     2021-05-01 07:00:00.098296
133.103.144.34      13.154.212.11           2021-05-01 07:00:00.101     2021-05-01 07:00:00.112013

attacks:
--------
datetime                    srcIP           dstIP
2021-05-01 07:00:00.055210  188.67.130.72   133.92.239.153   
2021-05-01 07:00:00.055500  45.100.34.74    203.179.180.153   
2021-05-01 07:00:00.055351  103.113.29.26   163.193.242.75   
2021-05-01 07:00:00.056209  128.215.229.101 163.193.94.194   
2021-05-01 07:00:00.055258  45.111.22.11    163.193.138.139

I want to check whether each row of flows matches any row of attacks, where a match means:

(attacks[srcIP] == flows[sourceIPAddress] || attacks[srcIP] == flows[destinationIPAddress])
&&
(attacks[dstIP] == flows[sourceIPAddress] || attacks[dstIP] == flows[destinationIPAddress])
&&
attacks[datetime] between flows[flowStartMicroseconds] and flows[flowEndMicroseconds]

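Is there a more efficient way to do this than simply iterating over the rows?

Edit: The DataFrames are very large. I have included the head() of each.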

flows = {'sourceIPAddress': {510: '163.193.204.92',
  564: '104.247.103.181',
  590: '17.254.170.53',
  599: '18.179.96.152',
  1149: '133.103.144.34'},
 'destinationIPAddress': {510: '40.8.121.226',
  564: '163.193.124.92',
  590: '163.193.124.133',
  599: '203.179.250.96',
  1149: '13.154.212.11'},
 'flowStartMicroseconds': {510: Timestamp('2021-05-01 07:00:00.113000'),
  564: Timestamp('2021-05-01 07:00:00.074000'),
  590: Timestamp('2021-05-01 07:00:00.077000'),
  599: Timestamp('2021-05-01 07:00:00.112000'),
  1149: Timestamp('2021-05-01 07:00:00.101000')},
 'flowEndMicroseconds': {510: Timestamp('2021-05-01 07:00:00.113962'),
  564: Timestamp('2021-05-01 07:00:00.083874'),
  590: Timestamp('2021-05-01 07:00:00.098296'),
  599: Timestamp('2021-05-01 07:00:00.112013'),
  1149: Timestamp('2021-05-01 07:00:00.101026')}}

attacks = {'datetime': {0: Timestamp('2021-05-01 07:00:00.055210'),
  1: Timestamp('2021-05-01 07:00:00.055500'),
  2: Timestamp('2021-05-01 07:00:00.055351'),
  3: Timestamp('2021-05-01 07:00:00.056209'),
  4: Timestamp('2021-05-01 07:00:00.055258')},
 'srcIP': {0: '188.67.130.72',
  1: '45.100.34.74',
  2: '103.113.29.26',
  3: '128.215.229.101',
  4: '45.111.22.11'},
 'dstIP': {0: '133.92.239.153',
  1: '203.179.180.153',
  2: '163.193.242.75',
  3: '163.193.94.194',
  4: '163.193.138.139'}}

Answer 1

Merge the two DataFrames with a left join, then look for the intersection of the data.
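A minimal sketch of what that could look like, assuming "intersection" means the rows where the IP pair matches and the attack timestamp falls inside the flow window. The sample data is made up so that one flow matches, and only the forward direction (flow source = attack source) is covered here:

```python
import pandas as pd

# Made-up sample data (not the data from the question), chosen so that one flow matches.
flows = pd.DataFrame({
    "sourceIPAddress": ["10.0.0.1", "10.0.0.3"],
    "destinationIPAddress": ["10.0.0.2", "10.0.0.4"],
    "flowStartMicroseconds": pd.to_datetime(
        ["2021-05-01 07:00:00.100", "2021-05-01 07:00:00.200"]),
    "flowEndMicroseconds": pd.to_datetime(
        ["2021-05-01 07:00:00.150", "2021-05-01 07:00:00.250"]),
})
attacks = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-05-01 07:00:00.120", "2021-05-01 07:00:00.300"]),
    "srcIP": ["10.0.0.1", "10.0.0.3"],
    "dstIP": ["10.0.0.2", "10.0.0.4"],
})

# Left join on the IP pair; "flow_id" keeps track of the original flow row.
merged = flows.rename_axis("flow_id").reset_index().merge(
    attacks,
    how="left",
    left_on=["sourceIPAddress", "destinationIPAddress"],
    right_on=["srcIP", "dstIP"],
)

# A joined row counts only if the attack time lies inside the flow window.
# Flows without any IP match get NaT here, which compares as False.
merged["in_window"] = merged["datetime"].between(
    merged["flowStartMicroseconds"], merged["flowEndMicroseconds"]
)

# Collapse back to one flag per flow (aligned on the original index).
flows["is_attack"] = merged.groupby("flow_id")["in_window"].any()
print(flows[["sourceIPAddress", "destinationIPAddress", "is_attack"]])
```

For the reversed direction the same merge can be repeated with `right_on=["dstIP", "srcIP"]` and the two flags combined with `|`.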

Answer 2

I'm not sure about the performance, but I would proceed as follows.

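For this purpose there are only two kinds of IP: attack IPs and flow IPs. So I would reshape both DataFrames into the following format:
flow_df: (flow_IPAddress, flowStartMicroseconds, flowEndMicroseconds)
attack_df: (attack_IP, datetime)
Then I would merge them with an inner join (left_on="flow_IPAddress", right_on="attack_IP").
Then I would query the result to keep only the rows with valid timestamps (e.g. using the condition you wrote above).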

The resulting df would then look like this:

flow_IPAddress      attack_IP               flowStartMicroseconds       flowEndMicroseconds         datetime
163.193.204.92      40.8.121.226            2021-05-01 07:00:00.113     2021-05-01 07:00:00.113962  2021-05-01 07:00:00.055210
104.247.103.181     163.193.124.92          2021-05-01 07:00:00.074     2021-05-01 07:00:00.101026  2021-05-01 07:00:00.055210
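The reshape, inner join and timestamp filter described in this answer could be sketched roughly as below. The sample data is made up, the `flow_id`, `attack_id`, `flow_role` and `attack_role` columns are helper names introduced for illustration, and the final grouping step is an addition that the answer does not mention: it keeps only flow/attack pairs where both attack IPs were found in the same flow.

```python
import pandas as pd

# Made-up sample data (not the data from the question).
flows = pd.DataFrame({
    "sourceIPAddress": ["10.0.0.1", "10.0.0.3"],
    "destinationIPAddress": ["10.0.0.2", "10.0.0.4"],
    "flowStartMicroseconds": pd.to_datetime(
        ["2021-05-01 07:00:00.100", "2021-05-01 07:00:00.200"]),
    "flowEndMicroseconds": pd.to_datetime(
        ["2021-05-01 07:00:00.150", "2021-05-01 07:00:00.250"]),
})
attacks = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-05-01 07:00:00.120", "2021-05-01 07:00:00.220"]),
    "srcIP": ["10.0.0.2", "10.0.0.3"],
    "dstIP": ["10.0.0.1", "10.0.0.9"],
})

# Long format: one row per (flow, IP) and one row per (attack, IP).
flow_df = flows.rename_axis("flow_id").reset_index().melt(
    id_vars=["flow_id", "flowStartMicroseconds", "flowEndMicroseconds"],
    value_vars=["sourceIPAddress", "destinationIPAddress"],
    var_name="flow_role", value_name="flow_IPAddress",
)
attack_df = attacks.rename_axis("attack_id").reset_index().melt(
    id_vars=["attack_id", "datetime"],
    value_vars=["srcIP", "dstIP"],
    var_name="attack_role", value_name="attack_IP",
)

# Inner join on the IP, then keep rows whose attack time is inside the flow window.
merged = flow_df.merge(attack_df, left_on="flow_IPAddress", right_on="attack_IP")
in_window = merged["datetime"].between(
    merged["flowStartMicroseconds"], merged["flowEndMicroseconds"]
)
candidates = merged[in_window]

# Addition to the answer: require that both srcIP and dstIP of the attack
# matched an IP of the same flow.
roles_matched = candidates.groupby(["flow_id", "attack_id"])["attack_role"].nunique()
matched_pairs = roles_matched[roles_matched == 2].index.tolist()
print(matched_pairs)
```

Without the last step, a flow that shares only one IP with an attack (the second flow here) would still show up in `candidates`.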

Answer 3

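Solution: a database

My solution was to import the two DataFrames into PostgreSQL, create two new tables for the forward and backward IP matches, and then union them.

Two separate joins are much faster than performing one giant join.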

create table attacks_forward as 
SELECT
flows.*, attacks."label", attacks."sublabel"
FROM
    flows
JOIN attacks 
    ON flows."sourceIPAddress" = attacks."srcIP" 
    and flows."destinationIPAddress" = attacks."dstIP"
    and attacks."datetime" between flows."flowStartMicroseconds" and flows."flowEndMicroseconds";
    
   
create table attacks_backward as 
SELECT
flows.*, attacks."label", attacks."sublabel"
FROM
    flows
JOIN attacks 
    ON flows."sourceIPAddress" = attacks."dstIP" 
    and flows."destinationIPAddress" = attacks."srcIP"
    and attacks."datetime" between flows."flowStartMicroseconds" and flows."flowEndMicroseconds";

create table attacks_flows as 
SELECT * FROM attacks_forward
UNION ALL
SELECT * FROM attacks_backward;
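For anyone who wants to try the forward/backward split without a PostgreSQL server, here is a rough sketch that swaps in an in-memory SQLite database from the standard library. This is not the setup described above: the sample data is made up, the `label` and `sublabel` columns are left out, `direction` and `attack_time` are illustrative column names, and the timestamps are stored as fixed-width text so that BETWEEN compares them correctly.

```python
import sqlite3
import pandas as pd

# Made-up sample data (not the data from the question).
flows = pd.DataFrame({
    "sourceIPAddress": ["10.0.0.1", "10.0.0.3"],
    "destinationIPAddress": ["10.0.0.2", "10.0.0.4"],
    "flowStartMicroseconds": pd.to_datetime(
        ["2021-05-01 07:00:00.100", "2021-05-01 07:00:00.200"]),
    "flowEndMicroseconds": pd.to_datetime(
        ["2021-05-01 07:00:00.150", "2021-05-01 07:00:00.250"]),
})
attacks = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2021-05-01 07:00:00.120",   # forward match for the first flow
        "2021-05-01 07:00:00.220",   # backward match for the second flow
        "2021-05-01 07:00:00.900",   # right IPs, but outside the flow window
    ]),
    "srcIP": ["10.0.0.1", "10.0.0.4", "10.0.0.1"],
    "dstIP": ["10.0.0.2", "10.0.0.3", "10.0.0.2"],
})

def timestamps_as_text(df, columns):
    """Return a copy with the given datetime columns as fixed-width strings."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].dt.strftime("%Y-%m-%d %H:%M:%S.%f")
    return out

con = sqlite3.connect(":memory:")
timestamps_as_text(
    flows, ["flowStartMicroseconds", "flowEndMicroseconds"]
).to_sql("flows", con, index=False)
timestamps_as_text(attacks, ["datetime"]).to_sql("attacks", con, index=False)

# Two separate joins (forward and backward), combined with UNION ALL.
query = """
SELECT f.*, a."datetime" AS attack_time, 'forward' AS direction
FROM flows AS f
JOIN attacks AS a
  ON f."sourceIPAddress" = a."srcIP"
 AND f."destinationIPAddress" = a."dstIP"
 AND a."datetime" BETWEEN f."flowStartMicroseconds" AND f."flowEndMicroseconds"
UNION ALL
SELECT f.*, a."datetime" AS attack_time, 'backward' AS direction
FROM flows AS f
JOIN attacks AS a
  ON f."sourceIPAddress" = a."dstIP"
 AND f."destinationIPAddress" = a."srcIP"
 AND a."datetime" BETWEEN f."flowStartMicroseconds" AND f."flowEndMicroseconds"
"""
attacks_flows = pd.read_sql_query(query, con)
con.close()
print(attacks_flows)
```

With a real PostgreSQL instance the timestamp columns can stay as native timestamps, and the text conversion is not needed.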

3 answers

Answer 1
0 2021-11-11 16:18:39

Answer 2
0 2021-11-11 22:23:57

Answer 3
0 Accepted 2021-11-16 14:51:50
