Introducing multithreading does not reduce the execution time of a Python program

Question

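I am new to Python and am using it for the first time to process pcap files. So far I have a program that extracts the packets belonging to specific IPs and a specific protocol and writes them to a new pcap file.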

from scapy.all import *
import os
import re
import glob

def process_pcap(path, hosts, ports):
    pktdump = PcapWriter("temp11.pcap", append=True, sync=True)
    count = 0
    for pcap in glob.glob(os.path.join(path, '*.pcapng')):
        print "Reading file", pcap
        packets=rdpcap(pcap)
        for pkt in packets:
            if (TCP in pkt and (pkt[TCP].sport in ports or pkt[TCP].dport in ports)):
                if (pkt[IP].src in hosts or pkt[IP].dst in hosts):
                    count=count+1
                    print "Writing packets " , count
                    #wrpcap("temp.pcap", pkt)
                    pktdump.write(pkt)


path="\workspace\pcaps"
file_ip = open('ip_list.txt', 'r') #Text file with many ip address
o = file_ip.read()
hosts = re.findall( r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", o )
ports=[443] # Protocols to be added in filter
process_pcap(path, hosts, ports)

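This code takes too long, because the list of IPs to match can contain 1,000 addresses and the pcap files in the directory can be gigabytes in size. That is why I need to introduce multithreading. To do so, I changed the code as follows: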

from scapy.all import *
import os
import re
import glob
import threading


def process_packet(pkt, pktdump, packets, ports):
    count = 0
    if (TCP in pkt and (pkt[TCP].sport in ports or pkt[TCP].dport in ports)):
        if (pkt[IP].src in hosts or pkt[IP].dst in hosts):
            count=count+1
            print "Writing packets " , count
            #wrpcap("temp.pcap", pkt)
            pktdump.write(pkt)


def process_pcap(path, hosts, ports):
    pktdump = PcapWriter("temp11.pcap", append=True, sync=True)
    ts=list()
    for pcap in glob.glob(os.path.join(path, '*.pcapng')):
        print "Reading file", pcap
        packets=rdpcap(pcap)
        for pkt in packets:
            t=threading.Thread(target=process_packet,args=(pkt,pktdump, packets,ports,))
            ts.append(t)
            t.start()
    for t in ts:
        t.join()


path="\workspace\pcaps"
file_ip = open('ip_list.txt', 'r') #Text file with many ip address
o = file_ip.read()
hosts = re.findall( r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", o )
ports=[443] # Protocols to be added in filter
process_pcap(path, hosts, ports)

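But I don't think I have done this in the best way, because the running time has not gone down.

Any suggestions, please!

Edit:

I changed the code according to the answer. It runs, but the threads do not terminate on their own. None of the multithreading examples for Python need to terminate threads explicitly. Please point out the problem in this code: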

from scapy.all import *
import os
import re
import glob
import threading
import Queue
import multiprocessing

#global variables declaration

path="\pcaps"
pcapCounter = len(glob.glob1(path,"*.pcapng")) #size of the queue
q = Queue.Queue(pcapCounter) # queue to hold all pcaps in directory
pcap_lock = threading.Lock()
ports=[443] # Protocols to be added in filter


def safe_print(content):
    print "{0}\n".format(content),

def process_pcap (hosts):
    content = "Thread no ", threading.current_thread().name, " in action"
    safe_print(content)
    if not q.empty():
        with pcap_lock:
            content = "IN LOCK ", threading.current_thread().name
            safe_print(content)
            pcap=q.get()

        content = "OUT LOCK", threading.current_thread().name, " and reading packets from ", pcap
        safe_print(content)   
        packets=rdpcap(pcap)


        pktdump = PcapWriter(threading.current_thread().name+".pcapng", append=True, sync=True)
        pList=[]
        for pkt in packets:
            if (TCP in pkt and (pkt[TCP].sport in ports or pkt[TCP].dport in ports)):
                if (pkt[IP].src in hosts or pkt[IP].dst in hosts):
                    pList.append(pkt)

                    content="Wrting Packets to pcap ", threading.current_thread().name
                    safe_print(content)
                    pktdump.write(pList) 


    else:
        content = "DONE!! QUEUE IS EMPTY", threading.current_thread().name
        safe_print(content)


for pcap in glob.glob(os.path.join(path, '*.pcapng')):
    q.put(pcap)

file_ip = open('ip_list.txt', 'r') #Text file with many ip addresses
o = file_ip.read()
hosts = re.findall( r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", o )
threads = []
cpu = multiprocessing.cpu_count() 
for i in range(cpu):
    t = threading.Thread(target=process_pcap, args=(hosts,), name = i)
    t.start()
    threads.append(t)

for t in threads:
    t.join()


print "Exiting Main Thread"

This is the output of the program above; it never prints "Exiting Main Thread":

('Thread no ', 'Thread-1', ' in action')
('Thread no ', '3', ' in action')
('Thread no ', '1', ' in action')
('Thread no ', '2', ' in action')
('IN LOCK ', 'Thread-1')
('IN LOCK ', '3')
('OUT LOCK', 'Thread-1', ' and reading packets from ', 'path to\\test.pcapng')
('OUT LOCK', '3', ' and reading packets from ', 'path to\\test11.pcapng')
('IN LOCK ', '1')
('Wrting Packets to pcap ', '3')
('Wrting Packets to pcap ', 'Thread-1')

Edit 2: I now take the lock before checking the queue length, and everything works.

Thank you.
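
A note on why the first edited version hangs, offered as a likely explanation rather than one confirmed by the author: `q.empty()` is checked outside the lock, so two threads can both see the last remaining item, and the thread that loses the race blocks forever in `q.get()` while still holding `pcap_lock`. That is consistent with the output above, where thread 1 prints IN LOCK but never OUT LOCK. Taking the lock before the check fixes it. An alternative that needs no explicit lock is `get_nowait()`, which raises `queue.Empty` instead of blocking. The sketch below is written for Python 3 (`queue` rather than `Queue`), and `drain`, `run_workers`, and `handle` are hypothetical names.

```python
import queue
import threading


def drain(q, handle):
    """Worker loop: take items until the queue is empty, then return."""
    while True:
        try:
            item = q.get_nowait()  # never blocks; raises queue.Empty instead
        except queue.Empty:
            return                 # nothing left, so the thread ends on its own
        handle(item)


def run_workers(items, handle, n_threads=4):
    """Fill the queue first, then let n_threads workers drain it."""
    q = queue.Queue()
    for item in items:
        q.put(item)
    threads = [threading.Thread(target=drain, args=(q, handle))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because each worker loops until the queue is empty, a thread also picks up more than one file, which the version in the question does not do.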

Answer 1

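You are creating one thread per packet. That is the fundamental problem.

In addition, you perform an I/O step for every packet processed instead of writing packets in batches.

Your PC probably has somewhere between 1 and 10 cores. For the number of packets you are processing, the overhead of creating 1,000+ threads outweighs the value of the parallelism each core provides. Returns diminish very quickly once you run more threads than there are cores available.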
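
One way to cap the number of threads, sketched here with the standard library's `ThreadPoolExecutor` in Python 3: the work is split into one chunk per worker instead of one thread per packet. The names `matches`, `filter_chunk`, and `filter_with_pool` are hypothetical, and a record is assumed to be a plain `(src, dst, sport, dport)` tuple standing in for a scapy packet. This only shows how to bound the thread count; it makes no claim about speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def matches(record, hosts, ports):
    """Stand-in for the per-packet test; record is (src, dst, sport, dport)."""
    src, dst, sport, dport = record
    return (sport in ports or dport in ports) and (src in hosts or dst in hosts)


def filter_chunk(chunk, hosts, ports):
    """Filter one chunk of records and return the matches as a batch."""
    return [r for r in chunk if matches(r, hosts, ports)]


def filter_with_pool(records, hosts, ports, workers=None):
    """Split records into at most `workers` chunks and filter them in a pool."""
    workers = workers or os.cpu_count() or 1
    size = max(1, -(-len(records) // workers))  # ceiling division
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda c: filter_chunk(c, hosts, ports), chunks)
        return [r for part in parts for r in part]  # map keeps chunk order
```

Passing `hosts` and `ports` as sets rather than lists keeps each membership test cheap when the IP list is long.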

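Here is a better approach, one in which you will actually realize the benefits of parallelism.

The main thread creates a global queue and a lock to be shared by the worker threads. Before creating any threads, the main thread enumerates the list of *.pcapng files and puts each filename into the queue. It also reads the list of IP addresses used to filter packets.

Then spawn N threads, where N is the number of cores on your device (N = os.cpu_count(), or multiprocessing.cpu_count() on Python 2).

Each thread enters the lock, pops the next file from the queue established by the main thread, then releases the lock. The thread then reads the file into a packets list and removes the packets it doesn't need. It then saves the result to a separate, unique file that represents the filtered result of the original input file. Ideally, the pktdump object supports writing multiple packets at once, since batching I/O operations saves a lot of time.

After finishing a single file, the thread re-enters the lock, pops the next file from the queue, releases the lock, and repeats the processing for that file.

When the queue of filenames is empty, the thread exits.

The main thread waits for all N threads to finish. You now have a set of K files to merge. The main thread only needs to reopen the K files the threads created and concatenate them back into a single output file.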
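
The steps above can be sketched as follows, in Python 3 and using only the standard library. Plain text files stand in for pcap files so the example runs without scapy; in the real program, `filter_file` would presumably call `rdpcap`, filter the packets, and write the survivors through a `PcapWriter`. All function names here are hypothetical, input files are assumed to have unique basenames, and `get_nowait()` is swapped in for the explicit lock around the pop, since `queue.Queue` does its own locking.

```python
import os
import queue
import threading


def filter_file(in_path, out_path, keep):
    """Read one input file, keep matching lines, write them in one batch."""
    with open(in_path) as src:
        kept = [line for line in src if keep(line)]
    with open(out_path, "w") as dst:
        dst.writelines(kept)  # a single batched write per input file


def worker(file_queue, out_dir, keep, outputs, outputs_lock):
    """Process files until the queue is empty, then exit."""
    while True:
        try:
            in_path = file_queue.get_nowait()
        except queue.Empty:
            return
        name = os.path.basename(in_path) + ".filtered"
        out_path = os.path.join(out_dir, name)
        filter_file(in_path, out_path, keep)
        with outputs_lock:
            outputs.append(out_path)


def run(in_paths, out_dir, merged_path, keep, n_threads=None):
    """Queue every file, run N workers, then merge the K filtered files."""
    n_threads = n_threads or os.cpu_count() or 1
    file_queue = queue.Queue()
    for path in in_paths:  # fill the queue before starting any thread
        file_queue.put(path)
    outputs, outputs_lock = [], threading.Lock()
    threads = [threading.Thread(target=worker,
                                args=(file_queue, out_dir, keep,
                                      outputs, outputs_lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    with open(merged_path, "w") as merged:
        for out_path in sorted(outputs):  # deterministic merge order
            with open(out_path) as part:
                merged.write(part.read())
    return merged_path
```

Each worker writes only to its own output file, so no lock is needed around the writes themselves; the only shared state is the queue and the list of finished files.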

Answer 2

This is how Python works with threads; read up on the GIL. If you want parallel execution, you should use multiprocessing.
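
A minimal sketch of the multiprocessing route, in Python 3: each worker process is a separate interpreter with its own GIL, so CPU-bound filtering can run on several cores at once. The names `filter_batch` and `filter_all` are hypothetical, and each batch is assumed to be a list of `(src, dst, sport, dport)` tuples standing in for the packets of one pcap file.

```python
import multiprocessing


def filter_batch(job):
    """Per-file job run in a worker process; returns the matching records."""
    records, hosts, ports = job
    return [
        (src, dst, sport, dport)
        for (src, dst, sport, dport) in records
        if (sport in ports or dport in ports) and (src in hosts or dst in hosts)
    ]


def filter_all(batches, hosts, ports, processes=None):
    """Filter every batch in a pool; processes=None uses one per CPU."""
    jobs = [(batch, hosts, ports) for batch in batches]
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(filter_batch, jobs)  # results keep the batch order
```

Call `filter_all` from inside an `if __name__ == "__main__":` block; platforms that start workers by spawning a fresh interpreter re-import the main module, and the guard keeps them from launching pools of their own. Arguments and results are pickled between processes, so passing file paths and letting each worker read its own file avoids copying large packet lists.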


Solution 1
3, accepted, 2019-11-07 08:03:56

Solution 2
2, 2019-11-07 07:51:43
