Increasing throughput in a Python script

Question

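I'm processing a list of thousands of domain names from a DNSBL through dig, creating a CSV of URLs and IPs. This is a very time-consuming process that can take several hours. My server's DNSBL updates every fifteen minutes. Is there a way I can increase throughput in my Python script to keep pace with the server's updates?

Edit: the script, as requested.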

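import re
import subprocess as sp

# Read the domain list, one name per line.
with open("domainslist", 'r') as text:
    domains = re.split("\n+", text.read().strip())

with open('final.csv', 'w') as file:
    for url in domains:
        try:
            # Ask dig for the short answer and keep its first line.
            ip = sp.Popen(["dig", "+short", url], stdout=sp.PIPE,
                          universal_newlines=True)
            ip = re.split("\n+", ip.stdout.read())
            file.write(url + "," + ip[0] + "\n")
        except OSError:
            # dig could not be started; skip this entry.
            pass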

Answer 1

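The vast majority of the time here is spent in the external calls to dig, so to improve the speed you'll need to multithread. This will allow you to run multiple calls to dig at the same time. See for example: Python Subprocess.Popen from a thread. Or, you can use Twisted ( http://twistedmatrix.com/trac/ ).

Edit: You're right, much of that was unnecessary.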
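
A minimal sketch of the multithreaded approach, assuming Python 3 and that dig is on the PATH. The names dig_short and resolve_all and the default of 50 workers are illustrative choices, not something from the question; the lookup parameter exists only so a different resolver can be swapped in.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def dig_short(domain):
    """Return the first line printed by `dig +short domain`, or "" if none."""
    out = subprocess.run(
        ["dig", "+short", domain],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    ).stdout
    lines = out.splitlines()
    return lines[0] if lines else ""


def resolve_all(domains, lookup=dig_short, max_workers=50):
    """Run lookup() for every domain on a pool of worker threads.

    Returns (domain, answer) pairs in the same order as the input.
    """
    domains = list(domains)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(lookup, domains))
    return list(zip(domains, answers))
```

Because each thread spends nearly all of its time waiting on dig, the pool can be much larger than the number of CPU cores; the right size depends on what your DNS server tolerates.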

Answer 2

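Well, it's probably the name resolution that's taking you so long. If you count that out (i.e., if dig somehow returned very quickly), Python should be able to deal with thousands of entries easily.

That said, you should try a threaded approach. In theory this would resolve several addresses at the same time instead of sequentially. You could just as well continue to use dig for that, and it should be trivial to modify my example code below to do so, but to make things interesting (and hopefully more pythonic), let's use an existing module for that: dnspython

So, install it with: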

sudo pip install -f http://www.dnspython.org/kits/1.8.0/ dnspython

And then try something like the following:

import threading
from dns import resolver

class Resolver(threading.Thread):
    def __init__(self, address, result_dict):
        threading.Thread.__init__(self)
        self.address = address
        self.result_dict = result_dict

    def run(self):
        try:
            result = resolver.query(self.address)[0].to_text()
            self.result_dict[self.address] = result
        except (resolver.NXDOMAIN, resolver.NoAnswer):
            # Skip names that do not exist or have no A record.
            pass


def main():
    infile = open("domainlist", "r")
    intext = infile.readlines()
    threads = []
    results = {}
    for address in [address.strip() for address in intext if address.strip()]:
        resolver_thread = Resolver(address, results)
        threads.append(resolver_thread)
        resolver_thread.start()

    for thread in threads:
        thread.join()

    # items() works on both Python 2 and 3; iteritems() is Python 2 only.
    with open('final.csv', 'w') as outfile:
        outfile.write("\n".join("%s,%s" % (address, ip) for address, ip in results.items()))

if __name__ == '__main__':
    main()

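If it turns out that starting that many threads at the same time is too much, you could try doing it in batches, or using a queue (see http://www.ibm.com/developerworks/aix/library/au-threadingpython/ for an example).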
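
The queue variant could look roughly like this. It is a sketch assuming Python 3; resolve_with_queue, the lookup callable and the default of 20 workers are made-up names and values for illustration. lookup stands in for whatever does the actual resolution, for example a small wrapper around resolver.query.

```python
import queue
import threading


def resolve_with_queue(addresses, lookup, num_workers=20):
    """Resolve addresses using a fixed number of worker threads.

    `lookup` is any callable that maps a name to an IP string.
    """
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            address = tasks.get()
            if address is None:  # sentinel: no more work for this thread
                break
            try:
                ip = lookup(address)
            except Exception:
                continue  # skip names that fail to resolve
            with lock:
                results[address] = ip

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for address in addresses:
        tasks.put(address)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

Here the number of threads stays fixed no matter how long the domain list is, which is the main difference from starting one thread per address.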

Answer 3

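I'd consider using a pure-Python library to do the DNS queries, rather than delegating to dig, because invoking another process can be relatively time-consuming. (Of course, looking anything up on the internet is also relatively time-consuming, so what gilesc said about multithreading still applies.) A Google search for python dns will give you some options to get started with.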
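
As a rough illustration of doing the lookup in-process, here is a sketch that uses the standard library's socket.gethostbyname. Note that this goes through the system resolver rather than a dedicated DNS library, so it is only a stand-in for the libraries a search would turn up. The lookup wrapper and its resolve parameter are hypothetical names.

```python
import socket


def lookup(name, resolve=socket.gethostbyname):
    """Return an IPv4 address for name, or None if it cannot be resolved."""
    try:
        return resolve(name)
    except (OSError, UnicodeError):
        # socket.gaierror is a subclass of OSError; UnicodeError covers
        # names that cannot be encoded as valid hostnames.
        return None
```

No child process is started per name, but each call still blocks on the network, so it would normally be combined with threads.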

Answer 4

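In order to keep pace with the server updates, the script must take less than 15 minutes to execute. Does your script take 15 minutes to run? If it doesn't take 15 minutes, you're done!

I would investigate caching and diffs from previous runs in order to increase performance.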
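
One way the caching idea might be sketched, assuming Python 3 and that the previous run left a url,ip CSV like the final.csv from the question. load_previous, update_results and lookup are hypothetical names, and reusing an old answer assumes the IP has not changed since the last run.

```python
import csv
import os


def load_previous(path):
    """Read an earlier url,ip CSV into a dict; empty if the file is absent."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="") as handle:
        return {row[0]: row[1] for row in csv.reader(handle) if len(row) >= 2}


def update_results(domains, previous, lookup):
    """Reuse cached answers and resolve only names not seen before.

    Names that are no longer in `domains` are dropped from the result.
    """
    results = {}
    for domain in domains:
        if domain in previous:
            results[domain] = previous[domain]
        else:
            results[domain] = lookup(domain)
    return results
```

If the list changes little between 15-minute updates, only the handful of new names needs a DNS query on each run.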

4 answers

Answer 1
2 2010-06-22 00:57:25

Answer 2
2 (accepted) 2010-06-22 11:43:55

Answer 3
0 2010-06-22 01:41:22

Answer 4
0 2010-06-22 03:34:21
