Twisted/Python - processing a large file line by line
I have this code that reads a file and processes it. The file is quite big, 12 million lines, so currently I split it manually into 1000-line files and start a process sequentially for each chunk (bash script).
Is there a way to use Twisted to load one file and process it 1000 items at a time (a progress bar would be nice), without me having to split it manually?
scanner.py
import argparse
import json
import re  # needed by the email regex in cbBody; missing in the original
import socket
from sys import argv
from pprint import pformat

from tqdm import tqdm
from twisted.internet.task import react
from twisted.web.client import Agent, readBody
from twisted.web.http_headers import Headers
import lxml.html
from geoip import geolite2
import pycountry
from tld import get_tld

poweredby = ""
server = ""
ip = ""


def cbRequest(response, url):
    global poweredby, server, ip
    # print 'Response version:', response.version
    # print 'Response code:', response.code
    # print 'Response phrase:', response.phrase
    # print 'Response headers:'
    # print pformat(list(response.headers.getAllRawHeaders()))
    poweredby = response.headers.getRawHeaders("X-Powered-By")[0]
    server = response.headers.getRawHeaders("Server")[0]
    d = readBody(response)
    d.addCallback(cbBody, url)
    return d


def cbBody(body, ourl):
    global poweredby, server, ip
    html_element = lxml.html.fromstring(body)
    generator = html_element.xpath("//meta[@name='generator']/@content")
    ip = socket.gethostbyname(ourl)
    country = ""  # default, so a None lookup result leaves it defined
    try:
        match = geolite2.lookup(ip)
        if match is not None:
            country = match.country
            try:
                c = pycountry.countries.lookup(country)
                country = c.name
            except:
                country = ""
    except:
        country = ""
    try:
        res = get_tld("http://www" + ourl, as_object=True)
        tld = res.suffix
    except:
        tld = ""
    try:
        match = re.search(r'[\w\.-]+@[\w\.-]+', body)
        email = match.group(0)
    except:
        email = ""
    permalink = ourl.rstrip().replace(".", "-")
    try:
        item = generator[0]
        val = "{ \"Domain\":" + json.dumps(
            "http://" + ourl.rstrip()) + ",\"IP\":\"" + ip + "\",\"Server\":" + json.dumps(
            str(server)) + ",\"PoweredBy\":" + json.dumps(
            str(poweredby)) + ",\"MetaGenerator\":" + json.dumps(item) + ",\"Email\":" + json.dumps(
            email) + ",\"Suffix\":\"" + tld + "\",\"CountryHosted\":\"" + country + "\",\"permalink\":\"" + permalink + "\" }"
    except:
        val = "{ \"Domain\":" + json.dumps(
            "http://" + ourl.rstrip()) + ",\"IP\":\"" + ip + "\",\"Server\":" + json.dumps(
            str(server)) + ",\"PoweredBy\":" + json.dumps(
            str(poweredby)) + ",\"MetaGenerator\":\"\",\"Email\":" + json.dumps(
            email) + ",\"Suffix\":\"" + tld + "\",\"CountryHosted\":\"" + country + "\",\"permalink\":\"" + permalink + "\" }"
    print val


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Scanner v0.99')
    parser.add_argument(
        '-i', '--input', help='Input list of domains', required=True)
    args = parser.parse_args()
    input = args.input
    with open(input) as f:
        urls = f.read().splitlines()

    def mainjob(reactor, urls=urls):
        for url in tqdm(urls):
            agent = Agent(reactor)
            d = agent.request(
                'GET', "http://" + url,
                Headers({'User-Agent': ['bot']}),
                None)
            d.addCallback(cbRequest, url)
            d.addErrback(lambda x: None)  # ignore errors
        return d

    react(mainjob, argv[3:])
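As an aside (a suggestion, not part of the original script): the `val` string in `cbBody` is assembled by hand-concatenating quoted fragments, which is fragile. Building a dict and serializing it with `json.dumps` produces the same record with all quoting and escaping handled automatically. A minimal sketch using the same field names the script emits:

```python
import json

def build_record(domain, ip, server, poweredby, generator, email, tld, country):
    # Same fields cbBody concatenates by hand; json.dumps does the quoting.
    record = {
        "Domain": "http://" + domain,
        "IP": ip,
        "Server": server,
        "PoweredBy": poweredby,
        "MetaGenerator": generator,
        "Email": email,
        "Suffix": tld,
        "CountryHosted": country,
        "permalink": domain.replace(".", "-"),
    }
    return json.dumps(record)

print(build_record("example.com", "93.184.216.34", "ECS", "PHP/7.4",
                   "WordPress", "admin@example.com", "com", "United States"))
```

This also guarantees the output is valid JSON even when a value contains quotes or backslashes, which the manual concatenation does not.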
Update 1:
Now I execute it like this:
file.txt - 12,000,000 lines
chunk01.txt - file with 1000 lines
.
.
.
I execute the script once for each chunk file:
python scanner.py chunk01.txt
python scanner.py chunk02.txt
.
.
.
I want to execute the script just once:
python scanner.py file.txt
The problem is that I need to pass the URLs as an argument to react(). If I read all 12,000,000 lines into memory (via f.read()), it is too big. Hence I split the file and run the script on each small chunk.
Hope it is now clearer...
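For context, splitting the file on disk is not required to get fixed-size batches; `itertools.islice` can pull 1000 lines at a time from one open file object, keeping only the current batch in memory. A small stdlib-only sketch (the batch size mirrors the manual 1000-line chunks above):

```python
from itertools import islice

def batches(lines, size=1000):
    """Yield lists of up to `size` stripped lines, reading lazily."""
    it = iter(lines)
    while True:
        batch = [line.strip() for line in islice(it, size)]
        if not batch:
            break
        yield batch

# Usage with a file: only one batch is held in memory at a time.
# with open("file.txt") as f:
#     for batch in batches(f, 1000):
#         process(batch)

for batch in batches(["example.com\n", "example.org\n", "example.net\n"], 2):
    print(batch)
```

This solves only the memory half of the problem; the concurrency half is what the answer below addresses with Twisted.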
Update 2:
Based on @Jean-Paul Calderone's answer, I cooked up the code below.
It seems to work, however I am stumped: after 180,000 iterations, i.e. roughly 180,000 domains (one per input line), the script has only printed ca. 35,707 lines (entries). I would expect something close to 180,000. I know some domains will time out, but when I ran it the "old" way the numbers were more consistent: the number of input domains was close to the number of lines in the output file.
Can there be something "bad" with the code? Any ideas?
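One plausible cause (an assumption, not verified against the asker's data): `cbRequest` does `response.headers.getRawHeaders("X-Powered-By")[0]`, and Twisted's `Headers.getRawHeaders` returns `None` when the header is absent, so any response missing `X-Powered-By` or `Server` raises `TypeError`. The `lambda x: None` errback then swallows that exception, and the line is silently never printed. A guarded lookup would look roughly like this (a sketch; `first_header` is a hypothetical helper, not part of the original script):

```python
def first_header(values, default=""):
    # getRawHeaders(name) yields None for a missing header, so an
    # unguarded values[0] raises TypeError, which the silent errback eats.
    return values[0] if values else default

# In cbRequest this would read (sketch):
#   server = first_header(response.headers.getRawHeaders("Server"))
#   poweredby = first_header(response.headers.getRawHeaders("X-Powered-By"))
print(first_header(None), first_header(["nginx/1.4"]))
```

Replacing the silent errback with one that logs the failure (as Update 3 starts to do) is the quickest way to confirm or rule this out.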
python scanner.py > out.txt
181668it [1:47:36, 4.82it/s]
and counting the lines:
wc -l out.txt
36840 out.txt
scanner.py
import argparse
import json
import re  # needed by the email regex in cbBody; missing in the original
import socket
from sys import argv
from pprint import pformat

from tqdm import tqdm
from twisted.internet.task import react, cooperate
from twisted.internet.defer import gatherResults
from twisted.web.client import Agent, readBody
from twisted.web.http_headers import Headers
import lxml.html
from geoip import geolite2
import pycountry
from tld import get_tld

poweredby = ""
server = ""
ip = ""


def cbRequest(response, url):
    global poweredby, server, ip
    poweredby = response.headers.getRawHeaders("X-Powered-By")[0]
    server = response.headers.getRawHeaders("Server")[0]
    d = readBody(response)
    d.addCallback(cbBody, url)
    return d


def cbBody(body, ourl):
    global poweredby, server, ip
    html_element = lxml.html.fromstring(body)
    generator = html_element.xpath("//meta[@name='generator']/@content")
    ip = socket.gethostbyname(ourl)
    country = ""  # default, so a None lookup result leaves it defined
    try:
        match = geolite2.lookup(ip)
        if match is not None:
            country = match.country
            try:
                c = pycountry.countries.lookup(country)
                country = c.name
            except:
                country = ""
    except:
        country = ""
    try:
        res = get_tld("http://www" + ourl, as_object=True)
        tld = res.suffix
    except:
        tld = ""
    try:
        match = re.search(r'[\w\.-]+@[\w\.-]+', body)
        email = match.group(0)
    except:
        email = ""
    permalink = ourl.rstrip().replace(".", "-")
    try:
        item = generator[0]
        val = "{ \"Domain\":" + json.dumps(
            "http://" + ourl.rstrip()) + ",\"IP\":\"" + ip + "\",\"Server\":" + json.dumps(
            str(server)) + ",\"PoweredBy\":" + json.dumps(
            str(poweredby)) + ",\"MetaGenerator\":" + json.dumps(item) + ",\"Email\":" + json.dumps(
            email) + ",\"Suffix\":\"" + tld + "\",\"CountryHosted\":\"" + country + "\",\"permalink\":\"" + permalink + "\" }"
    except:
        val = "{ \"Domain\":" + json.dumps(
            "http://" + ourl.rstrip()) + ",\"IP\":\"" + ip + "\",\"Server\":" + json.dumps(
            str(server)) + ",\"PoweredBy\":" + json.dumps(
            str(poweredby)) + ",\"MetaGenerator\":\"\",\"Email\":" + json.dumps(
            email) + ",\"Suffix\":\"" + tld + "\",\"CountryHosted\":\"" + country + "\",\"permalink\":\"" + permalink + "\" }"
    print val


def main(reactor, url_path):
    urls = open(url_path)
    return mainjob(reactor, (url.strip() for url in urls))


def mainjob(reactor, urls=argv[2:]):
    agent = Agent(reactor)
    work = (process(agent, url) for url in tqdm(urls))
    tasks = list(cooperate(work) for i in range(100))
    return gatherResults(list(task.whenDone() for task in tasks))


def process(agent, url):
    d = agent.request(
        'GET', "http://" + url,
        Headers({'User-Agent': ['bot']}),
        None)
    d.addCallback(cbRequest, url)
    d.addErrback(lambda x: None)  # ignore errors
    return d


react(main, ["./domains.txt"])
Update 3:
Updated the code to print errors to errors.txt:
import argparse
import json
import re  # needed by the email regex in cbBody; missing in the original
import socket
from sys import argv
from pprint import pformat

from tqdm import tqdm
from twisted.internet.task import react, cooperate
from twisted.internet.defer import gatherResults
from twisted.web.client import Agent, readBody
from twisted.web.http_headers import Headers
import lxml.html
from geoip import geolite2
import pycountry
from tld import get_tld

poweredby = ""
server = ""
ip = ""

f = open("errors.txt", "w")


def error(response, url):
    f.write("Error: " + url + "\n")


def cbRequest(response, url):
    global poweredby, server, ip
    poweredby = response.headers.getRawHeaders("X-Powered-By")[0]
    server = response.headers.getRawHeaders("Server")[0]
    d = readBody(response)
    d.addCallback(cbBody, url)
    return d


def cbBody(body, ourl):
    global poweredby, server, ip
    html_element = lxml.html.fromstring(body)
    generator = html_element.xpath("//meta[@name='generator']/@content")
    ip = socket.gethostbyname(ourl)
    country = ""  # default, so a None lookup result leaves it defined
    try:
        match = geolite2.lookup(ip)
        if match is not None:
            country = match.country
            try:
                c = pycountry.countries.lookup(country)
                country = c.name
            except:
                country = ""
    except:
        country = ""
    try:
        res = get_tld("http://www" + ourl, as_object=True)
        tld = res.suffix
    except:
        tld = ""
    try:
        match = re.search(r'[\w\.-]+@[\w\.-]+', body)
        email = match.group(0)
    except:
        email = ""
    permalink = ourl.rstrip().replace(".", "-")
    try:
        item = generator[0]
        val = "{ \"Domain\":" + json.dumps(
            "http://" + ourl.rstrip()) + ",\"IP\":\"" + ip + "\",\"Server\":" + json.dumps(
            str(server)) + ",\"PoweredBy\":" + json.dumps(
            str(poweredby)) + ",\"MetaGenerator\":" + json.dumps(item) + ",\"Email\":" + json.dumps(
            email) + ",\"Suffix\":\"" + tld + "\",\"CountryHosted\":\"" + country + "\",\"permalink\":\"" + permalink + "\" }"
    except:
        val = "{ \"Domain\":" + json.dumps(
            "http://" + ourl.rstrip()) + ",\"IP\":\"" + ip + "\",\"Server\":" + json.dumps(
            str(server)) + ",\"PoweredBy\":" + json.dumps(
            str(poweredby)) + ",\"MetaGenerator\":\"\",\"Email\":" + json.dumps(
            email) + ",\"Suffix\":\"" + tld + "\",\"CountryHosted\":\"" + country + "\",\"permalink\":\"" + permalink + "\" }"
    print val


def main(reactor, url_path):
    urls = open(url_path)
    return mainjob(reactor, (url.strip() for url in urls))


def mainjob(reactor, urls=argv[2:]):
    agent = Agent(reactor)
    work = (process(agent, url) for url in tqdm(urls))
    tasks = list(cooperate(work) for i in range(100))
    return gatherResults(list(task.whenDone() for task in tasks))


def process(agent, url):
    d = agent.request(
        'GET', "http://" + url,
        Headers({'User-Agent': ['crawler']}),
        None)
    d.addCallback(cbRequest, url)
    d.addErrback(error, url)
    return d


react(main, ["./domains.txt"])
f.close()
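A note on the `error` errback above: its first argument is a Twisted `Failure`, not a response, and writing only the URL discards the reason for the failure. Logging the failure itself would show why google.al and fau.edu.al end up in errors.txt. A stdlib-only sketch of the idea (`sink` is a hypothetical stand-in for the file handle; with Twisted, `str(failure)` or `failure.getErrorMessage()` gives the reason):

```python
from io import StringIO

def error(failure, url, sink):
    # Record the URL *and* the reason; the original errback logs only
    # the URL, hiding what actually went wrong.
    sink.write("Error: %s: %s\n" % (url, failure))

log = StringIO()
error(ValueError("simulated DNS failure"), "google.al", log)
print(log.getvalue())
```

With the reason included, a traffic capture like the one in Update 4 is often unnecessary; the log alone distinguishes DNS errors, timeouts, and exceptions raised inside the callbacks.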
Update 4:
I captured the traffic with Wireshark, using just the 2 domains that errored previously:
user@laptop:~/crawler$ python scanner.py
2it [00:00, 840.71it/s]
user@laptop:~/crawler$ cat errors.txt
Error: google.al
Error: fau.edu.al
As you can see they had errors, but in Wireshark I can see the response:
You need to add a limit to the amount of concurrency your program creates. Currently, you process all URLs given at the same time - or try to, at least:
def mainjob(reactor, urls=urls):
    for url in tqdm(urls):
        agent = Agent(reactor)
        d = agent.request(
            'GET', "http://" + url,
            Headers({'User-Agent': ['bot']}),
            None)
        d.addCallback(cbRequest, url)
        d.addErrback(lambda x: None)  # ignore errors
    return d
This issues a request for each URL without waiting for any of them to complete. Instead, use twisted.internet.task.cooperate to run a limited number at a time. This runs one request at a time:
def mainjob(reactor, urls):
    agent = Agent(reactor)
    work = (process(agent, url) for url in tqdm(urls))
    task = cooperate(work)
    return task.whenDone()


def process(agent, url):
    d = agent.request(
        'GET', "http://" + url,
        Headers({'User-Agent': ['bot']}),
        None)
    d.addCallback(cbRequest, url)
    d.addErrback(lambda x: None)  # ignore errors
    return d
You probably want more concurrency than that. So, call cooperate() a few more times:
def mainjob(reactor, urls=urls):
    agent = Agent(reactor)
    work = (process(agent, url) for url in tqdm(urls))
    tasks = list(cooperate(work) for i in range(100))
    return gatherResults(list(task.whenDone() for task in tasks))
This runs up to 100 requests at a time. Each task pulls the next element from work and waits on it. gatherResults waits for all 100 tasks to finish.
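The key point is that all 100 tasks share one generator: each task advances the same iterator, so every URL is taken exactly once and at most 100 requests are in flight. A synchronous analogy (plain Python, no Twisted; the round-robin loop is a stand-in for the reactor scheduling the tasks):

```python
def urls_to_fetch(n):
    # Stand-in for the shared `work` generator: one stream of jobs.
    for i in range(n):
        yield "domain%d.example" % i

work = urls_to_fetch(10)
pulled = {"task-a": [], "task-b": [], "task-c": []}

# Three "tasks" pull from the *same* iterator, the way each cooperate()
# task pulls from the shared generator: every job is taken exactly once.
exhausted = False
while not exhausted:
    for name in pulled:
        job = next(work, None)
        if job is None:
            exhausted = True
            break
        pulled[name].append(job)

print(pulled)
```

No job appears in two task lists, which is why the pattern needs no locking or explicit work partitioning.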
Now just avoid loading the complete input into memory at a time:
def main(reactor, url_path):
    urls = open(url_path)
    return mainjob(reactor, (url.strip() for url in urls))


react(main, ["path-to-urls.txt"])
This opens the URL file but only reads lines from it as they're needed.
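The laziness can be demonstrated with the stdlib alone (tempfile here only provides a throwaway input file): iterating an open file object, as the answer's generator expression does, reads one line at a time, whereas f.read() would materialize all 12 million lines at once.

```python
import os
import tempfile

# Throwaway input file standing in for domains.txt.
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as handle:
    handle.write("example.com\nexample.org\nexample.net\n")

with open(path) as f:
    urls = (url.strip() for url in f)  # lazy: no line has been read yet
    first = next(urls)                 # reads just the first line
    rest = list(urls)                  # reads the remainder on demand

os.remove(path)
print(first, rest)
```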