Python Scrapy gets stuck after 40k requests
My Scrapy spider is getting stuck after 40k requests. I am new to Scrapy. Searching around, I wonder if the problem has to do with using the default parse method name and start_urls.
I am using custom_settings to speed things up: if a URL doesn't resolve within a few seconds, move on and don't retry.
Spider code:
import scrapy, re, pandas as pd, os, logging
from scrapy.spiders import CrawlSpider, Rule
from ..items import SfscrapeItem
from scrapy.utils.log import configure_logging

dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, '../resources/scraping_urls_1.1.csv')
df = pd.read_csv(filename, index_col=0)

keywords = ['canidae', 'felidae', 'cat', 'cattle', 'dog', 'donkey', 'goat', 'guinea pig', 'horse', 'pig', 'rabbit']

class sfSpider(scrapy.Spider):
    name = 'sfspider'

    custom_settings = {
        'DNS_TIMEOUT': 10,
        'DOWNLOAD_TIMEOUT': 10,
        'RETRY_ENABLED': False,
        'REDIRECT_MAX_TIMES': 2,
    }

    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename='log.txt',
        format='%(levelname)s: %(message)s',
        level=logging.INFO
    )

    # just the first 75k for now...
    start_urls = df.url.to_list()[:75000]

    def parse(self, response):
        response_body = response.body.decode('utf-8')
        url = response.url
        domain = url.split('/')[2]
        item = SfscrapeItem()
        item['url'] = url
        item['domain'] = domain
        item['status'] = response.status
        item['matches'] = [str(len(re.findall(keyword, response_body, re.IGNORECASE))) for keyword in keywords]
        yield item
Here is the log at the point when it is stuck:
INFO: Crawled 40940 pages (at 565 pages/min), scraped 16473 items (at 243 items/min)
INFO: Crawled 40940 pages (at 0 pages/min), scraped 16473 items (at 0 items/min)
INFO: Crawled 40940 pages (at 0 pages/min), scraped 16473 items (at 0 items/min)
INFO: Crawled 40940 pages (at 0 pages/min), scraped 16473 items (at 0 items/min)
INFO: Crawled 40940 pages (at 0 pages/min), scraped 16473 items (at 0 items/min)
INFO: Crawled 40940 pages (at 0 pages/min), scraped 16473 items (at 0 items/min)
And here is the output of stats.get_stats() and prefs() after it is stuck.
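For reference, both of those are available live through Scrapy's telnet console while the crawl is running, which makes it possible to inspect a "stuck" crawl without restarting it. A minimal session, assuming the default telnet port 6023 (recent Scrapy versions also require the username/password printed in the log at startup):

    $ telnet localhost 6023
    >>> est()               # engine status report: pending requests, active downloads, slots
    >>> prefs()             # live object counts (trackref), useful for spotting leaks
    >>> stats.get_stats()   # stats collector counters, including downloader exception counts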
It looks like you started to receive DNSLookupError exceptions: the stats show downloader/exception_type_count/twisted.internet.DNSLookupError: 5109, which lines up with the minutes logged as ...pages (at 0 pages/min), scraped x items (at 0 items/min).
The LogStats extension (the module that prints INFO: Crawled X pages (at X pages/min), scraped X items (at X items/min) to the log) counts only received responses; it does not count received exceptions. So a crawl whose requests are all failing looks stuck in the log even though the downloader is still working through the queue.
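If you want those failures to show up in the log instead of silently accumulating in the exception counters, you can attach an errback to each request, following the errback pattern from the Scrapy docs. A minimal sketch assuming the spider from the question; the start_requests override and the on_error name are illustrative additions, not part of the original code:

    import scrapy
    from twisted.internet.error import DNSLookupError

    class sfSpider(scrapy.Spider):
        name = 'sfspider'
        # ... custom_settings, start_urls, and parse() as in the question ...

        def start_requests(self):
            # Build the requests explicitly so an errback can be attached;
            # requests generated from start_urls alone have no errback.
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

        def on_error(self, failure):
            # Called for any download failure; log DNS errors explicitly
            # so the "0 pages/min" periods become visible in the log.
            if failure.check(DNSLookupError):
                self.logger.info('DNS lookup failed: %s', failure.request.url)
            else:
                self.logger.info('Request failed: %s (%s)', failure.request.url, failure.type)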