Go to next page on showthread.php with scrapy
I'm new to scrapy. For about 4 days I've been stuck on going to the next page when fetching showthread.php (a forum based on vBulletin).
My target: http://forum.femaledaily.com/showthread.php?359-Hair-Smoothing
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from femaledaily.items import FemaledailyItem

class Femaledaily(scrapy.Spider):
    name = "femaledaily"
    allowed_domains = ["femaledaily.com"]
    start_urls = [
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care",
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care/page2",
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care/page3",
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care/page4",
    ]

    def parse(self, response):
        for thd in response.css("tbody > tr "):
            print "==========NEW THREAD======"
            url = thd.xpath('.//div[@class="threadlist-title"]/a/@href').extract()
            url[0] = "http://forum.femaledaily.com/"+url[0]
            print url[0]
            yield scrapy.Request(url[0], callback=self.parse_thread)

    def parse_thread(self, response):
        for page in response.xpath('//ol[@id="posts"]/li'):
            item = FemaledailyItem()
            item['thread_title'] = response.selector.xpath('//span[@class="threadtitle"]/a/text()').extract()
            # item['thread_starter'] = response.selector.xpath('//div[@class="username_container"]/a/text()').extract_first()
            post_creator = page.xpath('.//div[@class="username_container"]/a/text()').extract()
            if not post_creator:
                item['post_creator'] = page.xpath('.//div[@class="username_container"]/a/span/text()').extract()
            else:
                item['post_creator'] = post_creator
            item['post_content'] = ""
            cot = page.xpath(".//blockquote[@class='postcontent restore ']/text()").extract()
            for ct in cot:
                item['post_content'] += ct.replace('\t','').replace('\n','')
            yield item
I'm able to get the first 10 posts of every thread, but I can't work out how to go to the next page. Any ideas?
I made a slight change to your code so that it paginates properly:
import scrapy
from femaledaily.items import FemaledailyItem

class Femaledaily(scrapy.Spider):
    name = "femaledaily"
    allowed_domains = ["femaledaily.com"]
    BASE_URL = "http://forum.femaledaily.com/"
    start_urls = [
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care",
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care/page2",
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care/page3",
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care/page4",
    ]

    def parse(self, response):
        # One request per thread listed on this forum page
        for thd in response.css("tbody > tr "):
            print("==========NEW THREAD======")
            url = thd.xpath('.//div[@class="threadlist-title"]/a/@href').extract()
            url = self.BASE_URL + url[0]
            yield scrapy.Request(url, callback=self.parse_thread)

        # pagination: follow the rel="next" link of the thread list, if present
        next_page = response.xpath('//li[@class="prev_next"]/a[@rel="next"]/@href').extract()
        if next_page:
            yield scrapy.Request(self.BASE_URL + next_page[0], callback=self.parse)

    def parse_thread(self, response):
        # One item per post on the current page of the thread
        for page in response.xpath('//ol[@id="posts"]/li'):
            item = FemaledailyItem()
            item['thread_title'] = response.selector.xpath('//span[@class="threadtitle"]/a/text()').extract()
            # item['thread_starter'] = response.selector.xpath('//div[@class="username_container"]/a/text()').extract_first()
            post_creator = page.xpath('.//div[@class="username_container"]/a/text()').extract()
            if not post_creator:
                item['post_creator'] = page.xpath('.//div[@class="username_container"]/a/span/text()').extract()
            else:
                item['post_creator'] = post_creator
            item['post_content'] = ""
            cot = page.xpath(".//blockquote[@class='postcontent restore ']/text()").extract()
            for ct in cot:
                item['post_content'] += ct.replace('\t','').replace('\n','')
            yield item

        # pagination: request the next page of this thread with the same callback
        next_page = response.xpath('//li[@class="prev_next"]/a[@rel="next"]/@href').extract()
        if next_page:
            yield scrapy.Request(self.BASE_URL + next_page[0], callback=self.parse_thread)
Here, first extract the next page's link (i.e., the single forward arrow), then issue a request to that next-page URL with the callback set to the same function the request is made from. On the last page there is no next-page link any more, so no further request is made and the crawl halts.
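To see the "follow the next link until it disappears" idea without running a crawl, here is a minimal standard-library sketch. It does not use Scrapy at all: FAKE_SITE, its post strings, and crawl_thread are hypothetical stand-ins made up for illustration, and the page2/page3 URLs are assumed to follow the same pattern as the forumdisplay.php URLs in the question.

```python
from urllib.parse import urljoin

BASE_URL = "http://forum.femaledaily.com/"

# Hypothetical stand-in for the forum (not real data): each URL maps to the
# posts on that page and the relative href of its rel="next" link.
# The last page has no next link, represented here by None.
FAKE_SITE = {
    BASE_URL + "showthread.php?359-Hair-Smoothing": (
        ["post 1", "post 2"], "showthread.php?359-Hair-Smoothing/page2"),
    BASE_URL + "showthread.php?359-Hair-Smoothing/page2": (
        ["post 3", "post 4"], "showthread.php?359-Hair-Smoothing/page3"),
    BASE_URL + "showthread.php?359-Hair-Smoothing/page3": (
        ["post 5"], None),
}


def crawl_thread(start_url, site=FAKE_SITE):
    """Yield (url, post) pairs, following the next link until there is none."""
    url = start_url
    while url is not None:
        posts, next_href = site[url]
        for post in posts:
            yield url, post
        # Resolve the relative href against the current page URL;
        # stop when the page offers no next link.
        url = urljoin(url, next_href) if next_href else None


if __name__ == "__main__":
    for page_url, post in crawl_thread(BASE_URL + "showthread.php?359-Hair-Smoothing"):
        print(page_url, post)
```

In the spider, the while loop corresponds to parse_thread yielding a new request with itself as the callback. Joining the href with urljoin instead of plain string concatenation also copes with hrefs that are already absolute.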