
Scrapy: re-crawling with the same rule

I am using Scrapy to extract reviews from TripAdvisor.

My start_urls are hotel pages, for example:

http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS.

From these pages I crawl to the review pages, using these rules:

rules = (
    # from the hotel page, follow links into the individual review pages
    Rule(SgmlLinkExtractor(allow=("ShowUserReviews-g.*",), restrict_xpaths=('//*[@id="REVIEWS"]/div[4]/div/div[2]/div/div/div[1]/a',), unique=True), callback='parse_item', follow=True),

    # from a review page, follow the pagination links in the "deckTools btm" block
    Rule(SgmlLinkExtractor(allow=("ShowUserReviews-g.*",), restrict_xpaths=('//*[@id="REVIEWS"]/div[contains(@class,"deckTools btm")]',), unique=True), callback='parse_item', follow=True),
)

An example of a reviews page:

http://www.tripadvisor.com/ShowUserReviews-g187514-d228523-r275442835-Hotel_Petit_Palace_Arturo_Soria-Madrid.html#REVIEWS

At the end of every reviews page there are links to the next review pages for this hotel, numbered 1, 2, 3, 4, ... I think I can use the same rule, since the next addresses look similar.

See this screenshot:

http://s16.postimg.org/w68m82ouc/Screenshot_from_2015_07_02_12_36_03.jpg

My questions:

  1. How does rule-based crawling work? Can the scraper re-crawl to the next review pages with the same rule, or do I need something else?

  2. How can I avoid crawling review pages I have already seen? For example, crawling from page 3 back to pages 1 and 2?

Thanks.

Filter for the "Next" link in the Rule that follows the next pages. This avoids visiting review pages that have already been visited.

Rule(SgmlLinkExtractor(allow=("ShowUserReviews-g.*",), restrict_xpaths=('//*[@id="REVIEWS"]/div[4]/div/div[2]/div/div/div[1]/a[text() = "Next"]',), unique=True), callback='parse_item', follow=True)
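
For reference, here is a minimal sketch of how that rule could sit inside a complete CrawlSpider, using the same Scrapy 0.x-style SgmlLinkExtractor as the question. The spider name and the empty parse_item body are placeholders and not from the question; the first rule is the question's hotel-page rule, the second is the "Next"-filtered rule above.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TripAdvisorReviewsSpider(CrawlSpider):
    # hypothetical name and start_urls, only to make the sketch self-contained
    name = 'tripadvisor_reviews'
    start_urls = [
        'http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS',
    ]

    rules = (
        # hotel page -> individual review pages (first rule from the question)
        Rule(SgmlLinkExtractor(allow=("ShowUserReviews-g.*",),
                               restrict_xpaths=('//*[@id="REVIEWS"]/div[4]/div/div[2]/div/div/div[1]/a',),
                               unique=True),
             callback='parse_item', follow=True),
        # review page -> next review page, restricted to the "Next" link only
        Rule(SgmlLinkExtractor(allow=("ShowUserReviews-g.*",),
                               restrict_xpaths=('//*[@id="REVIEWS"]/div[4]/div/div[2]/div/div/div[1]/a[text() = "Next"]',),
                               unique=True),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # extract the review fields here; the question is about the crawl rules,
        # so item extraction is left out of this sketch
        pass

Because follow=True, the rules are re-applied to every page they reach, so the "Next" rule walks pages 2, 3, 4, ... on its own. Scrapy's scheduler also filters duplicate requests by default (RFPDupeFilter), so a link from page 3 back to pages 1 and 2 is dropped rather than crawled again, which covers question 2 even if the rules overlap.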
