
Scrapy: re-crawling with the same rule

I am using Scrapy to extract reviews from TripAdvisor.

My start_urls are hotel pages, for example:

http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS.

From these pages I crawl to the review pages, using these rules:

rules = (
    # from the hotel page, follow links into the individual review pages
    Rule(SgmlLinkExtractor(allow=("ShowUserReviews-g.*",), restrict_xpaths=('//*[@id="REVIEWS"]/div[4]/div/div[2]/div/div/div[1]/a',), unique=True), callback='parse_item', follow=True),

    # from a review page, follow the pagination links in the "deckTools btm" block
    Rule(SgmlLinkExtractor(allow=("ShowUserReviews-g.*",), restrict_xpaths=('//*[@id="REVIEWS"]/div[contains(@class,"deckTools btm")]',), unique=True), callback='parse_item', follow=True),
)

An example of a reviews page:

http://www.tripadvisor.com/ShowUserReviews-g187514-d228523-r275442835-Hotel_Petit_Palace_Arturo_Soria-Madrid.html#REVIEWS

At the end of every reviews page there are links to the next review pages for this hotel, numbered 1, 2, 3, 4, ... I think I can use the same rule, since the next addresses look similar.

See this screenshot:

http://s16.postimg.org/w68m82ouc/Screenshot_from_2015_07_02_12_36_03.jpg

My questions:

  1. How does rule-based crawling work? Can the scraper re-crawl to the next review pages with the same rule, or do I need something else?

  2. How can I avoid crawling review pages I have already seen? For example, crawling from page 3 back to pages 1 and 2?

Thanks.

Filter for the "Next" link in the Rule that follows the next pages. This avoids visiting review pages that have already been visited.

Rule(SgmlLinkExtractor(allow=("ShowUserReviews-g.*",), restrict_xpaths=('//*[@id="REVIEWS"]/div[4]/div/div[2]/div/div/div[1]/a[text() = "Next"]',), unique=True), callback='parse_item', follow=True)
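
For reference, here is a minimal sketch of how that rule could sit inside a complete CrawlSpider, using the same Scrapy 0.x-style SgmlLinkExtractor as the question. The spider name and the empty parse_item body are placeholders and not from the question; the first rule is the question's hotel-page rule, the second is the "Next"-filtered rule above.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TripAdvisorReviewsSpider(CrawlSpider):
    # hypothetical name and start_urls, only to make the sketch self-contained
    name = 'tripadvisor_reviews'
    start_urls = [
        'http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS',
    ]

    rules = (
        # hotel page -> individual review pages (first rule from the question)
        Rule(SgmlLinkExtractor(allow=("ShowUserReviews-g.*",),
                               restrict_xpaths=('//*[@id="REVIEWS"]/div[4]/div/div[2]/div/div/div[1]/a',),
                               unique=True),
             callback='parse_item', follow=True),
        # review page -> next review page, restricted to the "Next" link only
        Rule(SgmlLinkExtractor(allow=("ShowUserReviews-g.*",),
                               restrict_xpaths=('//*[@id="REVIEWS"]/div[4]/div/div[2]/div/div/div[1]/a[text() = "Next"]',),
                               unique=True),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # extract the review fields here; the question is about the crawl rules,
        # so item extraction is left out of this sketch
        pass

Because follow=True, the rules are re-applied to every page they reach, so the "Next" rule walks pages 2, 3, 4, ... on its own. Scrapy's scheduler also filters duplicate requests by default (RFPDupeFilter), so a link from page 3 back to pages 1 and 2 is dropped rather than crawled again, which covers question 2 even if the rules overlap.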
