Data extraction for user reviews

I am trying to learn R out of personal interest; I am neither a coder nor an analyst. I want to extract user reviews from Trip Advisor. A single page has 10 reviews, but with the code below I am also getting unwanted reviews/lines. I am not sure whether I am using the correct HTML node. Moreover, I want to extract each user's full review, but I end up with only partial reviews. Can you please help me extract the 10 full user reviews? Thank you so much for your help.

  dat <- readLines("http://www.tripadvisor.in/Hotel_Review-g60763-d93450-Reviews-Grand_Hyatt_New_York-New_York_City_New_York.html", warn=FALSE)
  raw2 <- htmlTreeParse(dat, useInternalNodes = TRUE)
  ##User Review
  plain.text <- xpathSApply(raw2, "//div[@class='col2of2']//p[@class='partial_entry']", xmlValue)
  UR <-gsub("\\\n","",plain.text)
  Result <- unlist(UR)
  Result

This is much more an exercise in web-scraping than in R programming.

In R, I prefer the httr package to grab the http response and extract the content as parsed html. Using readLines(...) is just about the worst way to do this. The code below extracts the review summaries.

library(httr)
library(XML)
url <- "http://www.tripadvisor.in/Hotel_Review-g60763-d93450-Reviews-Grand_Hyatt_New_York-New_York_City_New_York.html"
response <- GET(url)
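# Recent httr versions return an xml2 document from content(type="text/html"),
# which xpathSApply() cannot query, so parse the raw text with XML::htmlParse().
doc      <- htmlParse(content(response,as="text",encoding="UTF-8"),asText=TRUE)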
smry     <- xpathSApply(doc,'//div[@class="entry"]/p[@class="partial_entry"]',xmlValue)
length(smry)
# [1] 10
smry[1]
# [1] "\nThats all that matters really...I wonder if anyone would chose this hotel for any other factor at all...located right next to Grand central station in midtown and within walking distance of many tourist attractions, top restaurants and corp offices. Stayed 3 nights here on a business trip, I chose this hotel over others purely based on its location. Price is...\n\n\nMore  \n\n"
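
The summaries come back padded with newlines and with the text of the "More" link appended. A minimal sketch of one way to tidy them, using only base R; the helper name `clean_summary` and the shortened sample string are hypothetical and only shaped like `smry[1]` above:

```r
# Hypothetical sample shaped like smry[1] (text shortened for illustration).
raw <- "\nThats all that matters really... Price is...\n\n\nMore  \n\n"

# Hypothetical helper, not part of httr or XML.
clean_summary <- function(x) {
  x <- sub("\\s*More\\s*$", "", x)  # drop the trailing "More" link text
  x <- gsub("\\s+", " ", x)         # collapse newlines and repeated spaces
  trimws(x)                         # strip leading/trailing whitespace
}

clean_summary(raw)
```

Note that this would also strip a genuine final word "More" from a review, so it is only a rough heuristic.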

Getting the full reviews is more complicated, because it involves clicking on the "More" button. So you need to determine which http requests are fired when you click the "More" link on a review. You can do this using the Network Monitor tab in Firefox's developer tools (or many other tools, I'm sure). It turns out that this is a link of the form:

http://www.tripadvisor.com/ExpandedUserReviews-g{xxx}-d{yyy}?querystring

where {xxx} and {yyy} are unique to the hotel and are the same as in the original url, and querystring is fully identified in the Network Monitor tool. So we form a new http request with that url and the appropriate query string and parse the result, as below.

cls   <- doc['//div[@class="entry"]//span[contains(@class,"moreLink")]/@class']
xr.refno <- sapply(cls,function(x)sub(".*\\str(\\d+)\\s.*","\\1",x))
code     <- sub(".*Hotel_Review(\\-g\\d+\\-d\\d+)\\-Reviews.*","\\1",url)
xr.url   <- paste0("http://www.tripadvisor.com/ExpandedUserReviews",code)
xr.response <- GET(xr.url,query=list(target=xr.refno[1],
                                     context=1,
                                     reviews=paste(xr.refno,collapse=","),
                                     servlet="Hotel_Review",
                                     expand=1))
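# As above: parse the response text with XML::htmlParse() so xpathSApply() can query it.
xr.doc   <- htmlParse(content(xr.response,as="text",encoding="UTF-8"),asText=TRUE)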
xr.full  <- xpathSApply(xr.doc,'//div[@class="entry"]/p',xmlValue)
length(xr.full)
# [1] 6
xr.full[1]
# [1] "\nThats all that matters really...I wonder if anyone would chose this hotel for any other factor at all...located right next to Grand central station in midtown and within walking distance of many tourist attractions, top restaurants and corp offices. Stayed 3 nights here on a business trip, I chose this hotel over others purely based on its location. Price is about average in NYC I think. Asked for a room with a good view and was given a 2 BR on the 30th floor. After checking in I realized there may not be the kind of view that I expected at all from any room in this hotel - due to it being surrounded by high rises in all directions. However, no other complaints as such - except may that the bathroom was a bit too cramped. That I guess is the norm in NYC. I would stay here again if it was a business visit based on the location. Faster than avg wifi (free) was a good plus.\n"
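
To see what the two `sub()` calls above are doing, here is a small offline sketch in base R. The class attribute value is a hypothetical example of what a "More" link might carry (the real value on the page may differ); the url is the one from the question.

```r
# Hypothetical class attribute of a "More" link; the real page may differ.
cls_example <- "taLnk hvrIE6 tr248067974 moreLink ulBlueLinks"

# Keep only the digits that follow "tr": this is the review reference number.
ref_no <- sub(".*\\str(\\d+)\\s.*", "\\1", cls_example)
ref_no

# Pull the "-g{xxx}-d{yyy}" part out of the hotel url and build the expanded-review url.
url  <- "http://www.tripadvisor.in/Hotel_Review-g60763-d93450-Reviews-Grand_Hyatt_New_York-New_York_City_New_York.html"
code <- sub(".*Hotel_Review(\\-g\\d+\\-d\\d+)\\-Reviews.*", "\\1", url)
xr_url <- paste0("http://www.tripadvisor.com/ExpandedUserReviews", code)
xr_url
```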

There is one more nuance/problem. Notice that there are only 6 "Expanded Reviews". This is because short reviews, which fit in the "Partial Review" format, do not have a "More" button. So you'd need to figure out which of the partial reviews are in fact full. Since you say you're learning R, I'll leave that to you...
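
One possible starting point for that last step, as a rough sketch rather than a tested solution: treat a summary as truncated when its text ends with the "More" link, and fill those slots with the expanded reviews. This assumes the expanded reviews come back in the same order as the truncated summaries, which is not verified here. The vectors below are hypothetical stand-ins for `smry` and `xr.full`.

```r
# Hypothetical stand-ins for smry and xr.full (real data would come from the requests above).
smry <- c("\nGreat location, but the room was...\n\n\nMore  \n\n",
          "\nShort and sweet review.\n",
          "\nWe stayed four nights and...\n\n\nMore  \n\n")
xr.full <- c("\nGreat location, but the room was small and noisy.\n",
             "\nWe stayed four nights and enjoyed every one.\n")

# A summary is assumed to be truncated if its text ends with the "More" link.
truncated <- grepl("More\\s*$", smry)

# Guard: the counts must agree, otherwise the positional matching is meaningless.
stopifnot(sum(truncated) == length(xr.full))

# Assumption: expanded reviews are returned in the same order as the truncated summaries.
reviews <- smry
reviews[truncated] <- xr.full

# Tidy whitespace so every entry is a single clean line.
reviews <- trimws(gsub("\\s+", " ", reviews))
reviews
```

Matching on the review reference numbers instead of on position would be more robust, if the expanded response exposes them.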
