
Screen scraping TripAdvisor with a POST request

I'm trying to scrape TripAdvisor. Suppose I want to scrape the bad reviews for this particular hotel:

http://www.tripadvisor.com/Hotel_Review-g31441-d224344-Reviews-Hilton_Garden_Inn_Bentonville-Bentonville_Arkansas.html#REVIEWS

I only want the "Terrible" category, and this filtering appears to be controlled by an HTML form. I originally wanted to submit the form with br.submit() from the mechanize module, but later found out that mechanize doesn't support JavaScript. So I'm hoping to send the POST request myself and bypass the JavaScript.

But when I use mechanize to inspect the relevant controls, all the radio buttons have the same value. Here's my code:

import mechanize

br = mechanize.Browser()
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open("http://www.tripadvisor.com/Hotel_Review-g31441-d224344-Reviews-Hilton_Garden_Inn_Bentonville-Bentonville_Arkansas.html#REVIEWS")

# Dump every form on the page together with its controls
for f in br.forms():
    print(f)

Here's the relevant form and controls within it:

 <POST http://www.tripadvisor.com/SortReviews#REVIEWS application/x-www-form-urlencoded
   <RadioControl(segRdo=[on, on, on, on, on])>
   <RadioControl(comRdo=[on, on, on, on, on])>
   <HiddenControl(returnTo=__2F__Hotel__5F__Review__2D__g31441__2D__d224344__2D__Reviews__2D__Hilton__5F__Garden__5F__Inn__5F__Bentonville__2D__Bentonville__5F__Arkansas__2E__html#REVIEWS) (readonly)>
   <HiddenControl(filterSegment=0) (readonly)>
   <HiddenControl(filterRating=1) (readonly)>>

So the rating seems to be controlled by the comRdo control, but the strange thing is that all of its radio buttons (one per category) have the same value, 'on'. Here are the control's properties before and after selecting one of the categories:

Before:

control_com=br.form.find_control("comRdo","radio")
print control_com.name,control_com.value,control_com.type
comRdo [] radio

After:

(br.form.find_control("comRdo","radio")).items[4].selected=True
print control_com.name,control_com.value,control_com.type
comRdo ['on'] radio

So after selecting the "Terrible" category, the control's value is 'on', which is exactly what it would be if I had selected any other category. When I printed out the items in the comRdo control, only the ids differed; every other property was the same:

<Item name='on' id='com1' id='com1' type='radio' class='radio' value='on' name='comRdo'>
<Item name='on' id='com2' id='com2' type='radio' class='radio' value='on' name='comRdo'>
...

So how does this work? How can the server tell which radio button I selected when all of them have the same value? I prepared the POST data and sent it in a request, and as expected it doesn't work: res has the same content as the page without any filtering or POST request.

form={"comRdo":"on"}
req=mechanize.Request("http://www.tripadvisor.com/Hotel_Review-g31441-d224344-Reviews-Hilton_Garden_Inn_Bentonville-Bentonville_Arkansas.html#REVIEWS",urllib.urlencode(form))
req.add_header('Content-Type','application/x-www-form-urlencoded')
cj.add_cookie_header(req)
res=mechanize.urlopen(req)

I've also tried the same code with other POST data:

form={"comRdo":["on","on","on","on","on","*on"]}

or

form={"filterSegment":"0","filterRating":"1"}

Could someone help me out with this? How does this page work with same-value radio buttons, and how can I filter the reviews programmatically? Thanks in advance!


Thanks to Slater Tyranus and Diadara, the following code worked!

# Assumes the requests and BeautifulSoup 4 (bs4) packages are installed
import requests
from bs4 import BeautifulSoup

form={"returnTo":"__2F__Hotel__5F__Review__2D__g31441__2D__d224344__2D__Reviews__2D__Hilton__5F__Garden__5F__Inn__5F__Bentonville__2D__Bentonville__5F__Arkansas__2E__html#REVIEWS","filterSegment":"0","filterRating":"1"}
url="http://www.tripadvisor.com/SortReviews#REVIEWS"
# requests form-encodes a dict passed as data= and sets the Content-Type header itself
r=requests.post(url,data=form)
soup=BeautifulSoup(r.content,"html.parser")

As the other answer points out, just look at the Network tab to see which request the browser actually makes. In this case the form has more than one field, and all of them are needed to produce the filtered page, so you should send all of these values:

comRdo:on
returnTo:__2F__Hotel__5F__Review__2D__g31441__2D__d224344__2D__Reviews__2D__Hilton__5F__Garden__5F__Inn__5F__Bentonville__2D__Bentonville__5F__Arkansas__2E__html#REVIEWS
filterSegment:0
filterRating:1

You will also find that you are submitting to the wrong URL; have a look at the form's action attribute or at Chrome's Network tab.

Open the Network tab, enable "Preserve log", click the link that produces your result, then look at the request to work out what you need to send.
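
The same information (the submit URL and the hidden fields) can also be read out of the page source. Below is a minimal Python 3 sketch using only the standard library. The markup in SAMPLE_HTML is hypothetical: it is reconstructed from the mechanize dump in the question, with a relative action URL and a placeholder returnTo value, so the real page may differ. FormFieldParser and extract_form are illustrative names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical, simplified markup reconstructed from the mechanize dump in the
# question. The returnTo value is a placeholder, not the real one.
SAMPLE_HTML = """
<form name="REVIEW_FILTER_FORM" method="post" action="/SortReviews#REVIEWS">
  <input type="radio" name="comRdo" id="com1" value="on">
  <input type="radio" name="comRdo" id="com2" value="on">
  <input type="hidden" name="returnTo" value="RETURN_TO_PLACEHOLDER">
  <input type="hidden" name="filterSegment" value="0">
  <input type="hidden" name="filterRating" value="0">
</form>
"""


class FormFieldParser(HTMLParser):
    """Record the first form's action and every hidden input in the markup."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type") == "hidden":
            self.hidden[attrs.get("name")] = attrs.get("value", "")


def extract_form(html, page_url):
    """Return (absolute submit URL, dict of hidden fields) for the markup."""
    parser = FormFieldParser()
    parser.feed(html)
    # The action may be relative, so resolve it against the page URL.
    return urljoin(page_url, parser.action), parser.hidden


submit_url, fields = extract_form(
    SAMPLE_HTML, "http://www.tripadvisor.com/Hotel_Review.html"
)
print(submit_url)
print(fields)
```

The point of the sketch is that the POST goes to the form's action (SortReviews), not to the hotel page URL, and that the hidden inputs carry the values the server cares about.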

If you want to know how a POST request for a site works in general, inspect the element in Google Chrome and switch over to the Network tab. You'll be able to see your POST request go through.

If you click on that POST request, you'll get the details of what you are actually sending in it.

On a lower level, once you inspect the element you'll notice it's embedded within another element that has the following attribute:

onclick="document.forms.REVIEW_FILTER_FORM.filterRating.value='1';document.forms.REVIEW_FILTER_FORM.submit();"

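This means you should start your search at that onclick handler, since that is what actually runs when you click the "Terrible" option: it sets the hidden filterRating field to 1 and then submits the form. The radio buttons' own values never matter, which is why they can all be 'on'.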
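
To make that concrete, here is a small Python 3 sketch that pulls the field name and value out of such a handler with a regular expression. The pattern only covers assignments written exactly in the style quoted above (document.forms.FORM.FIELD.value='...'); it is an illustration, not a general JavaScript parser, and fields_set_by_onclick is a made-up helper name.

```python
import re

# The onclick attribute quoted above, copied verbatim.
ONCLICK = (
    "document.forms.REVIEW_FILTER_FORM.filterRating.value='1';"
    "document.forms.REVIEW_FILTER_FORM.submit();"
)

# Matches assignments of the form document.forms.<form>.<field>.value='<value>'
ASSIGNMENT = re.compile(r"document\.forms\.(\w+)\.(\w+)\.value\s*=\s*'([^']*)'")


def fields_set_by_onclick(onclick):
    """Return {field_name: value} for every field assignment in the handler."""
    return {field: value for _form, field, value in ASSIGNMENT.findall(onclick)}


print(fields_set_by_onclick(ONCLICK))  # {'filterRating': '1'}
```

Whatever this returns is what has to be overridden in the form data before posting it.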

If all you're trying to do is get the data back, you don't need to use any kind of hefty scraping framework. Personally I would suggest using requests and lxml. In requests, the way you should send this post request is:

requests.post(url, data={"filterRating":1})
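
Note that the code the asker eventually got working sent returnTo and filterSegment along with filterRating, so it is probably safer to include all three. requests form-encodes a dict passed as data=; the Python 3 sketch below uses the standard library's urlencode to show roughly what that body looks like without touching the network. build_filter_payload is a hypothetical helper, the returnTo value is a placeholder, and only rating 1 ("Terrible") is confirmed in this thread; other rating values are an assumption.

```python
from urllib.parse import urlencode


def build_filter_payload(return_to, rating, segment="0"):
    """Return the form fields for a rating-filter POST.

    Only rating 1 ("Terrible") is confirmed by the thread; any other
    rating value is an assumption about how the site numbers categories.
    """
    return {
        "returnTo": return_to,
        "filterSegment": str(segment),
        "filterRating": str(rating),
    }


payload = build_filter_payload("RETURN_TO_PLACEHOLDER", 1)
body = urlencode(payload)
print(body)  # returnTo=RETURN_TO_PLACEHOLDER&filterSegment=0&filterRating=1
```

With requests installed, the same dict would be passed as requests.post(url, data=payload), where url is the form's action URL.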

If you actually want to deal with the JavaScript on the page, then you should use either Selenium or CasperJS for headless web browsing.
