
Can't scrape all HTML from Airbnb

I'm learning to scrape and am trying it out on Airbnb (here's the page). When I inspect one of the home images using Google Chrome, I see this: [image]

I can't get my script to return the HTML that represents the stuff pictured (e.g. the link to the listing). Initial attempt:

import requests    

url = "https://www.airbnb.co.uk/s/Rome/homes?checkin=2017-11-12&checkout=2017-11-19"
landing = requests.get(url)

print landing.content.find("rooms/")

That just returns -1 (i.e. rooms/ isn't in the HTML).

Then some research threw up ideas about 'headers', so that Airbnb doesn't know I'm a script (the code is copy/pasted as I don't really get what these headers do). Someone else suggested using urllib instead. So the latest attempt is:

from urllib2 import Request,urlopen

user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
headers = { 'User-Agent' : user_agent }

url = "https://www.airbnb.co.uk/s/Rome/homes?checkin=2017-11-12&checkout=2017-11-19"

req = Request(url,None,headers)
landing = urlopen(req)
print landing.read().find('rooms/')

This also returns -1.

Any ideas are much appreciated. I'm using Python 2.7 (Windows).

This happens because requests doesn't run JavaScript code. As a result, rooms/ never appears in the HTML you get back. You could use Selenium or Splash instead, both of which render the page with JavaScript before you read it.

If you open the page source and search for rooms/, you will find no results either.
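As a rough illustration of the Selenium route, here is a minimal Python 3 sketch. It assumes Selenium and a Chrome driver are installed, and that listing links contain "/rooms/" in their href (a guess based on the question, not something confirmed by Airbnb). The names `fetch_rendered_html`, `extract_room_links` and `RoomLinkParser` are hypothetical helpers made up for this example.

```python
from html.parser import HTMLParser


class RoomLinkParser(HTMLParser):
    """Collects href values of <a> tags whose href contains '/rooms/'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Assumption: listing links look like ".../rooms/<id>".
            if name == "href" and value and "/rooms/" in value:
                self.links.append(value)


def extract_room_links(html):
    """Return all '/rooms/' hrefs found in an HTML string."""
    parser = RoomLinkParser()
    parser.feed(html)
    return parser.links


def fetch_rendered_html(url, timeout=15):
    """Load the page in headless Chrome and return the HTML after JavaScript ran."""
    # Imported here so that extract_room_links works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one listing link has been inserted into the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'a[href*="/rooms/"]'))
        )
        return driver.page_source
    finally:
        driver.quit()


# Usage (needs a browser and network access):
# url = "https://www.airbnb.co.uk/s/Rome/homes?checkin=2017-11-12&checkout=2017-11-19"
# for link in extract_room_links(fetch_rendered_html(url)):
#     print(link)
```

The explicit wait matters here: reading `page_source` straight after `get()` may return the page before the listings have been added.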

This happens because the content is only loaded into your browser window by JavaScript after the initial request has finished. Basically, it comes down to the way Airbnb populates the DOM of its pages.

To scrape such pages, you will need more advanced tricks than simple requests, I'm afraid.

Two tips, if you're a beginner:

  • start by testing on simple websites, ideally static ones, if you can find any that interest you
  • don't go for Python 2. Python 3 has been out for a long time now, so it's best to get started with that right away.
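To make the Python 3 tip concrete, here is roughly how the second attempt from the question might look in Python 3. This is only a sketch: `find_marker` and `fetch` are hypothetical helper names, and the User-Agent string is simply the one from the question. Note that `read()` returns bytes in Python 3, so the body is decoded before searching.

```python
from urllib.request import Request, urlopen

# Same User-Agent string as in the question.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
)


def find_marker(body, marker="rooms/"):
    """Return the index of marker in the raw response bytes, or -1 if absent."""
    # The response body is bytes in Python 3; decode it before a str search.
    text = body.decode("utf-8", errors="replace")
    return text.find(marker)


def fetch(url):
    """Download the raw (un-rendered) HTML of a page."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req) as response:
        return response.read()


# Usage (needs network access):
# url = "https://www.airbnb.co.uk/s/Rome/homes?checkin=2017-11-12&checkout=2017-11-19"
# print(find_marker(fetch(url)))
```

Switching to Python 3 does not change the outcome by itself: if the listings are only added by JavaScript, this would still be expected to print -1.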

Good luck!
