
Parsing HTML using BeautifulSoup and Selenium in Python

I wanted to practice scraping with a real-world example (Airbnb) using BeautifulSoup and Selenium in Python. Specifically, my goal is to get all the listing (home) IDs within LA. My strategy is to open Chrome and go to the Airbnb website, where I have already manually searched for homes in LA, and start from there. For that part I decided to use Selenium. After that, I wanted to parse the HTML in the page source and find the listing IDs shown on the current page. Then I basically wanted to iterate through all the pages. Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver

option=webdriver.ChromeOptions()
option.add_argument("--incognito")

driver=webdriver.Chrome(executable_path="C:/Users/chromedriver.exe",chrome_options=option)

first_url="https://www.airbnb.com/s/Los-Angeles--CA--United-States/select_homes?refinement_paths%5B%5D=%2Fselect_homes&place_id=ChIJE9on3F3HwoAR9AhGJW_fL-I&children=0&guests=1&query=Los%20Angeles%2C%20CA%2C%20United%20States&click_referer=t%3ASEE_ALL%7Csid%3Afcf33cf1-61b8-41d5-bef1-fbc5d0570810%7Cst%3AHOME_GROUPING_SELECT_HOMES&superhost=false&title_type=SELECT_GROUPING&allow_override%5B%5D=&s_tag=tm-X8bVo"
n=3

for i in range(1,n+1):
    if (i==1):
        driver.get(first_url)
        print(first_url)
        #HTML parse using BS
        html =driver.page_source
        soup=BeautifulSoup(html,"html.parser")
        listings=soup.findAll("div",{"class":"_f21qs6"})

        #print out all the listing_ids within a current page
        for i in range(len(listings)):
            only_id= listings[i]['id']
            print(only_id[8:])

    after_first_url=first_url+"&section_offset=%d" % i
    print(after_first_url)
    driver.get(after_first_url)
    #HTML parse using BS
    html =driver.page_source
    soup=BeautifulSoup(html,"html.parser")
    listings=soup.findAll("div",{"class":"_f21qs6"})

    #print out all the listing_ids within a current page
    for i in range(len(listings)):
        only_id= listings[i]['id']
        print(only_id[8:])

Please bear with any inefficient code, since I'm a beginner; I put this together from several tutorials. Anyway, I think the code is correct, but the issue is that every time I run it, I get a different result. It loops over the pages, but sometimes it prints results for only some of them. For example, it loops over page 1 without giving any output, then loops over page 2 and gives results, but not for page 3. It's random: some pages produce results and others don't. On top of that, sometimes it loops through pages 1, 2, 3, ... in order, but sometimes it loops page 1, then jumps to the last page (17), and then comes back to page 2. I guess my code is not perfect since it gives unstable output. Has anyone had a similar experience, or could someone help me figure out what the problem is? Thanks.
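One thing worth checking (my own observation, not confirmed in the thread): the inner loop in the question's code reuses the outer loop variable `i`, so by the time `after_first_url` is built, `i` holds the index of the last parsed listing rather than the page number. That would produce exactly the "jumps to the last page" behavior described. A minimal sketch of the effect:

```python
# Demonstrates how an inner loop that reuses the outer loop
# variable corrupts the value used after the inner loop ends.
offsets = []
for i in range(1, 4):                 # intended page numbers: 1, 2, 3
    listings = ["a", "b", "c", "d"]   # stand-in for 4 parsed listings
    for i in range(len(listings)):    # rebinds i to 0..3
        pass
    offsets.append(i)                 # i is now 3, not the page number

print(offsets)  # [3, 3, 3] -- every "next page" URL gets the same offset
```

Renaming the inner loop variable (e.g. `for j in range(len(listings))`) keeps the page counter intact.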

Try the method below.

Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You would then load page_source into BeautifulSoup as follows:

In [8]: from bs4 import BeautifulSoup

In [9]: from selenium import webdriver

In [10]: driver = webdriver.Firefox()

In [11]: driver.get('http://news.ycombinator.com')

In [12]: html = driver.page_source

In [13]: soup = BeautifulSoup(html, "html.parser")

In [14]: for tag in soup.find_all('title'):
   ....:     print(tag.text)
   ....:     
Hacker News
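Applied to the question, the same soup can then be queried for the listing divs. Here is a self-contained sketch using a static HTML snippet in place of `driver.page_source` (the `_f21qs6` class name and the `[8:]` slice come from the question and may no longer match Airbnb's markup; the listing IDs are made up):

```python
from bs4 import BeautifulSoup

# Static HTML standing in for driver.page_source; the _f21qs6 class
# comes from the question, and the listing IDs here are invented.
html = '''
<div class="_f21qs6" id="listing_12345678"></div>
<div class="_f21qs6" id="listing_87654321"></div>
'''

soup = BeautifulSoup(html, "html.parser")
# Strip the 8-character "listing_" prefix, as in the question's only_id[8:]
listing_ids = [div["id"][8:] for div in soup.find_all("div", {"class": "_f21qs6"})]
print(listing_ids)  # ['12345678', '87654321']
```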
