
Parsing HTML using BeautifulSoup and Selenium in Python

I wanted to practice scraping with a real-world example (Airbnb) using BeautifulSoup and Selenium in Python. Specifically, my goal is to get all the listing (home) IDs within LA. My strategy is to open Chrome and go to the Airbnb page where I have already manually searched for homes in LA, and start from there; for this step I decided to use Selenium. After that, I want to parse the HTML source, find the listing IDs shown on the current page, and then iterate through all the pages. Here's my code:

from bs4 import BeautifulSoup
from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument("--incognito")

driver = webdriver.Chrome(executable_path="C:/Users/chromedriver.exe", chrome_options=option)

first_url="https://www.airbnb.com/s/Los-Angeles--CA--United-States/select_homes?refinement_paths%5B%5D=%2Fselect_homes&place_id=ChIJE9on3F3HwoAR9AhGJW_fL-I&children=0&guests=1&query=Los%20Angeles%2C%20CA%2C%20United%20States&click_referer=t%3ASEE_ALL%7Csid%3Afcf33cf1-61b8-41d5-bef1-fbc5d0570810%7Cst%3AHOME_GROUPING_SELECT_HOMES&superhost=false&title_type=SELECT_GROUPING&allow_override%5B%5D=&s_tag=tm-X8bVo"
n = 3

for i in range(1, n+1):
    if i == 1:
        driver.get(first_url)
        print(first_url)
        # parse the HTML with BeautifulSoup
        html = driver.page_source
        soup = BeautifulSoup(html, "html.parser")
        listings = soup.findAll("div", {"class": "_f21qs6"})

        # print all the listing IDs on the current page
        for i in range(len(listings)):
            only_id = listings[i]['id']
            print(only_id[8:])

    after_first_url = first_url + "&section_offset=%d" % i
    print(after_first_url)
    driver.get(after_first_url)
    # parse the HTML with BeautifulSoup
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    listings = soup.findAll("div", {"class": "_f21qs6"})

    # print all the listing IDs on the current page
    for i in range(len(listings)):
        only_id = listings[i]['id']
        print(only_id[8:])

If you find any inefficient code, please bear with me since I'm a beginner; I put this together from reading and watching multiple sources. Anyway, I think the code is correct, but the issue is that every time I run it I get a different result. It loops over the pages, but sometimes it only produces output for some of them: for example, it loads page 1 but prints nothing, then loads page 2 and prints results, then prints nothing for page 3. Which pages produce output seems random. On top of that, sometimes it visits pages 1, 2, 3, ... in order, but other times it visits page 1, jumps to the last page (17), and then comes back to page 2. I guess my code is not perfect since it gives unstable output. Has anyone had a similar experience, or could someone help me figure out what the problem is? Thanks.
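For what it's worth, the "page 1, then page 17, then back to page 2" order described above can be reproduced from the loop structure alone, because the inner listing loop reuses the outer loop's variable `i`. A minimal sketch (the 17-item list is a hypothetical stand-in for the listings parsed on page 1):

```python
# Minimal reproduction of the loop structure in the question: the inner
# loop rebinds `i`, so after page 1 is parsed, `i` is len(listings) - 1
# (here 16) when the next section_offset is built.
listings = ["listing"] * 17  # hypothetical: 17 listings found on page 1

offsets = []
for i in range(1, 4):
    if i == 1:
        for i in range(len(listings)):  # shadows the outer `i`
            pass
    offsets.append(i)  # offset used for the next driver.get()

print(offsets)  # [16, 2, 3] -- i.e. a late page first, then pages 2 and 3
```

Note that the outer `range` keeps yielding 2 and 3 regardless, since rebinding `i` does not affect the range iterator itself.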

Try the method below.

Assuming you are on the page you want to parse, Selenium stores the page's HTML in the driver's page_source attribute. You can then load page_source into BeautifulSoup as follows:

In [8]: from bs4 import BeautifulSoup

In [9]: from selenium import webdriver

In [10]: driver = webdriver.Firefox()

In [11]: driver.get('http://news.ycombinator.com')

In [12]: html = driver.page_source

In [13]: soup = BeautifulSoup(html, 'html.parser')

In [14]: for tag in soup.find_all('title'):
   ....:     print(tag.text)
   ....:
Hacker News
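Applied to the Airbnb markup from the question, the same page_source-plus-BeautifulSoup pattern would extract the listing IDs like this. This is a sketch on a static HTML snippet; the `_f21qs6` class and the 8-character `listing-` id prefix are taken from the question's code and may well have changed on the live site:

```python
from bs4 import BeautifulSoup

# Static HTML standing in for driver.page_source; the "_f21qs6" class
# and "listing-" id prefix mirror the markup assumed in the question.
html = """
<div class="_f21qs6" id="listing-1234567">Home A</div>
<div class="_f21qs6" id="listing-7654321">Home B</div>
"""

soup = BeautifulSoup(html, "html.parser")
listing_ids = [div["id"][len("listing-"):]  # strip the 8-char prefix
               for div in soup.find_all("div", {"class": "_f21qs6"})]
print(listing_ids)  # ['1234567', '7654321']
```

With a real page you would replace the `html` string with `driver.page_source` after each `driver.get()`.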
