
Can't extract the text and find all by BeautifulSoup

I want to extract all the available items under "Équipements", but I only get the first four items, followed by '+ Plus'.

import urllib2
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header with the request.
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
url = 'https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A'
req = urllib2.Request(url=url, headers=headers)
html = urllib2.urlopen(req)
# Parse the HTML exactly as the server returned it (no JavaScript is executed).
bsobj = BeautifulSoup(html.read(), 'lxml')
# Collect every <div class="row amenities"> block.
b = bsobj.findAll("div", {"class": "row amenities"})

The result b does not contain the full list inside the tag. Its last entry is '+ Plus', as shown below.

<span data-reactid=".mjeft4n4sg.0.0.0.0.1.8.1.0.0.$1.1.0.0">+ Plus</span></strong></a></div></div></div></div></div>]

This is because the data is filled in by ReactJS after the page loads. If you download the page with a plain HTTP request (urllib2, requests), you won't see that data.

Instead, use Selenium WebDriver: open the page in a real browser and let it run all the JavaScript. Then you can access all the data you expect.
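
A minimal sketch of that approach, under a few assumptions: Firefox and its driver are installed, the rendered page still uses the "row amenities" class from the question, and each amenity label sits in its own <span>. The helper names render_page and extract_amenities are made up for this example.

```python
from bs4 import BeautifulSoup


def render_page(url, timeout=15):
    """Load url in a real browser so its JavaScript runs, then return the HTML."""
    # Imported here so the parsing helper below also works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are available
    try:
        driver.get(url)
        # Wait until the amenities container from the question is in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.row.amenities"))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_amenities(html):
    """Return the text of each innermost <span> inside the 'row amenities' blocks."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for block in soup.select("div.row.amenities"):
        for span in block.find_all("span"):
            # Skip wrapper spans so nested labels are not reported twice.
            if span.find("span") is not None:
                continue
            text = span.get_text(strip=True)
            if text:
                items.append(text)
    return items
```

Calling extract_amenities(render_page(url)) with the URL from the question would then return the labels as a list. If the list still ends with '+ Plus', the remaining items are probably only rendered after that link is clicked; in that case, click it through Selenium before reading page_source.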

