I'm trying to web-scrape data from a structure that looks like this:
<div class = "tables">
    <div class = "table1">
        <div class = "row">
            <div class = 'data'>Useful Data</div>
            <a href = "url1">
        </div>
        <div class = "row">
            <div class = 'data'>Useful Data</div>
            <a href = "url1">
        </div>
    </div>
    <div class = "table2">
        <div class = "row">
            <div class = 'data'>Useful Data</div>
            <a href = "url3">
        </div>
        <div class = "row">
            <div class = 'data'>Useful Data</div>
            <a href = "url4">
        </div>
    </div>
</div>
The data I want is in the div "data" and also on other pages that are reached by following the URLs. I iterate through the 'tables' using BeautifulSoup, and I'm trying to click on the links with Selenium like so:
tables = soup.find_all('div', class_ = 'tables')
for line in tables:
    row = line.find_all('div', class_ = "row")
    for element in row:
        link = driver.find_element_by_xpath('//a[contains(@href,"href")]')
        #some code
In my script, this line
link = driver.find_element_by_xpath('//a[contains(@href,"href")]')
always returns the first URL, whereas I want it to 'follow' BeautifulSoup and return each subsequent href in turn. Is there a way to adapt the XPath to the URL found in the source code? I should add that all my URLs are very similar except for the last part (e.g. url1 = questions/ask/1000, url2 = questions/ask/1001).
I've also tried to find all the hrefs on the page and iterate through them using
links = self.driver.find_element_by_xpath('//a[@href]')
but that doesn't work either. Since the page contains a lot of links that aren't useful to me, I'm not sure if that's the best way to go.
That seems a bit complicated. Why not extract the href with BeautifulSoup directly?
for a in soup.select('.tables a[href]'):
    link = a['href']
You can also modify it, for example by concatenating it with baseUrl, and store the results in a list to iterate over:
urls = [baseUrl+a['href'] for a in soup.select('.tables a[href]')]
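Plain string concatenation works when every href starts with a slash, as in the example here. If the hrefs might be relative without a leading slash, or already absolute, the standard library's urljoin handles those cases. A minimal sketch, assuming a made-up base URL and made-up hrefs (none of these values come from the original page):

```python
from urllib.parse import urljoin

# Hypothetical base URL and hrefs, chosen only to show the three cases.
base_url = "http://www.example.com/questions/"
hrefs = [
    "/url1",                          # root-relative: replaces the whole path
    "ask/1001",                       # relative: resolved against the base path
    "http://other.example.org/page",  # already absolute: returned unchanged
]

urls = [urljoin(base_url, href) for href in hrefs]
for url in urls:
    print(url)
```

In the list comprehension above, this would amount to replacing baseUrl+a['href'] with urljoin(baseUrl, a['href']).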
from bs4 import BeautifulSoup

baseUrl = 'http://www.example.com'
html = '''
<div class = "tables">
    <div class = "table1">
        <div class = "row">
            <div class = 'data'>Useful Data</div>
            <a href = "/url1">
        </div>
        <div class = "row">
            <div class = 'data'>Useful Data</div>
            <a href = "/url1">
        </div>
    </div>
    <div class = "table2">
        <div class = "row">
            <div class = 'data'>Useful Data</div>
            <a href = "/url3">
        </div>
        <div class = "row">
            <div class = 'data'>Useful Data</div>
            <a href = "/url4">
        </div>
    </div>
</div>'''

soup = BeautifulSoup(html, 'lxml')
# Build an absolute URL for every link inside the .tables container
urls = [baseUrl + a['href'] for a in soup.select('.tables a[href]')]
for url in urls:
    print(url)  # or request the page, ...
Output:
http://www.example.com/url1
http://www.example.com/url1
http://www.example.com/url3
http://www.example.com/url4
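Since the question also needs the text in each "data" div alongside its link, one possible extension is to walk the rows and keep each text/URL pair together, then load the URLs one by one. This is only a sketch: collect_rows and fetch_detail_pages are hypothetical helper names, the sample HTML is invented, and the driver is assumed to be a Selenium WebDriver (only its get method and page_source attribute are used).

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://www.example.com"  # placeholder base URL, as in the answer


def collect_rows(html, base_url=BASE_URL):
    """Return (data_text, absolute_url) pairs, one per div.row that has both."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.select(".tables .row"):
        data = row.select_one(".data")
        link = row.select_one("a[href]")
        if data is None or link is None:
            continue  # skip rows missing either the data div or the link
        pairs.append((data.get_text(strip=True), urljoin(base_url, link["href"])))
    return pairs


def fetch_detail_pages(driver, pairs):
    """Load each URL with the given driver and return {url: page source}."""
    pages = {}
    for _, url in pairs:
        driver.get(url)                 # navigate directly, no clicking needed
        pages[url] = driver.page_source
    return pages


# Invented sample markup with the same class names as in the question.
sample_html = """
<div class="tables">
  <div class="table1">
    <div class="row"><div class="data">Data A</div><a href="/url1">more</a></div>
    <div class="row"><div class="data">Data B</div><a href="/url2">more</a></div>
  </div>
  <div class="table2">
    <div class="row"><div class="data">Data C</div><a href="/url3">more</a></div>
  </div>
</div>
"""

pairs = collect_rows(sample_html)
for text, url in pairs:
    print(text, url)
```

With a real browser session, fetch_detail_pages(driver, pairs) would visit each page in turn, and each returned page source could be parsed with BeautifulSoup in the same way as the listing page.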