
How to scrape deeply embedded links with Python BeautifulSoup

I'm trying to build a spider/web crawler for academic purposes to grab text from academic publications and append related links to a URL stack. I'm trying to crawl one website, PubMed, but I can't seem to grab the links I need. Here is my code with an example page; this page should be representative of others in their database:

 website = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=mtap+prmt'
 from bs4 import BeautifulSoup
 import requests
 r = requests.get(website)
 soup = BeautifulSoup(r.content, 'html.parser')

I have broken the HTML tree down into several variables just for readability, so that it all fits in one screen width.

 key_text = soup.find('div', {'class':'grid'}).find('div',{'class':'col twelve_col nomargin shadow'}).find('form',{'id':'EntrezForm'})
 side_column = key_text.find('div', {'xmlns:xi':'http://www.w3.org/2001/XInclude'}).find('div', {'class':'supplemental col three_col last'})
 side_links = side_column.find('div').findAll('div')[1].find('div', {'id':'disc_col'}).findAll('div')[1]

 for link in side_links:
      print(link)

If you look at the HTML source code using Chrome's Inspect Element, there should be several other nested divs with links within 'side_links'. However, the above code produces the following error:

 Traceback (most recent call last):
 File "C:/Users/ballbag/Copy/web_scraping/google_search.py", line 22, in <module>
 side_links = side_column.find('div').findAll('div')[1].find('div',      {'id':'disc_col'}).findAll('div')[1]
 IndexError: list index out of range

If you go to the URL, there is a column on the right called 'related links' containing the URLs that I wish to scrape, but I can't seem to get to them. There is a statement under the div I am trying to get into, and I suspect it has something to do with this. Can anyone help me grab these links? I'd really appreciate any pointers.

The problem is that the side bar is loaded by an additional asynchronous request, so it is simply not present in the HTML that `requests.get(website)` returns.
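You can confirm this offline: in the initial HTML, the side bar is only a placeholder anchor whose `href` points at the asynchronous endpoint. The snippet below is a hypothetical reduction of the real markup, not what PubMed literally serves, but the `id`/`class` names match what the page uses:

```python
from bs4 import BeautifulSoup

# Hypothetical reduction of the initial page markup: the side bar starts
# out as a placeholder link whose href is the asynchronous endpoint.
html = '<div id="disc_col"><a class="disc_col_ph" href="/pubmed?p$l=AjaxServer">loading</a></div>'

soup = BeautifulSoup(html, 'html.parser')

# The same selector used below against the live page finds the placeholder.
placeholder = soup.select('div#disc_col a.disc_col_ph')[0]
print(placeholder['href'])  # the URL the browser fetches to fill the side bar
```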

The idea here would be to:

  • maintain a web-scraping session using requests.Session
  • parse the url that is used for getting the side bar
  • follow that link and get the links from the div with class="portlet_content"

Code:

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests


base_url = 'http://www.ncbi.nlm.nih.gov'
website = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=mtap+prmt'

# parse the main page and grab the link to the side bar
session = requests.Session()
soup = BeautifulSoup(session.get(website).content, 'html.parser')

url = urljoin(base_url, soup.select('div#disc_col a.disc_col_ph')[0]['href'])

# follow the link and parse the side bar
soup = BeautifulSoup(session.get(url).content, 'html.parser')

for a in soup.select('div.portlet_content ul li.brieflinkpopper a'):
    print(a.text, urljoin(base_url, a.get('href')))

Prints:

The metabolite 5'-methylthioadenosine signals through the adenosine receptor A2B in melanoma. http://www.ncbi.nlm.nih.gov/pubmed/25087184
Down-regulation of methylthioadenosine phosphorylase (MTAP) induces progression of hepatocellular carcinoma via accumulation of 5'-deoxy-5'-methylthioadenosine (MTA). http://www.ncbi.nlm.nih.gov/pubmed/21356366
Quantitative analysis of 5'-deoxy-5'-methylthioadenosine in melanoma cells by liquid chromatography-stable isotope ratio tandem mass spectrometry. http://www.ncbi.nlm.nih.gov/pubmed/18996776
...
Cited in PMC http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/23265702/citedby/?tool=pubmed
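Since the stated goal is to append related links to a URL stack, the same selector can be exercised offline and the results collected into a list instead of printed. The HTML below is a hypothetical stand-in for the side-bar fragment that the asynchronous request returns; the real page uses the same class names:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'http://www.ncbi.nlm.nih.gov'

# Hypothetical stand-in for the side-bar HTML fragment returned by the
# asynchronous request; the class names match the live page.
html = """
<div class="portlet_content">
  <ul>
    <li class="brieflinkpopper"><a href="/pubmed/25087184">Article one</a></li>
    <li class="brieflinkpopper"><a href="/pubmed/21356366">Article two</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Build the URL stack the question describes: absolute links, ready to crawl.
url_stack = [urljoin(base_url, a['href'])
             for a in soup.select('div.portlet_content ul li.brieflinkpopper a')]
print(url_stack)
```

With the live page, you would feed `session.get(url).content` into `BeautifulSoup` instead of the inline string, then pop URLs off `url_stack` as the crawler visits them.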
