Extracting data from an inconsistent HTML page using BeautifulSoup4 and Python

I'm trying to extract data from this webpage and I'm having some trouble due to inconsistencies in the page's HTML formatting. I have a list of OGAP IDs, and for each ID I iterate through I want to extract the Gene Name and any literature information (PMID #). Thanks to other questions on here and the BeautifulSoup documentation, I've been able to get the gene name consistently for each ID, but I'm having trouble with the literature part. Here are a couple of search terms that highlight the inconsistencies.

HTML sample that works

Search term: OG00131

 <tr> <td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation: <br>&nbsp;&nbsp;PMID: <a href="http://www.ncbi.nlm.nih.gov/pubmed/20068230">20068230</a> [CAD, ETD MS/MS]; <br> <br> </td> </tr> 

HTML sample that doesn't work

Search term: OG00020

 <td align="top" bgcolor="#FBFFCC"> <div class="STYLE28">Literature describing O-GlcNAcylation: </div> <div class="STYLE28"> <div class="STYLE28">PMID: <a href="http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation">16408927</a> [Azide-tag, nano-HPLC/tandem MS] </div> <br> Site has not yet been determined. Use <a href="parser2.cgi?ACLY_HUMAN" target="_blank">OGlcNAcScan</a> to predict the O-GlcNAc site. </div> </td> 

Here's the code I have so far

import urllib2
from bs4 import BeautifulSoup

#define list of genes

#initialize variables
gene_list = []
literature = []
# Test list
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]


for i in range(len(gene_listID)):
    print gene_listID[i]
    # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided
    dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i]
    # Opens the URL as a page
    page = urllib2.urlopen(dbOGAP)
    # Reads the page and parses it through "lxml" format
    soup = BeautifulSoup(page, "lxml")

    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
    print gene_name[1:]
    gene_list.append(gene_name[1:])

    # PubMed IDs are located near the <td> tag with the term "Data and Source"
    pmid = soup.find("span", text="Data and Source")

    # Based on inspection of the website, need to move up to the parent <td> tag
    pmid_p = pmid.parent

    # Then we move to the next <td> tag, denoted as sibling (since they share parent <tr> (Table row) tag)
    pmid_s = pmid_p.next_sibling
    #for child in pmid_s.descendants:
     #   print child
    # Now we search down the tree to find the next table data (<td>) tag
    pmid_c = pmid_s.find("td")
    temp_lit = []
    # Next we print the text of the data
    #print pmid_c.text
    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
    # and then print out a list of urls for each pubmed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # the <a> tag includes more than just the link address.
            # for each <a> tag found, print the address (href attribute) and extra bits
            # link.string provides the string that appears to be hyperlinked.
            # In this case, it is the pubmedID
            print link.string
            temp_lit.append("PMID: " + link.string + "  URL: " + link.get('href'))
    literature.append(temp_lit)
    print "\n"

So it seems the <div> element is what is throwing the code for a loop. Is there a way to search for any element with the text "PMID" and return the text that comes after it (and the URL, if there is a PMID number)? If not, would I just want to check each child, looking for the text I'm interested in?

I'm using Python 2.7.10

import re

import requests
from bs4 import BeautifulSoup

gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)
# Matches PubMed links; compiled once, with the dots escaped so they match literally
regex = re.compile(r'http://www\.ncbi\.nlm\.nih\.gov/pubmed/\d+')

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')

    # First <a> tag whose href points at a PubMed record (None if the page has none)
    a_tag = soup.find('a', href=regex)
    # The node just before the link should be the text that contains "PMID"
    has_pmid = a_tag is not None and 'PMID' in a_tag.previous_element

    if has_pmid:
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
    else:
        print("Not available")

Output:

18984734  [GalNAz-Biotin tagging, CAD MS/MS];  http://www.ncbi.nlm.nih.gov/pubmed/18984734
20068230  [CAD, ETD MS/MS];  http://www.ncbi.nlm.nih.gov/pubmed/20068230
20068230  [CAD, ETD MS/MS];  http://www.ncbi.nlm.nih.gov/pubmed/20068230
Not available
16408927  [Azide-tag, nano-HPLC/tandem MS];   http://www.ncbi.nlm.nih.gov/pubmed/16408927
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS]  http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation

This finds the first <a> tag whose href matches the target URL (a PubMed address ending in digits), then checks whether 'PMID' appears in its previous element. The page's markup is very inconsistent and it took me several tries to get this far; I hope it helps.
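If a record can list more than one PMID, the same idea extends from find to find_all. The following is only a minimal sketch, not tested against the live site: it runs on the two HTML snippets quoted in the question, uses the built-in "html.parser" so it does not need lxml, and assumes Python 3. The names extract_pmids, PUBMED_HREF, SAMPLE_TABLE and SAMPLE_DIV are made up for this illustration.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Hypothetical name: matches any href that points at a PubMed record.
PUBMED_HREF = re.compile(r'ncbi\.nlm\.nih\.gov/pubmed/\d+')

# The two literature snippets quoted in the question.
SAMPLE_TABLE = '''<tr> <td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation: <br>&nbsp;&nbsp;PMID: <a href="http://www.ncbi.nlm.nih.gov/pubmed/20068230">20068230</a> [CAD, ETD MS/MS]; <br> <br> </td> </tr>'''

SAMPLE_DIV = '''<td align="top" bgcolor="#FBFFCC"> <div class="STYLE28">Literature describing O-GlcNAcylation: </div> <div class="STYLE28"> <div class="STYLE28">PMID: <a href="http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation">16408927</a> [Azide-tag, nano-HPLC/tandem MS] </div> <br> Site has not yet been determined. Use <a href="parser2.cgi?ACLY_HUMAN" target="_blank">OGlcNAcScan</a> to predict the O-GlcNAc site. </div> </td>'''


def extract_pmids(html):
    """Return a list of (pmid, note, url) for each PubMed link preceded by 'PMID'."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for a_tag in soup.find_all('a', href=PUBMED_HREF):
        # The text node immediately before the link should mention "PMID".
        before = a_tag.previous_element
        if not isinstance(before, NavigableString) or 'PMID' not in before:
            continue
        # The text right after the link holds the method note, e.g. "[CAD, ETD MS/MS];".
        after = a_tag.next_sibling
        note = ''
        if isinstance(after, NavigableString):
            note = after.strip().rstrip(';')
        results.append((a_tag.get_text(strip=True), note, a_tag['href']))
    return results


if __name__ == '__main__':
    for sample in (SAMPLE_TABLE, SAMPLE_DIV):
        print(extract_pmids(sample))
```

Because the lookup starts from the link itself rather than from the surrounding <td> or <div>, it should not matter which of the two layouts a page uses. A page without any PubMed link simply gives an empty list, which you can report as "Not available".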
