Extracting data from inconsistent HTML pages using BeautifulSoup4 and Python

Question

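I am trying to extract data from this web page, but I'm having trouble because the page's HTML is formatted inconsistently. I have a list of OGAP IDs, and for each ID I iterate over I want to extract the gene name and any literature information (PMID #). Thanks to other questions here and the BeautifulSoup documentation, I can reliably get the gene name for every ID, but I'm having trouble with the literature section. Below are a couple of search terms that highlight the inconsistencies.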

HTML sample that works

Search term: OG00131

 <tr> <td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation: <br>&nbsp;&nbsp;PMID: <a href="http://www.ncbi.nlm.nih.gov/pubmed/20068230">20068230</a> [CAD, ETD MS/MS]; <br> <br> </td> </tr>

HTML sample that does not work

Search term: OG00020

 <td align="top" bgcolor="#FBFFCC"> <div class="STYLE28">Literature describing O-GlcNAcylation: </div> <div class="STYLE28"> <div class="STYLE28">PMID: <a href="http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation">16408927</a> [Azide-tag, nano-HPLC/tandem MS] </div> <br> Site has not yet been determined. Use <a href="parser2.cgi?ACLY_HUMAN" target="_blank">OGlcNAcScan</a> to predict the O-GlcNAc site. </div> </td>

Here is my code so far

import urllib2
from bs4 import BeautifulSoup

#define list of genes

#initialize variables
gene_list = []
literature = []
# Test list
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]


for i in range(len(gene_listID)):
    print gene_listID[i]
    # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided
    dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i]
    # Opens the URL as a page
    page = urllib2.urlopen(dbOGAP)
    # Reads the page and parses it through "lxml" format
    soup = BeautifulSoup(page, "lxml")

    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
    print gene_name[1:]
    gene_list.append(gene_name[1:])

    # PubMed IDs are located near the <td> tag with the term "Data and Source"
    pmid = soup.find("span", text="Data and Source")

    # Based on inspection of the website, need to move up to the parent <td> tag
    pmid_p = pmid.parent

    # Then we move to the next <td> tag, denoted as sibling (since they share parent <tr> (Table row) tag)
    pmid_s = pmid_p.next_sibling
    #for child in pmid_s.descendants:
     #   print child
    # Now we search down the tree to find the next table data (<td>) tag
    pmid_c = pmid_s.find("td")
    temp_lit = []
    # Next we print the text of the data
    #print pmid_c.text
    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
    # and then print out a list of urls for each pubmed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # the <a> tag includes more than just the link address.
            # for each <a> tag found, print the address (href attribute) and extra bits
            # link.string provides the string that appears to be hyperlinked.
            # In this case, it is the pubmedID
            print link.string
            temp_lit.append("PMID: " + link.string + "  URL: " + link.get('href'))
    literature.append(temp_lit)
    print "\n"

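So it seems that the extra element (apparently the nested <div> in the second sample) is what throws the code for a loop. Is there a way to search for any element containing the text "PMID" and return the text that comes after it (and the URL, if there is a PMID number)? If not, should I just check each child, looking for the text I'm interested in?

I am using Python 2.7.10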

Answer 1

import requests
from bs4 import BeautifulSoup
import re
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)

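# Compile the pattern once, outside the loop; dots are escaped so they match literally
regex = re.compile(r'http://www\.ncbi\.nlm\.nih\.gov/pubmed/\d+')

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')

    # First <a> tag whose href points at a PubMed record
    a_tag = soup.find('a', href=regex)
    # find() returns None when no such link exists, so guard before checking the preceding text
    has_pmid = a_tag is not None and 'PMID' in a_tag.previous_element

    if has_pmid:
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
    else:
        print("Not available")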

Output:

18984734  [GalNAz-Biotin tagging, CAD MS/MS];  http://www.ncbi.nlm.nih.gov/pubmed/18984734
20068230  [CAD, ETD MS/MS];  http://www.ncbi.nlm.nih.gov/pubmed/20068230
20068230  [CAD, ETD MS/MS];  http://www.ncbi.nlm.nih.gov/pubmed/20068230
Not available
16408927  [Azide-tag, nano-HPLC/tandem MS];   http://www.ncbi.nlm.nih.gov/pubmed/16408927
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS]  http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation

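Find the first <a> tag whose href matches the target URL and ends in digits, then check whether the element just before it contains "PMID". This site's markup is very inconsistent and it took me many attempts; I hope this helps.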
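If a record lists more than one PMID, the same idea can be extended with find_all. The following is only a rough sketch, not part of the code above: the helper name extract_pmids is made up for illustration, it assumes Python 3, and it uses the built-in "html.parser" so it can be tried on HTML snippets without lxml. It matches on the link's href rather than on the surrounding <td> or <div>, so both layouts shown in the question should be handled the same way.

```python
import re
from bs4 import BeautifulSoup

# Matches PubMed record links, with or without a trailing query string
PUBMED_HREF = re.compile(r"ncbi\.nlm\.nih\.gov/pubmed/(\d+)")


def extract_pmids(html):
    """Hypothetical helper: return a list of (pmid, note, url) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # find_all returns every matching <a>, regardless of the enclosing tags
    for a_tag in soup.find_all("a", href=PUBMED_HREF):
        url = a_tag["href"]
        pmid = PUBMED_HREF.search(url).group(1)
        # The bracketed method note, when present, is the text node
        # immediately after the link (NavigableString subclasses str)
        sibling = a_tag.next_sibling
        note = sibling.strip() if isinstance(sibling, str) else ""
        results.append((pmid, note.rstrip(";").strip(), url))
    return results


# Small demo on the <div>-based layout from the question (abbreviated)
sample = (
    '<td><div class="STYLE28">PMID: '
    '<a href="http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation">'
    '16408927</a> [Azide-tag, nano-HPLC/tandem MS] </div></td>'
)
print(extract_pmids(sample))
```

An empty list plays the role of "Not available" here, so the caller can decide how to report records without literature.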

Solution 1
0 2016-12-06 01:26:45
