
Extract data from dynamic HTML Table with Python 3

I've been working on a Python 3 script to generate BibTeX entries, and I have ISSNs that I would like to use to look up information about the associated journal.

For instance, I would like to take the ISSN 0897-4756 and find that it belongs to the journal Chemistry of Materials, which is published by ACS Publications.

I can do this manually using this site, where the information I am looking for is stored in the table matched by the XPath //table[@id="journal-search-results-table"], or more specifically in the cells of its table body.

I have not, however, been able to automate this successfully using Python 3.x.

I have tried to access the data using approaches from the httplib2, requests, urllib2, and lxml.html packages, with no success thus far.

What I have so far is shown below:

import certifi       
import lxml.html
import urllib.request

ISSN = "0897-4756"
address = "https://www.journalguide.com/journals/search?type=journal-name&journal-name={}".format(ISSN)

hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
  'Accept-Encoding': 'none',
  'Accept-Language': 'en-US,en;q=0.8',
  'Connection': 'keep-alive'}

request = urllib.request.Request(address,None,hdr)  #The assembled request
response = urllib.request.urlopen(request)
html = response.read()
tree = lxml.html.fromstring(html)

print(tree.xpath('//table[@id="journal-search-results-table"]/text()'))
# >> ['\n', '\n']
# Shows that I am connecting to the table

print(tree.xpath('//table[@id="journal-search-results-table"]//td/text()'))
# >> []
# Should???? hold the data segments that I am looking for?

Exact page being queried by the above

From what I can tell, the table's tbody element, and thus the tr and td elements it contains, has not been loaded at the point when Python parses the HTML, which prevents me from reading the data.
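A quick way to check this diagnosis is to count the td cells in the raw HTML before any JavaScript has run. The sketch below uses only the standard library. The TableCellCounter class, the count_cells helper, and both sample strings are made up for illustration and are not taken from the site.

```python
from html.parser import HTMLParser


class TableCellCounter(HTMLParser):
    """Count <td> cells inside the table that has a given id (hypothetical helper)."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0      # nesting depth of <table> tags once the target is open
        self.found = False  # True once the target table has been seen
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth:
                self.depth += 1          # a table nested inside the target
            elif dict(attrs).get("id") == self.table_id:
                self.found = True
                self.depth = 1
        elif tag == "td" and self.depth:
            self.cells += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1


def count_cells(html, table_id="journal-search-results-table"):
    """Return (table_found, number_of_td_cells) for an HTML string."""
    parser = TableCellCounter(table_id)
    parser.feed(html)
    parser.close()
    return parser.found, parser.cells


# Made-up markup: what a server might send before JavaScript fills the table...
STATIC_HTML = """
<table id="journal-search-results-table">
<thead><tr><th>Publisher</th></tr></thead>
<tbody></tbody>
</table>
"""

# ...and what the same table might look like after rendering.
RENDERED_HTML = """
<table id="journal-search-results-table">
<tbody><tr><td>ACS Publications</td><td>No</td></tr></tbody>
</table>
"""

print(count_cells(STATIC_HTML))    # (True, 0): table present, no cells
print(count_cells(RENDERED_HTML))  # (True, 2)
```

With the script above, the bytes from response.read() would first need decoding, for example count_cells(html.decode("utf-8", errors="replace")), assuming the page is UTF-8. A result of (True, 0) would be consistent with the rows being added later by JavaScript.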

How can I read the journal name and publisher out of the table specified above?

As you mentioned in your question, this table is populated dynamically by JavaScript. To get around this, you have to actually render the JavaScript, using either:

  • A web driver such as Selenium, which loads the page in a real browser and so sees it the way a user would (with the JavaScript rendered)
  • requests-html, a relatively new module that can render the JavaScript on a page and has many other useful features for web scraping
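For the Selenium option, a rough sketch might look like the following. It assumes the selenium package and a local Chrome installation are available, and it has not been run against the site. The helper names build_search_url and fetch_cells, and the 15-second timeout, are illustrative choices rather than anything the site requires.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.journalguide.com/journals/search"
CELL_XPATH = '//table[@id="journal-search-results-table"]//td'


def build_search_url(issn):
    """Build the same search URL that the question uses."""
    return BASE_URL + "?" + urlencode({"type": "journal-name", "journal-name": issn})


def fetch_cells(issn, timeout=15):
    """Render the page in headless Chrome and return the visible text of each cell."""
    # Imported here so the module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(build_search_url(issn))
        # Block until JavaScript has inserted at least one cell (or time out).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, CELL_XPATH))
        )
        return [cell.text for cell in driver.find_elements(By.XPATH, CELL_XPATH)]
    finally:
        driver.quit()  # always close the browser, even on errors


# Usage (needs Chrome and the selenium package):
#     fetch_cells("0897-4756")
```

The explicit wait is the important part: reading the table immediately after driver.get() can hit the same empty table body as the original script.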

This is one way to solve your problem using requests-html:

from requests_html import HTMLSession

ISSN = "0897-4756"
address = "https://www.journalguide.com/journals/search?type=journal-name&journal-name={}".format(ISSN)

hdr = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}

ses = HTMLSession()
response = ses.get(address, headers=hdr)
response.html.render()  # render the JavaScript so the table rows get loaded (downloads Chromium on first use)
tree = response.html.lxml  # requests-html exposes an lxml tree, so there is no need to import lxml.html

print(tree.xpath('//table[@id="journal-search-results-table"]/text()'))
# >> ['\n', '\n']

print(tree.xpath('//table[@id="journal-search-results-table"]//td/text()'))
# >> ['ACS Publications', '1.905', 'No', '\n', '\n', '\n']
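Note that the output above has no journal name and contains several bare '\n' entries. That is what td/text() would give if some cells keep their text inside a child element such as a link, because text() on td only returns the cell's direct text nodes. Whether the page is actually structured that way is an assumption, since its markup is not shown here. A minimal sketch of collecting all the text of each cell, using the standard library's ElementTree on made-up markup (the same itertext() call also exists on lxml elements):

```python
import xml.etree.ElementTree as ET


def cell_texts(row):
    """Return the stripped text of every <td> in a row, including nested tags."""
    return ["".join(td.itertext()).strip() for td in row.iter("td")]


# Hypothetical markup standing in for the rendered table; the real column
# layout on the site is an assumption.
SAMPLE = """
<table id="journal-search-results-table">
  <tbody>
    <tr>
      <td><a href="#">Chemistry of Materials</a></td>
      <td>ACS Publications</td>
      <td>1.905</td>
      <td>No</td>
    </tr>
  </tbody>
</table>
"""

table = ET.fromstring(SAMPLE)
rows = [cell_texts(tr) for tr in table.iter("tr")]
print(rows)
# [['Chemistry of Materials', 'ACS Publications', '1.905', 'No']]
```

With the lxml tree from the answer, the rows could be selected with tree.xpath('//table[@id="journal-search-results-table"]//tr') and passed to the same cell_texts helper, which yields one list per journal instead of one flat list.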


 