I've been working on a Python 3 script to generate BibTeX entries, and I have ISSNs that I would like to use to look up information about the associated journal.
For instance, I would like to take the ISSN 0897-4756 and find that it belongs to the journal Chemistry of Materials, which is published by ACS Publications.
I can do this manually using the JournalGuide search site, where the information I am looking for is stored in the table matched by the XPath //table[@id="journal-search-results-table"], or more specifically, in the cells of its table body.
However, I have not been able to automate this successfully in Python 3.x.
I have tried approaches based on the httplib2, requests, urllib2, and lxml.html packages, with no success thus far.
What I have so far is shown below:
import certifi
import lxml.html
import urllib.request
ISSN = "0897-4756"
address = "https://www.journalguide.com/journals/search?type=journal-name&journal-name={}".format(ISSN)
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
request = urllib.request.Request(address,None,hdr) #The assembled request
response = urllib.request.urlopen(request)
html = response.read()
tree = lxml.html.fromstring(html)
print(tree.xpath('//table[@id="journal-search-results-table"]/text()'))
# >> ['\n', '\n']
# Shows that the table element itself is found
print(tree.xpath('//table[@id="journal-search-results-table"]//td/text()'))
# >> []
# Expected this to hold the cell data I am looking for, but it is empty
Exact page being queried by the above
From what I can tell, the table's tbody element, and thus the tr and td elements it contains, has not been loaded at the time Python parses the HTML, which prevents me from reading the data.
How do I make it so that I can read out the Journal Name and Publisher from the specified table above?
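The suspicion that the static HTML contains the table but no cells can be checked by counting tags in the raw response. The following is only a minimal sketch using the standard library's html.parser; the `static_html` snippet is a hypothetical stand-in for the server response, not the real page markup, and `TagCounter` and `count_tags` are illustrative names.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each start tag occurs in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html_text):
    """Parse html_text and return a dict mapping tag name to occurrence count."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Hypothetical stand-in for the static HTML returned by the server:
# the table exists, but its body has not been filled in by JavaScript.
static_html = """
<table id="journal-search-results-table">
<thead><tr><th>Journal</th><th>Publisher</th></tr></thead>
<tbody>
</tbody>
</table>
"""

counts = count_tags(static_html)
print(counts.get("table", 0), counts.get("td", 0))
```

With the real response, the bytes returned by response.read() would first need decoding, for example count_tags(html.decode("utf-8", errors="replace")). A table count above zero together with a td count of zero would be consistent with the rows being added later by a script.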
As you mentioned in your question, this table is populated dynamically by JavaScript. To get around this, you actually have to render the JavaScript before parsing the page. The requests-html library can render JavaScript on a webpage and has a lot of other useful features for web scraping. This is one way to solve your problem using requests-html:
from requests_html import HTMLSession
ISSN = "0897-4756"
address = "https://www.journalguide.com/journals/search?type=journal-name&journal-name={}".format(ISSN)
hdr = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}
ses = HTMLSession()
response = ses.get(address, headers=hdr)
response.html.render() # render the javascript to load the elements in the table
tree = response.html.lxml # no need to import lxml.html because requests-html can do this for you
print(tree.xpath('//table[@id="journal-search-results-table"]/text()'))
# >> ['\n', '\n']
print(tree.xpath('//table[@id="journal-search-results-table"]//td/text()'))
# >> ['ACS Publications', '1.905', 'No', '\n', '\n', '\n']
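The output above still contains whitespace-only strings, and the journal name is missing. The XPath //td/text() returns only text nodes that are direct children of a td, so text wrapped in a nested element is skipped. The sketch below shows one possible way to collect the full text of each cell and drop the empty ones. It uses the standard library's ElementTree on a hypothetical, well-formed row; the real page's markup may differ, and the link around the journal name is an assumption. `cell_texts` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one rendered result row (assumed markup).
row_markup = """
<tr>
  <td><a href="#">Chemistry of Materials</a></td>
  <td>ACS Publications</td>
  <td>1.905</td>
  <td>No</td>
  <td>
  </td>
</tr>
"""


def cell_texts(row):
    """Return the stripped text of each non-empty cell, including nested tags."""
    texts = []
    for cell in row.iter("td"):
        # itertext() walks the cell's own text and the text of all descendants
        text = "".join(cell.itertext()).strip()
        if text:
            texts.append(text)
    return texts


row = ET.fromstring(row_markup.strip())
fields = cell_texts(row)
print(fields)
# ['Chemistry of Materials', 'ACS Publications', '1.905', 'No']
```

lxml elements also provide itertext(), and elements parsed with lxml.html offer text_content(), so the same idea should carry over to the tree obtained from response.html.lxml by looping over tree.xpath('//table[@id="journal-search-results-table"]//td').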