
Python: Is there a way to scrape the abstract text from the articles within each of the href links on a search-results-page of an online database?

When I enter search terms into the search bar of an online database (PubMed, a database of scientific articles), I get a list of links to articles that match the search. I want to click on each of the links, open each one in a new tab and copy the text of the abstract (the article summary) so that I can paste each one into a file.

I recently found out that it might be a lot more useful to do this with Python. I am aware that I can scrape the URL data as follows:

import requests
import bs4

root_url = 'https://www.ncbi.nlm.nih.gov/pubmed'
index_url = root_url + '/?term=%28histone%29+AND+%28chromatin%29+AND+%28hESC%29'

def get_video_page_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    # each search result sits in a div with class "rprt"; collect the /pubmed/... links inside it
    return [a.attrs.get('href') for a in soup.select('div.rprt a[href^="/pubmed"]')]

print(get_video_page_urls())

['/pubmed/27939217', '/pubmed?linkname=pubmed_pubmed&from_uid=27939217'..... etc.

My question is: can I collect the abstract text (similar to clicking into each link and copy-pasting the text) from each of the href links that result from the search, and subsequently analyse it?

Initially, I tried:

import requests
r = requests.get('https://www.ncbi.nlm.nih.gov/pubmed/?term=%28histone%29+AND+%28chromatin%29+AND+%28hESC%29')
r.content

The output of this is all the HTML that makes up the search-results page, but I cannot seem to find a distinct pattern that identifies the text that each href links to. So I'm wondering how I can isolate text that is on a different page...?

BeautifulSoup is designed to handle poorly structured pages heuristically. For cleaner pages and straightforward data scraping like this, I prefer lxml with XPath calls. To find the XPaths of the on-page content you're after, either use the Inspect feature of your browser or a browser plugin like XPath Helper Wizard.
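For example, here is a minimal sketch of that approach against a single article page, using the same //abstracttext XPath as the full script below (the article ID is taken from the question's output):

from lxml import html
import requests

# fetch one article page and pull its abstract text with an XPath query
page = requests.get('https://www.ncbi.nlm.nih.gov/pubmed/27939217')
tree = html.fromstring(page.content)
abstract = ''.join(tree.xpath('//abstracttext/text()'))
print(abstract)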

This will dump out the first 20 results and their abstracts to a CSV file. To run more search terms, put it all in a loop with a list of terms (a rough sketch of that loop follows the script below). To fetch more results than the default 20, add the parameter dispmax=## to the URL, e.g. https://www.ncbi.nlm.nih.gov/pubmed?term=((histone)%20AND%20chromatin)%20AND%20ESC&dispmax=100

import unicodecsv as csv
from lxml import html
import lxml.html.clean
import requests
csv_out = open('PubMed_Abstracts.csv', 'ab')
writer = csv.writer(csv_out, dialect='excel', delimiter=',', encoding='utf-8')
writer.writerow(['Search_Term', 'Result', 'Title', 'URL', 'Abstract'])
Search_Term = '((histone)%20AND%20chromatin)%20AND%20ESC'
Search_URL = 'https://www.ncbi.nlm.nih.gov/pubmed?term=' + Search_Term  # add &dispmax=NN here to fetch more than the default 20 results
Search_Page = requests.get(Search_URL)
Search_Tree = html.fromstring(Search_Page.content)
# total number of results
Search_Results = Search_Tree.xpath('//h3[@class="result_count left"]/text()')
Num_Results = str([' '.join(str(result).split()) for result in Search_Results])
Num_Results_Val = Num_Results[Num_Results.find('of') + 3:-2]
# Links for results 1-20
title_cleaner = lxml.html.clean.Cleaner(allow_tags=['div', 'p', 'a'], remove_unknown_tags=False)
Title_Tree = title_cleaner.clean_html(Search_Tree)
Pub_Results = Title_Tree.xpath('//div[@class="rprt"]/div[@class="rslt"]/p[@class="title"]/a')
r = 1
for Pub_Result in Pub_Results:
    Result_Num = str(r) + '/' + str(Num_Results_Val)
    Pub_Title = ' '.join(Pub_Result.text_content().split())
    Rel_URL = Pub_Result.get('href')
    Pub_URL = Rel_URL.replace('/pubmed/', 'https://www.ncbi.nlm.nih.gov/pubmed/')
    Pub_Page = requests.get(Pub_URL)
    Pub_Tree = html.fromstring(Pub_Page.content)
    Abstract = ''.join(Pub_Tree.xpath('//abstracttext/text()'))
    writer.writerow([Search_Term, Result_Num, Pub_Title, Pub_URL, Abstract])
    r += 1

csv_out.close()
exit()
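As mentioned above, to run several search terms you can wrap the same logic in a loop over a list of terms and add dispmax to each URL. A rough sketch (the second term in the list is only a placeholder):

# sketch: loop the search/fetch logic above over several terms,
# requesting more than the default 20 results with dispmax
search_terms = ['((histone)%20AND%20chromatin)%20AND%20ESC',
                '(histone)%20AND%20(hESC)']  # placeholder terms, adjust as needed

for Search_Term in search_terms:
    Search_URL = ('https://www.ncbi.nlm.nih.gov/pubmed?term=' + Search_Term
                  + '&dispmax=100')
    Search_Page = requests.get(Search_URL)
    Search_Tree = html.fromstring(Search_Page.content)
    # ...then reuse the result-count, title and abstract code from above,
    # writing each row with writer.writerow([...]) as before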

You can work on this more. This is what I have got:

url="https://www.ncbi.nlm.nih.gov/pubmed/28034892"
r = requests.get(url)
print BeautifulSoup(r.content).select('div.abstr')[0].prettify()
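If you only want the plain abstract text rather than the HTML markup, BeautifulSoup's get_text() can replace prettify(). A minimal sketch:

import requests
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/pubmed/28034892"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
# extract only the text of the abstract block, with whitespace normalised
print(soup.select('div.abstr')[0].get_text(' ', strip=True))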

To get all the abstracts from those URLs you can use this:

for a in set(get_video_page_urls()):
    if len(a) < 40:  # keep only the plain /pubmed/ID links, skip the long "related articles" links
        url = "https://www.ncbi.nlm.nih.gov" + a
        r = requests.get(url)
        print(BeautifulSoup(r.content, 'html.parser').select('div.abstr')[0].prettify())

Instead of printing it to the screen, you can save it to a file.
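For example, a minimal sketch that writes each abstract to a text file instead, reusing get_video_page_urls from the question (the abstracts.txt filename is just an illustration):

import requests
from bs4 import BeautifulSoup

with open('abstracts.txt', 'w', encoding='utf-8') as out:
    for a in set(get_video_page_urls()):
        if len(a) < 40:  # plain /pubmed/ID links only
            r = requests.get("https://www.ncbi.nlm.nih.gov" + a)
            blocks = BeautifulSoup(r.content, 'html.parser').select('div.abstr')
            if blocks:  # some linked pages may not contain an abstract block
                out.write(blocks[0].get_text(' ', strip=True) + '\n\n')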
