
Get PubMed Data from ID using bs4

I am working on a project to download title, abstract, year published and MeSH terms from a CSV file of ~12,000 PubMed IDs. I have written the code below:

import urllib2
from bs4 import BeautifulSoup
import csv

CSVfile = open('srData.csv')
fileReader = csv.reader(CSVfile)
Data = list(fileReader)
i = 0

with open('blank.csv','wb') as f1:
 writer=csv.writer(f1, delimiter='\t',lineterminator='\n',)
 for id in Data:
    soup = BeautifulSoup(urllib2.urlopen("http://www.ncbi.nlm.nih.gov/pubmed/" & id).read())
    jouryear = soup.find_all(attrs={"class": "cit"})
    year = jouryear[0].get_text()
    yearlength = len(year)
    titleend = year.find(".")
    year1 = titleend+2
    year2 = year1+1
    year3 = year2+1
    year4 = year3+1
    year5 = year4+1
    published_date = (year[year1:year5])

    title = soup.find_all(attrs={"class": "rprt abstract"})
    title = (title[0].h1.string)

    abstract = (soup.find_all(attrs={"class": "abstr"}))
    abstract = (abstract[0].p.string)
    writer.writerow([published_date, title, abstract])
    i = i+1
    print i

When I run it, I get the following error:

TypeError: unsupported operand type(s) for &: 'str' and 'list'

How can I fix this? I also have a problem where the year and the title are written to the same cell, but I need them in separate columns. What can I do to fix this?

I don't know what your srData.csv file looks like, but if it is just a list of IDs, e.g.

27383269
27281200

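you need to use id[0] instead of id; otherwise you are combining a string with a list, because csv.reader returns every row as a list. Note also that Python concatenates strings with +, not & (which is the bitwise AND operator), so the URL should be built as "http://www.ncbi.nlm.nih.gov/pubmed/" + id[0].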
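As a minimal sketch of that fix (written for Python 3, using an in-memory stand-in for srData.csv that is assumed to hold one ID per line), each row from csv.reader is a one-element list, so the ID has to be pulled out before it is joined to the URL:

```python
import csv
import io

# Stand-in for srData.csv: assumed to contain one PubMed ID per line.
sample_csv = io.StringIO("27383269\n27281200\n")

BASE_URL = "http://www.ncbi.nlm.nih.gov/pubmed/"  # base URL taken from the question

urls = []
for row in csv.reader(sample_csv):
    # row is a list such as ['27383269'], so take its first element
    pmid = row[0]
    # + concatenates two strings; & would raise a TypeError here
    urls.append(BASE_URL + pmid)

print(urls)
```

The names sample_csv, BASE_URL and pmid are only illustrative; with the real file you would pass the opened srData.csv to csv.reader instead.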

To get the publication date, title and abstract, you can use the following lines of code:

published_date = soup.find_all(attrs={"class": "cit"})[0].get_text().split('.')[1].split(';')[0].strip()
title = soup.find_all(attrs={"class": "rprt abstract"})[0].h1.string
abstract = soup.find_all(attrs={"class": "abstr"})[0].p.string
writer.writerow([published_date, title.encode('ascii', 'ignore'), abstract.encode('ascii', 'ignore')])

The date is a bit tricky because it has to be extracted from the full citation string, but all the other fields can be read directly.
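To make the date extraction easier to follow, here is a small sketch of the same split logic applied to a plain string. The citation text below is a made-up sample in the "journal. date;volume" shape, not real PubMed output, and extract_date is a hypothetical helper name:

```python
def extract_date(citation):
    """Return the text between the first '.' and the next ';' in a citation."""
    # Everything after the first period, up to the second period
    after_journal = citation.split('.')[1]
    # Keep only the part before the volume/issue separator
    return after_journal.split(';')[0].strip()


# Made-up citation string, for illustration only
sample_citation = "J Example. 2016 Jul 7;5. pii: e00000."
print(extract_date(sample_citation))
```

This assumes the first period in the citation ends the journal name. A journal abbreviation that itself contains a period would shift the result, so it is worth spot-checking a few rows of the output.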

Output for PubMed ID 27383269:

2016 Jul 7 Molecular dynamics-based refinement and validation for sub-5 cryo-electron microscopy maps. Two structure determination methods, based on the molecular dynamics flexible fitting (MDFF) [...]

Make sure to remove non-ASCII characters via encode; otherwise many abstracts and titles will raise encoding errors when they are written to the file.
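A short sketch of what encode('ascii', 'ignore') does to a string, using a hypothetical helper called to_ascii. The extra decode step is only there so the result is text again on Python 3, where encode returns bytes:

```python
def to_ascii(text):
    # 'ignore' silently drops every character that has no ASCII equivalent
    ascii_bytes = text.encode('ascii', 'ignore')
    # Convert the bytes back to a text string (needed on Python 3)
    return ascii_bytes.decode('ascii')


# \u00c5 is the angstrom-like letter; it is removed, leaving two adjacent spaces
print(to_ascii(u"sub-5 \u00c5 maps"))
```

Dropping characters is lossy, so units and accented names disappear from the output. On Python 3, an alternative that keeps them is to open the output file in text mode with encoding='utf-8' and skip the encode call.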

