
Scraping PubMed using bs4

I have a dataset (a CSV file) of PubMed IDs. I need to iterate through them and, for each ID, get the title, year of publication, abstract, and MeSH terms, then save the results to a CSV file in this format:

id year_published title abstract mesh_terms     

where each item is in its own column. I attempted to use bs4 to do this and wrote the following:

import urllib2
from bs4 import BeautifulSoup
import csv

CSVfile = open('srData.csv')
fileReader = csv.reader(CSVfile)
Data = list(fileReader)
i = 0

with open('blank.csv','wb') as f1:
 writer=csv.writer(f1, delimiter='\t',lineterminator='\n',)
 for id in Data:
    try:
        soup = BeautifulSoup(urllib2.urlopen("http://www.ncbi.nlm.nih.gov/pubmed/" & id).read())
        jouryear = soup.find_all(attrs={"class": "cit"})
        year = jouryear[0].get_text()
        yearlength = len(year)
        titleend = year.find(".")
        year1 = titleend+2
        year2 = year1+1
        year3 = year2+1
        year4 = year3+1
        year5 = year4+1
        published_date = (year[year1:year5])

        title = soup.find_all(attrs={"class": "rprt abstract"})
        title = (title[0].h1.string)

        abstract = (soup.find_all(attrs={"class": "abstr"}))
        abstract = (abstract[0].p.string)
        writer.writerow([id, published_date, title, abstract])
    except:
        writer.writerow([id, "error"])
        print (id)
    i = i+1
    print i

However, this throws an error about appending a list to a URL. How can I fix this?

CSVfile = open('srData.csv')
fileReader = csv.reader(CSVfile)
Data = list(fileReader)

After these lines, Data is a list of lists: each sublist is one row of the CSV. That means that when you iterate over it:

for id in Data:

you get a list each time, not an ID string. Instead, write:

for row in Data:
    id = row[0]
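
For example (with made-up IDs purely for illustration), Data might look like:

Data = [['11111111'], ['22222222'], ['33333333']]

so each row is a one-element list such as ['11111111'], and row[0] is the string '11111111' you actually want.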

Also "http://www.ncbi.nlm.nih.gov/pubmed/" & id is definitely wrong. Use + , not & .
