How to extract text from a webpage using Python 2.7?

I'm trying to programmatically extract text from this webpage which describes a genome assembly in the public archive:


I have thousands of assemblies to track down, and for each one I want to extract the study accession, which is the code on the far left of the table beginning with "PRJ". The URL for each of these assemblies has the same format as the one above, i.e. "http://www.ebi.ac.uk/ena/data/view/ERS******". I have the ERS code for each of my assemblies, so I can construct the URL for each one.
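A minimal sketch of building those URLs from a list of ERS codes. The helper name `build_ena_url`, its `display` parameter, and the placeholder code ERS000001 are illustrative assumptions; only the base URL and ERS019623 come from this post.

```python
# Base URL taken from the question; everything else here is illustrative.
BASE_URL = "http://www.ebi.ac.uk/ena/data/view/"


def build_ena_url(ers_code, display=None):
    """Return the ENA page URL for one ERS sample accession.

    `display` is optional, e.g. "xml" for the "&display=xml" variant.
    """
    url = BASE_URL + ers_code.strip()
    if display:
        url += "&display=" + display
    return url


# ERS019623 appears in the question; ERS000001 is a made-up placeholder.
ers_codes = ["ERS019623", "ERS000001"]
urls = [build_ena_url(code) for code in ers_codes]
```

Stripping whitespace from each code guards against stray spaces or newlines when the codes are read from a file.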

I've tried a few different methods. Firstly, if you add "&display=XML" to the end of the URL it prints the XML (or at least I presume it prints the XML for the entire page; the problem is that the study accession "PRJ******" is nowhere to be seen there). I had used this to extract another code that I needed from the same webpage, the run accession, which is always of the format "ERR******", using the code below:

import urllib2
from bs4 import BeautifulSoup
import re
import csv

with open('/Users/bj5/Desktop/web_scrape_test.csv','rb') as f:
    reader = csv.reader(f) #opens csv containing list of ERS numbers
    for row in reader:
        sample = row[0] #reads index 0 (1st column) of the current row
        ERSpage = "http://www.ebi.ac.uk/ena/data/view/" + sample + "&display=xml" #creates URL using ERS number from this row
        page = urllib2.urlopen(ERSpage) #opens url and assigns it to variable page
        soup = BeautifulSoup(page, "html.parser") #parses the html/xml from page and assigns it to variable called soup
        page_text = soup.text #returns text from variable soup, i.e. no tags
        ERS = re.search('ERS......', page_text, flags=0).group(0) #returns first ERS followed by six wildcards
        ERR = re.search('ERR......', page_text, flags=0).group(0) #returns first ERR followed by six wildcards
        print ERS + ',' + ERR + ',' + "http://www.ebi.ac.uk/ena/data/view/" + sample #prints ERS,ERR,URL

This worked very well, but as the study accession is not in the XML I can't use it to access this.
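For reference, the regex step can be made a little stricter. This is only a sketch, assuming an accession is a three-letter prefix followed by exactly six digits, as the "ERR******" pattern suggests; the helper name `first_accession` and the sample text are made up for illustration.

```python
import re


def first_accession(text, prefix):
    """Return the first accession with the given prefix in text, or None.

    Assumes the format is the prefix followed by exactly six digits.
    """
    match = re.search(re.escape(prefix) + r"\d{6}(?!\d)", text)
    return match.group(0) if match else None


# Made-up text standing in for the page text returned by soup.text.
sample_text = "sample ERS019623 was sequenced in run ERR048142"
ers = first_accession(sample_text, "ERS")
err = first_accession(sample_text, "ERR")
```

Returning None when nothing matches avoids the AttributeError that calling .group(0) directly on a failed re.search would raise.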

I also attempted to use BeautifulSoup again to download the HTML by doing this:

from bs4 import BeautifulSoup
from urllib2 import urlopen

BASE_URL = "http://www.ebi.ac.uk/ena/data/view/ERS019623"

def get_category_links(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    print soup

get_category_links(BASE_URL) #calls the function so the parsed page is printed


But again I can't see the study accession in the output from this either...

I have also attempted to use a different python module, lxml, to parse the XML and HTML but haven't had any luck there either.

When I right click and inspect element on the page I can find the study accession by doing ctrl+F -> PRJ.

So my question is this: what is the code that I'm looking at in inspect element, XML or HTML (or something else)? Why does it look different to the code that prints in my console when I try and use BeautifulSoup to parse HTML? And finally how can I scrape the study accessions (PRJ******) from these webpages?

(I've only been coding for a couple of months and I'm entirely self-taught so apologies for the slightly confused nature of this question but I hope I've got across what it is that I'm trying to do. Any suggestions or advice would be much appreciated.)

In your sample, soup is a BeautifulSoup object: a representation of the parsed document.

If you want to print the entire HTML of the document, you can call print(soup.prettify()); if you only want the text within it, call print(soup.get_text()).

The soup object also offers other ways to reach the parts of the document you are interested in, such as navigating the parsed tree or searching it.

from bs4 import BeautifulSoup
import requests
import re

r = requests.get('http://www.ebi.ac.uk/ena/data/view/ERS019623&display=xml')
soup = BeautifulSoup(r.text, 'lxml')

ERS = soup.find('primary_id').text
ERR = soup.find('id', text=re.compile(r'^ERR')).text
url = 'http://www.ebi.ac.uk/ena/data/view/{}'.format(ERS)

print(ERS, ERR, url)


ERS019623 ERR048142 http://www.ebi.ac.uk/ena/data/view/ERS019623

bs4 can parse XML too; you can treat it much like HTML and look elements up by tag name, so there is no need to use a regex to extract the info.
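To illustrate looking values up by tag name rather than with a regex, here is a sketch that uses the standard library's xml.etree.ElementTree instead of bs4 (a swapped-in parser, chosen only so the example has no third-party dependencies). The XML snippet is a simplified, made-up stand-in, and the real ENA document's layout may differ.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical XML; not the real ENA response.
xml_text = """
<ROOT>
  <SAMPLE>
    <IDENTIFIERS>
      <PRIMARY_ID>ERS019623</PRIMARY_ID>
    </IDENTIFIERS>
    <LINKS>
      <ID>ERX000001</ID>
      <ID>ERR048142</ID>
    </LINKS>
  </SAMPLE>
</ROOT>
""".strip()

root = ET.fromstring(xml_text)

# iter() walks the whole tree, so the nesting depth does not matter.
ers = next(el.text for el in root.iter("PRIMARY_ID"))
# Take the first ID element whose text starts with "ERR".
err = next(el.text for el in root.iter("ID") if (el.text or "").startswith("ERR"))
```

Unlike the lxml HTML parser used above, ElementTree keeps tag names case-sensitive, so the lookups must match the case used in the document.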

I found a TEXT download link:


The fields in this link can be changed to get the data you want, like this:


By doing so, you can get all your data in a text file.
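Once such a text file is downloaded, it can be read with the csv module. A minimal sketch, assuming the file is tab-separated with a header row; the column names `sample_accession` and `study_accession` and the sample rows are hypothetical placeholders, not the confirmed fields of the report.

```python
import csv

# Hypothetical tab-separated report; real column names may differ.
report_text = (
    "sample_accession\tstudy_accession\n"
    "ERS000001\tPRJ000001\n"
    "ERS000002\tPRJ000002\n"
)

# csv.DictReader accepts any iterable of lines, not only an open file.
reader = csv.DictReader(report_text.splitlines(), delimiter="\t")

# Map each sample accession to its study accession.
study_by_sample = {
    row["sample_accession"]: row["study_accession"] for row in reader
}
```

With a real file, the same DictReader call would take the open file object in place of report_text.splitlines().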

