Web scraping data using Python?

I just started learning web scraping using Python. However, I've already run into some problems.

My goal is to web scrape the names of the different tuna species from fishbase.org ( http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon )

The problem: I'm unable to extract all of the species names.

This is what I have so far:

import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
html_doc = page.read()  # read the raw HTML before handing it to the parser

soup = BeautifulSoup(html_doc)

spans = soup.find_all(

From here, I don't know how to go about extracting the species names. I've thought of using a regex (i.e. soup.find_all("a", text=re.compile("\\d+\\s+\\d+"))) to capture the text inside the tags...

Any input will be highly appreciated!

You might as well take advantage of the fact that all the scientific names (and only the scientific names) are in <i> tags:

scientific_names = [it.text for it in soup.table.find_all('i')]

BeautifulSoup and regular expressions are two different approaches to parsing a web page. The former exists so that you don't have to bother so much with the latter.

You should read up on what BeautifulSoup actually does; it seems like you're underestimating its utility.
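
To make the contrast concrete, here is a minimal sketch that pulls the same italicized names out of a small HTML fragment both ways. The fragment and its contents are made up for illustration and are not the real fishbase.org markup; the code assumes Python 3 with the bs4 package installed.

```python
import re

from bs4 import BeautifulSoup

# Made-up fragment for illustration only; the real page markup may differ.
html = """
<table>
  <tr><td>Albacore tuna</td><td><i>Thunnus alalunga</i></td></tr>
  <tr><td>Yellowfin tuna</td><td><I class="sci">Thunnus
albacares</I></td></tr>
</table>
"""

# Regex approach: the pattern has to anticipate attributes, letter case
# and line breaks, so this naive version misses the second row.
regex_names = re.findall(r"<i>(.*?)</i>", html)

# BeautifulSoup approach: the parser normalizes the markup first, so both
# rows are found; split/join collapses the embedded line break.
soup = BeautifulSoup(html, "html.parser")
soup_names = [" ".join(tag.get_text().split()) for tag in soup.table.find_all("i")]

print(regex_names)
print(soup_names)
```

The regex could of course be extended to cope with these variations, but each new variation in the markup means another change to the pattern, which is the work the parser already does for you.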

What jozek suggests is the correct approach, but I couldn't get his snippet to work (possibly because I am not running the BeautifulSoup 4 beta). With BeautifulSoup 3, what worked for me was:

import urllib2
from BeautifulSoup import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)

soup = BeautifulSoup(page)

scientific_names = [it.text for it in soup.table.findAll('i')]

print scientific_names

Looking at the web page, I'm not sure exactly what information you want to extract. However, note that you can easily get the text inside a tag using its text attribute:

>>> from bs4 import BeautifulSoup
>>> html = '<a>some text</a>'
>>> soup = BeautifulSoup(html)
>>> [tag.text for tag in soup.find_all('a')]
[u'some text']

Thanks, everyone. I was able to solve the problem I was having with this code:

import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)

scientific_names = [it.text for it in soup.table.find_all('i')]

for item in scientific_names:
    print item
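
The code above targets Python 2, where urllib2 and the print statement exist. For readers on Python 3, a rough equivalent might look like the sketch below. The function names are hypothetical, the sample HTML is made up for a quick offline check, and the page is assumed to still keep its scientific names in <i> tags inside the first table.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_scientific_names(html_doc):
    """Return the text of every <i> tag inside the first table."""
    soup = BeautifulSoup(html_doc, "html.parser")
    if soup.table is None:  # no table found, e.g. an error page
        return []
    return [tag.get_text(strip=True) for tag in soup.table.find_all("i")]


def fetch_scientific_names(url):
    """Download a page and extract the names (needs network access)."""
    with urlopen(url) as page:
        return extract_scientific_names(page.read())


# Offline check on a made-up fragment; not the real fishbase.org markup.
sample = "<table><tr><td><i>Salmo salar</i></td></tr></table>"
print(extract_scientific_names(sample))
```

To run it against the live site, pass the search URL to fetch_scientific_names. Keeping the parsing in its own function makes it possible to check the extraction logic without a network connection.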

If you want a long-term solution, try Scrapy. It is quite simple to use, does a lot of the work for you, and is very customizable and extensible. You extract everything you need using XPath, which is more pleasant and reliable than regular expressions. Scrapy still lets you use re if you need it.
