
Web Scraping data using Python?

I just started learning web scraping using Python. However, I've already run into some problems.

My goal is to scrape the names of the different tuna species from fishbase.org ( http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon ).

The problem: I'm unable to extract all of the species names.

This is what I have so far:

import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
html_doc = page.read()

soup = BeautifulSoup(html_doc)

spans = soup.find_all(

From here, I don't know how I would go about extracting the species names. I've thought of using regex (i.e. soup.find_all("a", text=re.compile("\\d+\\s+\\d+"))) to capture the text inside the tag...

Any input will be highly appreciated!

You might as well take advantage of the fact that all the scientific names (and only scientific names) are in <i/> tags:

scientific_names = [it.text for it in soup.table.find_all('i')]

BS and RegEx are two different approaches to parsing a webpage. The former exists so you don't have to bother so much with the latter.

You should read up on what BS actually does; it seems like you're underestimating its utility.
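For illustration, here is a tiny made-up comparison (the HTML snippet below is invented for the example, not taken from fishbase.org) showing how BeautifulSoup lets you ask for tags directly instead of hand-writing a regex over the raw HTML:

import re
from bs4 import BeautifulSoup

html = '<table><tr><td><i>Thunnus albacares</i></td><td>Yellowfin tuna</td></tr></table>'

# with a regex you have to describe the markup yourself
print re.findall(r'<i>(.*?)</i>', html)

# with BeautifulSoup you simply ask for the <i> tags
soup = BeautifulSoup(html)
print [tag.text for tag in soup.find_all('i')]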

What jozek suggests is the correct approach, but I couldn't get his snippet to work (though that may be because I am not running the BeautifulSoup 4 beta). What worked for me was:

import urllib2
from BeautifulSoup import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)

soup = BeautifulSoup(page)

scientific_names = [it.text for it in soup.table.findAll('i')]

print scientific_names

Looking at the web page, I'm not sure exactly what information you want to extract. However, note that you can easily get the text inside a tag using the text attribute:

>>> from bs4 import BeautifulSoup
>>> html = '<a>some text</a>'
>>> soup = BeautifulSoup(html)
>>> [tag.text for tag in soup.find_all('a')]
[u'some text']

Thanks everyone... I was able to solve the problem I was having with this code:

import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)

scientific_names = [it.text for it in soup.table.find_all('i')]

for item in scientific_names:
    print item

If you want a long-term solution, try scrapy. It is quite simple and does a lot of the work for you. It is very customizable and extensible. You can extract all the URLs you need using XPath, which is more pleasant and reliable. Scrapy still allows you to use re if you need it.
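As a rough sketch of what a Scrapy spider for this page might look like (the spider name, output field and XPath below are assumptions for illustration, not something taken from the answers above):

import scrapy

class TunaSpider(scrapy.Spider):
    # hypothetical spider: the name, start URL and XPath are illustrative only
    name = 'fishbase_tuna'
    start_urls = [
        'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna',
    ]

    def parse(self, response):
        # assumes, as the earlier answers observed, that the scientific names
        # sit in <i> tags inside the results table
        for name in response.xpath('//table//i/text()').extract():
            yield {'scientific_name': name.strip()}

You could then run it with something like scrapy runspider tuna_spider.py -o names.json (the file names here are assumptions as well).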
