
Extracting tables from a webpage using BeautifulSoup 4

Be forgiving, I only started using BeautifulSoup today to deal with this problem.

I've managed to get it working by pulling in the URLs on the website. Each of the product pages on this website has a table that looks like the following:

<table width="100%" class="product-feature-table">
  <tbody>
    <tr>
      <td align="center"><table cellspacing="0" class="stats2">
        <tbody>
          <tr>
          <td class="hed" colspan="2">YYF Shutter Stats:</td>
          </tr>
          <tr>
          <td>Diameter:</td>
          <td>56 mm / 2.20 inches</td>
          </tr>
          <tr>
            <td>Width:</td>
            <td>44.40 mm / 1.74 inches</td>
          </tr>
          <tr>
            <td>Gap Width:</td>
            <td>4.75 mm / .18 inches</td>
          </tr>
          <tr>
            <td>Weight:</td>
            <td>67.8 grams</td>
          </tr>
          <tr>
            <td>Bearing Size:</td>
            <td>Size C (.250 x .500 x .187)<br>CBC SPEC Bearing</td>
          </tr>
          <tr>
            <td>Response:</td>
            <td>CBC Silicone Slim Pad (19mm)</td>
          </tr>
        </tbody>
        </table>
      <br>
      <br>
      </td>
    </tr>
  </tbody>
</table>

I'm trying to pull this table into some form of data that I could work with within a web app.

How would I go about extracting this from each webpage? The website has around 400 product pages that include this table. I'd preferably like to get each of the tables from the pages and put them into a database entry or text file named after the product.

As you can see, the table isn't exactly formatted well, but it is the only table on the page labeled with

class="product-feature-table"
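
For reference, here is a minimal sketch of how that one table could be isolated by its class and its label/value rows turned into a dict. The parse_stats helper and the URL are just for illustration, and the sketch assumes the two-column layout shown in the HTML above:

import urllib2
from bs4 import BeautifulSoup

def parse_stats(url):
    # hypothetical helper: fetch one product page and return its stats as a dict
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # the only table on the page with this class, per the question
    table = soup.find('table', class_='product-feature-table')
    stats = {}
    if table is None:
        return stats
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')
        # keep only the two-cell label/value rows; the header row
        # uses colspan and has a single cell, so it is skipped
        if len(cells) == 2:
            label = cells[0].get_text(strip=True).rstrip(':')
            value = cells[1].get_text(' ', strip=True)
            stats[label] = value
    return stats

print parse_stats('http://example.com/some-product-page')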

I have just been trying to edit a URL-scraping script, but I'm starting to get the feeling that I'm going about it all wrong.

My URL script is as follows:

import urllib2
from bs4 import BeautifulSoup

url = raw_input('Web-Address: ')

html = urllib2.urlopen('http://' + url).read()
soup = BeautifulSoup(html, 'html.parser')  # name a parser explicitly to avoid bs4's warning

# print every link on the page
for anchor in soup.findAll('a', href=True):
    print anchor['href']

I can get all these URLs into a text file, but I would much prefer to use SQLite or PostgreSQL. Are there any articles online that would help me understand these concepts better, without drowning the newbie?

You have probably already been here, but when I used BS (no pun intended) a while back, its doc page was where I started: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Personally, I found the official documentation could have been better, and the Beautiful Soup resources from the online community also seemed lacking at the time - this was about 3 or 4 years ago, though.

I hope both have come further since.

Another resource perhaps worth looking into is Mechanize: http://wwwsearch.sourceforge.net/mechanize/

First of all, if you want to extract all the tables from a page using BeautifulSoup, you could do it in the following way:

import urllib2
from bs4 import BeautifulSoup

url = raw_input('Web-Address: ')

html = urllib2.urlopen('http://' + url).read()
soup = BeautifulSoup(html, 'html.parser')

# extract all the tables in the HTML
tables = soup.find_all('table')

# get the class name for each (bs4 returns the class attribute as a
# list; .get() avoids a KeyError on tables that have no class at all)
for table in tables:
    class_name = table.get('class')

Once you have all the tables on the page, you can do anything you want with their data by walking the tr and td tags in the following way:

for table in tables:
    tr_tags = table.find_all('tr')

Remember that the tr tags are the rows inside the table. Then, to obtain the data inside the td tags, you could use something like this:

for table in tables:
    tr_tags = table.find_all('tr')

    for tr in tr_tags:
        td_tags = tr.find_all('td')

        for td in td_tags:
            # .string is None when a cell contains child tags (e.g. the <br>
            # in the bearing-size cell), so get_text() is the safer choice
            text = td.get_text(strip=True)
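
Since the stats table lays each property out as a label cell followed by a value cell, a small variation on the loop above could pair them directly. This sketch assumes every data row has exactly two cells, as in the HTML the question shows:

for table in tables:
    for tr in table.find_all('tr'):
        td_tags = tr.find_all('td')
        # only the label/value rows have exactly two cells;
        # the "YYF Shutter Stats:" header uses colspan and is skipped
        if len(td_tags) == 2:
            print '%s %s' % (td_tags[0].get_text(strip=True),
                             td_tags[1].get_text(' ', strip=True))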

If you want to follow all the links on a page and then find the tables, the code explained above would work for you: first retrieve all the URLs on the page, then move between them. For example:

initial_url = 'URL'
list_of_urls = []
visited = set()   # track pages already seen, otherwise the loop never ends

list_of_urls.append(initial_url)

while len(list_of_urls) > 0:

    current_url = list_of_urls.pop()
    if current_url in visited:
        continue
    visited.add(current_url)

    html = urllib2.urlopen('http://' + current_url).read()
    soup = BeautifulSoup(html, 'html.parser')

    for anchor in soup.find_all('a', href=True):
        list_of_urls.append(anchor['href'])

    # here put the code explained above, for example

    tables = soup.find_all('table')

    for table in tables:
        class_name = table.get('class')

        # continue with the above code..
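
One caveat worth flagging, as an assumption about the target site rather than something covered above: href values are often relative, so prepending 'http://' to them will break. The standard-library urlparse.urljoin resolves them against the current page, as in this sketch (the example URLs are made up):

from urlparse import urljoin

base = 'http://www.example-store.com/products/'
print urljoin(base, 'yyf-shutter.html')   # -> http://www.example-store.com/products/yyf-shutter.html
print urljoin(base, '../about.html')      # -> http://www.example-store.com/about.html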

To insert the data into an SQLite database, I recommend you read the following tutorial: Python: A Simple Step-by-Step SQLite Tutorial
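
As a rough starting point, Python's built-in sqlite3 module is enough to store one row per product stat. The schema, file name, and sample values here are made up for illustration; the stats dict would come from the scraping loop above:

import sqlite3

conn = sqlite3.connect('products.db')
cur = conn.cursor()

# illustrative schema: one row per (product, label, value) triple
cur.execute('''CREATE TABLE IF NOT EXISTS product_stats
               (product TEXT, label TEXT, value TEXT)''')

stats = {'Diameter': '56 mm / 2.20 inches', 'Weight': '67.8 grams'}
for label, value in stats.items():
    cur.execute('INSERT INTO product_stats VALUES (?, ?, ?)',
                ('YYF Shutter', label, value))

conn.commit()
conn.close()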
