Using Beautiful Soup to get values from cells in table rows

Question

Working with the HTML from http://coinmarketcap.com/, I'm trying to create a Python dictionary containing values from the HTML, for example:

{bitcoin: {Market_cap:'$11,247,442,728', Volume:'$64,668,900'}, ethereum: ....etc}

However, I'm unfamiliar with how the HTML is structured. For some values, such as the market cap, the cell (td) contains the data directly, i.e.:

<td class="no-wrap market-cap text-right" data-usd="11247442728.0" data-btc="15963828.0">

                      $11,247,442,728 

                </td>

However, for cells like the trading volume, the value is a link, so the format is different, i.e.:

<td class="no-wrap text-right"> 
                    <a href="/currencies/bitcoin/#markets" class="volume" data-usd="64668900.0" data-btc="91797.5">$64,668,900</a>
                </td>

Here is the code I'm working with:

import requests 
from bs4 import BeautifulSoup as bs

request = requests.get('http://coinmarketcap.com/')

content = request.content

soup = bs(content, 'html.parser')  

table = soup.findChildren('table')[0]

rows = table.findChildren('tr')

for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        print cell.string

This gives a result with loads of whitespace and missing data.

For each row, how can I get the name of the coin? For each cell, how can I access its value, whether it's a link (<a>) or a regular value?

EDIT:

By changing the for loop to:

for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        print cell.getText().strip().replace(" ", "")

I am able to get the data I want, i.e.:

1
Bitcoin
$11,254,003,178
$704.95
15,964,212
BTC
$63,057,100
-0.11%

However, it would be cool to have the class names for each cell, i.e.:

id: bitcoin 
marketcap: 11,254,003,178
etc......

Answer 1

You're almost there. Instead of the cell.string attribute, use cell.getText(). The .string attribute returns None when a cell has more than one child (for example, whitespace plus a link), which is why some of your data goes missing. You will probably also need to clean the output strings to remove excess whitespace. I've used a regex, but there are a few other options depending on the state of your data. I've also added Python 3 compatibility by importing the print function.

from __future__ import print_function
import requests
import re

from bs4 import BeautifulSoup as bs

request = requests.get('http://coinmarketcap.com/')

content = request.content

soup = bs(content, 'html.parser')  

table = soup.findChildren('table')[0]

rows = table.findChildren('tr')

for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
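        # getText() joins all text inside the cell, including nested links
        cell_content = cell.getText()
        # Collapse runs of whitespace into single spaces (raw string avoids an invalid-escape warning)
        clean_content = re.sub(r'\s+', ' ', cell_content).strip()
        print(clean_content)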
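
To see why cell.string loses data, here is a small sketch that runs both accessors on the two cells quoted in the question. The surrounding table and tr tags are my addition, only there so the snippet parses on its own, and get_text(strip=True) is the snake_case spelling of getText() with whitespace stripping switched on.

```python
from bs4 import BeautifulSoup

# The two cell shapes quoted in the question. The <table>/<tr> wrapper is an
# assumption added so the fragment can be parsed without fetching the page.
html = """
<table><tr>
<td class="no-wrap market-cap text-right" data-usd="11247442728.0" data-btc="15963828.0">
    $11,247,442,728
</td>
<td class="no-wrap text-right">
    <a href="/currencies/bitcoin/#markets" class="volume" data-usd="64668900.0" data-btc="91797.5">$64,668,900</a>
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cap_cell, vol_cell = soup.find_all("td")

# .string only works when the tag has exactly one child.
cap_string = cap_cell.string   # a single text node, padded with whitespace
vol_string = vol_cell.string   # None: the cell holds whitespace plus an <a> tag

# get_text() walks every descendant, so it works for both shapes.
cap_text = cap_cell.get_text(strip=True)
vol_text = vol_cell.get_text(strip=True)

print(repr(cap_string), repr(vol_string))
print(cap_text, vol_text)
```

The volume cell is the one that comes back as None from .string, which matches the missing data in the question's output.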

The table headings are stored in the first row, so you can extract them like so:

headers = [x.getText() for x in rows[0].findChildren('th')]
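
If you would rather key the values by class name, as in the dictionary from the question, one possible approach is sketched below. It relies only on what the quoted snippets show: the value sits on whichever tag carries a data-usd attribute, and the coin name appears in the /currencies/<name>/ link. The header row, the LAYOUT_CLASSES set and the helper names field_name, coin_slug and parse_rows are hypothetical, made up for this illustration, and the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Sample row built from the two cells quoted in the question. The header row
# and the <table>/<tr> wrappers are assumptions so the example is self-contained.
html = """
<table>
<tr><th>Market Cap</th><th>Volume (24h)</th></tr>
<tr>
<td class="no-wrap market-cap text-right" data-usd="11247442728.0" data-btc="15963828.0">
    $11,247,442,728
</td>
<td class="no-wrap text-right">
    <a href="/currencies/bitcoin/#markets" class="volume" data-usd="64668900.0" data-btc="91797.5">$64,668,900</a>
</td>
</tr>
</table>
"""

# Classes that only describe layout, so they make poor dictionary keys.
# (Hypothetical list, taken from the two snippets above.)
LAYOUT_CLASSES = {"no-wrap", "text-right"}


def field_name(tag):
    """Return the first class on `tag` that is not a layout class, or None."""
    for cls in tag.get("class", []):
        if cls not in LAYOUT_CLASSES:
            return cls
    return None


def coin_slug(row):
    """Take the coin name from the first '/currencies/<name>/' link in the row."""
    link = row.find("a", href=True)
    if link is None:
        return None
    parts = link["href"].strip("/").split("/")
    return parts[1] if len(parts) > 1 and parts[0] == "currencies" else None


def parse_rows(soup):
    """Build {coin: {class_name: displayed_text}} from every row with a coin link."""
    coins = {}
    for row in soup.find_all("tr"):
        slug = coin_slug(row)
        if slug is None:  # header row, or a row without a coin link
            continue
        fields = {}
        # The value lives on whichever tag carries data-usd: the <td> or the <a>.
        for tag in row.find_all(attrs={"data-usd": True}):
            name = field_name(tag)
            if name is not None:
                fields[name] = tag.get_text(strip=True)
        coins[slug] = fields
    return coins


soup = BeautifulSoup(html, "html.parser")
result = parse_rows(soup)
print(result)
```

If you need numbers rather than formatted strings, reading tag["data-usd"] instead of the displayed text would avoid parsing the dollar signs and commas.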

Answer 1: 2, accepted, 2016-11-08 05:03:13