How can I read these cells from HTML code with Python web scraping?

Question

I want to scrape exchange rate information from this website and then load it into a database: https://www.mnb.hu/arfolyamok

I need this part of the HTML:

<tbody>
    <tr>
        <td class="valute"><b>CHF</b></td>
        <td class="valutename">svájci frank</td>
        <td class="unit">1</td>
        <td class="value">284,38</td>
    </tr>
    <tr>
        <td class="valute"><b>EUR</b></td>
        <td class="valutename">euro</td>
        <td class="unit">1</td>
        <td class="value">308,54</td>
    </tr>
    <tr>
        <td class="valute"><b>USD</b></td>
        <td class="valutename">USA dollár</td>
        <td class="unit">1</td>
        <td class="value">273,94</td>
    </tr>
</tbody>

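That is why I wrote some code, but something is wrong with it. How can I fix it, and what do I have to change? I only need the "valute", "valutename", "unit" and "value" data. I am using Python 2.7.13 on Windows 7.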

The error message is: "There's an error in your program: unindent does not match any outer indentation level"

Here is the code:

import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'https://www.mnb.hu/arfolyamok'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody', attrs={'class': 'stripe'})

table = str(soup)
table = table.split("<tbody>")

list_of_rows = []
for row in table[1].findAll('tr')[1:]:
    list_of_cells = []
   for cell in row.findAll('td'):
       text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
   list_of_rows.append(list_of_cells)

print list_of_rows

outfile = open("./inmates.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Pénznem", "Devizanév", "Egység", "Forintban kifejezett érték"])
writer.writerows(list_of_rows)

Answer 1

Your code has a whitespace (indentation) problem on line 18, for cell in row.findAll('td'):, and on line 20, list_of_cells.append(text). Here is the fixed code:

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.mnb.hu/arfolyamok'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody', attrs={'class': 'stripe'})

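# str(soup).split("<tbody>") gives plain strings, which have no findAll().
if table is None:
    table = soup.find('tbody')  # fall back to the first <tbody>
list_of_rows = []
for row in table.findAll('tr'):  # every <tr> in the excerpt's <tbody> is data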
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

print list_of_rows

outfile = open("./inmates.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Pénznem", "Devizanév", "Egység", "Forintban kifejezett érték"])
writer.writerows(list_of_rows)
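
To see why the original version failed, the indentation error can be reproduced in isolation. The sketch below is written for Python 3 and uses only the built-in compile(); the helper name find_indentation_error and the two sample snippets are made up for illustration and are not part of your script.

```python
def find_indentation_error(source):
    """Return (line_number, message) for an indentation error, or None."""
    try:
        # compile() only parses the source; nothing is executed.
        compile(source, "<example>", "exec")
    except IndentationError as exc:
        return exc.lineno, exc.msg
    return None


# Mirrors the question: a 4-space block followed by a 3-space "for".
BAD_SOURCE = (
    "for row in [1]:\n"
    "    cells = []\n"
    "   for cell in [2]:\n"
    "       cells.append(cell)\n"
)

# Same logic with consistent 4-space indentation.
GOOD_SOURCE = (
    "for row in [1]:\n"
    "    cells = []\n"
    "    for cell in [2]:\n"
    "        cells.append(cell)\n"
)

print(find_indentation_error(BAD_SOURCE))
print(find_indentation_error(GOOD_SOURCE))
```

The second loop in BAD_SOURCE is indented by three spaces, which matches neither the four-space block it leaves nor the zero-space outer level, so Python rejects the file before running anything.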

However, after running the fixed code you will run into another problem, a character encoding error. It will read: "SyntaxError: Non-ASCII character '\xc3' in file testoasd.py on line 27, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details"

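How do you fix it? Simple: add the encoding declaration # -*- coding: utf-8 -*- at the very top of your code (line 1). Python 2 only honors this declaration on the first or second line of the file. That should fix it.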
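
As a rough illustration of where the declaration goes, here is a small sketch that writes the Hungarian header row. It is written for Python 3, where source files are UTF-8 by default and the first line is optional; on Python 2.7 that same first line is what lets the non-ASCII header parse. The names HEADER and rows_to_csv are hypothetical, and the sample row is copied from the HTML in the question.

```python
# -*- coding: utf-8 -*-
import csv
import io

# Header row from the question; it contains non-ASCII characters.
HEADER = ["Pénznem", "Devizanév", "Egység", "Forintban kifejezett érték"]


def rows_to_csv(rows):
    """Return the header plus the given rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv([["CHF", "svájci frank", "1", "284,38"]])
print(csv_text)
```

The value 284,38 comes out quoted because it contains the delimiter. To write a real file in Python 3 instead of an in-memory buffer, open it with open(path, "w", encoding="utf-8", newline="").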

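EDIT: I just noticed that you are using BeautifulSoup the wrong way and importing it incorrectly. I have fixed the import to from bs4 import BeautifulSoup. When using BeautifulSoup you should also specify a parser; otherwise bs4 picks one for you and emits a warning. So,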

soup = BeautifulSoup(html)

becomes:

soup = BeautifulSoup(html, "html.parser")
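
Putting the parser argument to work, a minimal sketch of reading the four cells per row might look like the following. It assumes Python 3 with the bs4 package installed and parses the HTML excerpt from the question rather than the live page, whose markup may differ. The function name parse_rates and the SAMPLE_HTML constant are illustrative only.

```python
from bs4 import BeautifulSoup

# Excerpt copied from the question; the live page may be structured differently.
SAMPLE_HTML = """
<tbody>
    <tr>
        <td class="valute"><b>CHF</b></td>
        <td class="valutename">svájci frank</td>
        <td class="unit">1</td>
        <td class="value">284,38</td>
    </tr>
    <tr>
        <td class="valute"><b>EUR</b></td>
        <td class="valutename">euro</td>
        <td class="unit">1</td>
        <td class="value">308,54</td>
    </tr>
    <tr>
        <td class="valute"><b>USD</b></td>
        <td class="valutename">USA dollár</td>
        <td class="unit">1</td>
        <td class="value">273,94</td>
    </tr>
</tbody>
"""


def parse_rates(html):
    """Return one dict per <tr>, keyed by each cell's first class name."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:
        return []
    rows = []
    for tr in tbody.find_all("tr"):
        row = {}
        for td in tr.find_all("td"):
            classes = td.get("class") or []
            if classes:
                # get_text(strip=True) also drops the nested <b> markup.
                row[classes[0]] = td.get_text(strip=True)
        if row:
            rows.append(row)
    return rows


rates = parse_rates(SAMPLE_HTML)
print(rates)
```

Keying each cell by its class attribute keeps only the "valute", "valutename", "unit" and "value" fields and does not depend on column order. The values stay as strings such as "284,38"; converting the decimal comma is left to whatever loads the data into the database.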

Solution 1
0 2017-07-13 13:33:24
