How can I read these cells from HTML with Python web scraping?

I want to scrape the exchange rate information from this website and then load it into a database: https://www.mnb.hu/arfolyamok

I need this part of the HTML:

<tbody>
    <tr>
        <td class="valute"><b>CHF</b></td>
        <td class="valutename">svájci frank</td>
        <td class="unit">1</td>
        <td class="value">284,38</td>
    </tr>
    <tr>
        <td class="valute"><b>EUR</b></td>
        <td class="valutename">euro</td>
        <td class="unit">1</td>
        <td class="value">308,54</td>
    </tr>
    <tr>
        <td class="valute"><b>USD</b></td>
        <td class="valutename">USA dollár</td>
        <td class="unit">1</td>
        <td class="value">273,94</td>
    </tr>
</tbody>

That's why I wrote some code, but something is wrong with it. How can I fix it, and where do I have to change it? I only need the "valute", "valutename", "unit" and "value" data. I am working with Python 2.7.13 on Windows 7.

The error message is: "There's an error in your program: unindent does not match any outer indentation level"

The code is here:

import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'https://www.mnb.hu/arfolyamok'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody', attrs={'class': 'stripe'})

table = str(soup)
table = table.split("<tbody>")

list_of_rows = []
for row in table[1].findAll('tr')[1:]:
    list_of_cells = []
   for cell in row.findAll('td'):
       text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
   list_of_rows.append(list_of_cells)

print list_of_rows

outfile = open("./inmates.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Pénznem", "Devizanév", "Egység", "Forintban kifejezett érték"])
writer.writerows(list_of_rows)

You have a whitespace (indentation) problem in your code from line 18, for cell in row.findAll('td'):, to line 20, list_of_cells.append(text). Here's the fixed code:

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.mnb.hu/arfolyamok'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody', attrs={'class': 'stripe'})

table = str(soup)
table = table.split("<tbody>")

list_of_rows = []
for row in table[1].findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

print list_of_rows

outfile = open("./inmates.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Pénznem", "Devizanév", "Egység", "Forintban kifejezett érték"])
writer.writerows(list_of_rows)

But after executing this code you'll face another problem: a character encoding error. It will read "SyntaxError: Non-ASCII character '\xc3' in file testoasd.py on line 27, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details".

How do you fix that? Simple enough: add the encoding declaration # -*- coding: utf-8 -*- at the very top of your code (line 1). That should fix it.
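
With that declaration in place, the first lines of the script simply become:

# -*- coding: utf-8 -*-
import csv
import requests
from bs4 import BeautifulSoup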

EDIT: I just noticed that you're using BeautifulSoup the wrong way and importing it incorrectly as well. I've fixed the import to from bs4 import BeautifulSoup, and when using BeautifulSoup you also need to specify a parser. So,

soup = BeautifulSoup(html)

would become:

soup = BeautifulSoup(html, "html.parser")
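
For reference, here is a minimal sketch of the whole script that works on the parsed soup directly rather than splitting the raw HTML string (splitting with str(soup).split("<tbody>") leaves plain strings, which have no findAll() method). It assumes the page still serves the rate table inside a tbody, and it encodes the cells as UTF-8 so csv.writer accepts them under Python 2:

# -*- coding: utf-8 -*-
import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.mnb.hu/arfolyamok'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Work on the parsed tree instead of splitting the raw HTML string.
tbody = soup.find('tbody')

list_of_rows = []
for row in tbody.findAll('tr'):
    # Collect the text of every td cell in the row.
    cells = [cell.get_text(strip=True) for cell in row.findAll('td')]
    if cells:  # skip header or empty rows
        list_of_rows.append(cells)

print list_of_rows

outfile = open("./inmates.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Pénznem", "Devizanév", "Egység", "Forintban kifejezett érték"])
# csv.writer in Python 2 expects byte strings, so encode each cell as UTF-8.
writer.writerows([[c.encode('utf-8') for c in r] for r in list_of_rows])
outfile.close()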
