
How can I read these cells from an HTML page with Python web scraping?

I want to scrape the exchange rate information from this website and then store it in a database: https://www.mnb.hu/arfolyamok

I need this part of the HTML:

<tbody>
    <tr>
        <td class="valute"><b>CHF</b></td>
        <td class="valutename">svájci frank</td>
        <td class="unit">1</td>
        <td class="value">284,38</td>
    </tr>
    <tr>
        <td class="valute"><b>EUR</b></td>
        <td class="valutename">euro</td>
        <td class="unit">1</td>
        <td class="value">308,54</td>
    </tr>
    <tr>
        <td class="valute"><b>USD</b></td>
        <td class="valutename">USA dollár</td>
        <td class="unit">1</td>
        <td class="value">273,94</td>
    </tr>
</tbody>

That's why I wrote some code, but something is wrong with it. How can I fix it, and where do I have to change it? I only need the "valute", "valutename", "unit" and "value" data. I am working with Python 2.7.13 on Windows 7.

The error message is: "There's an error in your program: unindent does not match any outer indentation level"

The code is here:

import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'https://www.mnb.hu/arfolyamok'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody', attrs={'class': 'stripe'})

table = str(soup)
table = table.split("<tbody>")

list_of_rows = []
for row in table[1].findAll('tr')[1:]:
    list_of_cells = []
   for cell in row.findAll('td'):
       text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
   list_of_rows.append(list_of_cells)

print list_of_rows

outfile = open("./inmates.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Pénznem", "Devizanév", "Egység", "Forintban kifejezett érték"])
writer.writerows(list_of_rows)

You have an indentation problem in your code, from line 18 (for cell in row.findAll('td'):) to line 20 (list_of_cells.append(text)): those lines are indented with a different number of spaces than the surrounding block. There is also a second problem: after table = str(soup) and the split("<tbody>"), table[1] is a plain string, and a string has no findAll method, so the loop has to use the tbody tag that soup.find() returned instead. Here's the fixed code:

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.mnb.hu/arfolyamok'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
# work with the tbody tag that BeautifulSoup found instead of splitting the raw HTML string
table = soup.find('tbody', attrs={'class': 'stripe'})

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        # bs4 decodes &nbsp; into the unicode non-breaking space, so strip that instead
        text = cell.text.replace(u'\xa0', '').strip()
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

print list_of_rows

outfile = open("./inmates.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Pénznem", "Devizanév", "Egység", "Forintban kifejezett érték"])
writer.writerows(list_of_rows)
outfile.close()

But after executing this code you'll face another problem: a character encoding error. It will read "SyntaxError: Non-ASCII character '\xc3' in file testoasd.py on line 27, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details"

How do you fix that? Simple enough: add the encoding declaration # -*- coding: utf-8 -*- at the very top of your code (first line). That should fix it.
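
For example, the top of the file then starts like this (the declaration must be the first or second line so that Python 2 applies PEP 263):

# -*- coding: utf-8 -*-
import csv
import requests
from bs4 import BeautifulSoup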

EDIT: I just noticed that you're using BeautifulSoup the wrong way and importing it incorrectly as well. I've fixed the import to from bs4 import BeautifulSoup above, and when creating the BeautifulSoup object you should also specify a parser. So,

soup = BeautifulSoup(html)

would become:

soup = BeautifulSoup(html, "html.parser")
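
Putting all of the fixes together, here is a minimal end-to-end sketch for Python 2.7. It assumes the page still serves its rates in a tbody like the one quoted in the question, and it encodes each cell to UTF-8 before writing, because Python 2's csv module cannot write unicode strings that contain non-ASCII characters such as "svájci frank":

# -*- coding: utf-8 -*-
import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.mnb.hu/arfolyamok'
response = requests.get(url)

# parse the page and grab the first tbody (assumed to hold the rate rows)
soup = BeautifulSoup(response.content, "html.parser")
tbody = soup.find('tbody')

list_of_rows = []
for row in tbody.findAll('tr'):
    # one entry per td: valute, valutename, unit, value
    cells = [cell.get_text(strip=True).encode('utf-8') for cell in row.findAll('td')]
    if cells:
        list_of_rows.append(cells)

print list_of_rows

with open("./inmates.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Pénznem", "Devizanév", "Egység", "Forintban kifejezett érték"])
    writer.writerows(list_of_rows)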
