
Extract data from html to csv using BeautifulSoup

I want to extract data from a weather site and write it to a CSV file for further analysis. I am using Python and BeautifulSoup, and I am struggling to get the affected cities and their values out of the weather report. Here is what the HTML looks like:

> <html>
>  <head>
>   <meta charset="utf-8"/>
>  </head>
>  <body>
>   <div id="main">
>    <div id="wettertab">
>     <p>
>      <strong>
>       Letzte Aktualisierung: Do, 10. Aug, 18:41 Uhr
>      </strong>
>     </p>
>     <h1 id="Hessen">
>      Hessen
>     </h1>
>     <h2 id="Gemeinde Aarbergen">
>      Gemeinde Aarbergen
>     </h2>
>     <table>
>      <colgroup>
>       <col <="" class="firstColumn" col=""/>
>       <col class="colorColumn"/>
>       <col class="colorColumn"/>
>       <col class="colorColumn"/>
>       <thead>
>        <tr>
>         <th>
>          Schlagzeile
>         </th>
>         <th>
>          Gültig von
>         </th>
>         <th>
>          Gültig bis
>         </th>
>         <th>
>          Beschreibung
>         </th>
>        </tr>
>       </thead>
>       <tr>
>        <td>
>         Amtliche WARNUNG vor DAUERREGEN
>        </td>
>        <td>
>         Do, 10. Aug, 12:00 Uhr
>        </td>
>        <td>
>         Sa, 12. Aug, 06:00 Uhr
>        </td>
>        <td>
>         Es tritt Dauerregen mit Unterbrechungen auf. Dabei werden Niederschlagsmengen zwischen 40 l/m² und 60 l/m² erwartet.
>        </td>
>       </tr>
>      </colgroup>
>     </table>

There are four values from the tables that I need:

<tr> 
<td> Amtliche WARNUNG vor DAUERREGEN 
</td> 
<td> Do, 10. Aug, 12:00 Uhr 
</td> 
<td> Sa, 12. Aug, 06:00 Uhr 
</td> 
<td> Es tritt Dauerregen mit Unterbrechungen auf. Dabei werden Niederschlagsmengen zwischen 40 l/m² und 60 l/m² erwartet. 
</td> 
</tr>

And I also need the name of the place:

<h2 id="Gemeinde Aarbergen">
 Gemeinde Aarbergen
</h2>

The "h2" tag always comes directly before the table, but as far as I can see it is not part of the table itself.

This is my code so far:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("html_warnung.html")
soup = BeautifulSoup(html, 'html.parser')

table = soup.findAll("table")
for div in table:
    row = ''
    rows = div.findAll('td')

    for row in rows:
        print(row.text)

Now I can print the values from the tables, and I can also get the city name by:

gemeinde_list = []
for gemeinde in soup.findAll('h2'):
    gemeinde_list.append(gemeinde.get("id"))

What would be the best way to export all of this information together to a CSV file, so that I get separated values like these:

Gemeinde Aarbergen
Amtliche WARNUNG vor DAUERREGEN
Do, 10. Aug, 12:00 Uhr
Sa, 12. Aug, 06:00 Uhr
Es tritt Dauerregen wechselnder Intensität auf. Dabei werden Niederschlagsmengen zwischen 35 l/m² und 50 l/m² erwartet. In Staulagen werden Mengen bis 70 l/m² erreicht.

I am using Python 3.6. Any help would be appreciated.

Since neither the table nor the heading has any distinguishing attributes, you can use the find_next_siblings / find_previous_siblings methods to get neighbouring tags.

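tables = soup.find_all('table')
data = []
for table in tables:
    # The nearest preceding <h2> sibling carries the place name in its id.
    heading = table.find_previous_sibling('h2')
    place = heading.get('id') if heading else None
    # Collect the stripped text of every cell in this table.
    cells = [td.get_text(strip=True) for td in table.find_all('td')]
    data.append([place] + cells)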

The data variable is a nested list (one inner list per table), which you can now write to a CSV file.

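import csv

# Python 3: open in text mode with newline='' so the csv module controls line endings.
with open('my_file.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(data)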
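If a table can hold more than one warning, it may be more convenient to emit one CSV row per table row instead of one per table. The following is only a sketch under a few assumptions: the sample HTML is a trimmed copy of the markup in the question, the helper name extract_rows is made up for illustration, and an in-memory buffer stands in for the output file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed sample modelled on the HTML in the question.
SAMPLE_HTML = """
<div id="wettertab">
  <h2 id="Gemeinde Aarbergen">Gemeinde Aarbergen</h2>
  <table>
    <thead>
      <tr><th>Schlagzeile</th><th>Gültig von</th><th>Gültig bis</th><th>Beschreibung</th></tr>
    </thead>
    <tr>
      <td>Amtliche WARNUNG vor DAUERREGEN</td>
      <td>Do, 10. Aug, 12:00 Uhr</td>
      <td>Sa, 12. Aug, 06:00 Uhr</td>
      <td>Es tritt Dauerregen mit Unterbrechungen auf. Dabei werden Niederschlagsmengen zwischen 40 l/m² und 60 l/m² erwartet.</td>
    </tr>
  </table>
</div>
"""


def extract_rows(html):
    """Return one list per warning: [place, headline, valid from, valid until, description]."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table"):
        # The place name sits in the nearest <h2> before the table.
        heading = table.find_previous_sibling("h2")
        place = heading.get_text(strip=True) if heading else ""
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # the header row has only <th> cells, so it is skipped
                rows.append([place] + cells)
    return rows


rows = extract_rows(SAMPLE_HTML)

# In-memory stand-in for open("my_file.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The csv module quotes the date fields automatically because they contain commas, so the columns stay intact when the file is read back.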

You can put the data you want to save in a CSV row into a tuple: assign each value to a variable as you extract it, then collect the variables in a tuple. I do not fully understand the structure of the data you are extracting, but I would guess something like this:

city_name = "Gemeinde Aarbergen"
start_date = "Do, 10. Aug, 12:00 Uhr"
end_date = "Sa, 12. Aug, 06:00 Uhr"
desc = "Es tritt Dauerregen wechselnder Intensität auf. Dabei werden Niederschlagsmengen zwischen 35 l/m² und 50 l/m² erwartet. In Staulagen werden Mengen bis 70 l/m² erreicht."

As I said, I don't know what the fields are, so you can name them better. Then you will have:

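import csv

csv_row = (city_name, start_date, end_date, desc)
# filename is the path of the output CSV file; define it before running.
# Python 3: open in text mode with newline='' rather than 'wb'.
with open(filename, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    writer.writerow(csv_row)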
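One detail worth checking is that the date values themselves contain commas. A small sketch, using an in-memory buffer instead of a real file and only three of the fields above, suggests that csv.writer quotes such fields and csv.reader restores them unchanged:

```python
import csv
import io

# Values taken from the question; two of them contain commas.
csv_row = ("Gemeinde Aarbergen", "Do, 10. Aug, 12:00 Uhr", "Sa, 12. Aug, 06:00 Uhr")

# Write the row to an in-memory buffer (stand-in for a file opened with newline="").
buffer = io.StringIO(newline="")
csv.writer(buffer, delimiter=",").writerow(csv_row)
line = buffer.getvalue()
print(repr(line))

# Read it back: the quoted fields come out as three values again.
parsed = next(csv.reader(io.StringIO(line, newline="")))
print(parsed)
```

So there is no need to escape or replace the commas by hand before writing.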

Hope this makes sense.

