
Writing the news to CSV-file (Python 3, BeautifulSoup)

I want Python 3.6 to write the output of the following code to a CSV file. Ideally there would be one row per article (it's a news website) and four columns: "Title", "URL", "Category" (#Politik, etc.), and "PublishedAt".

from bs4 import BeautifulSoup
import requests

website = 'http://spiegel.de/schlagzeilen'
r = requests.get(website)
soup = BeautifulSoup((r.content), "lxml")

div = soup.find("div", {"class": "schlagzeilen-content schlagzeilen-overview"})

for a in div.find_all('a', title=True):
    print(a.text, a.find_next_sibling('span').text)
    print(a.get('href'))

For writing to a csv I already have this...

import csv
import datetime

with open('%s_schlagzeilen.csv' % datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S.%f'), 'w', newline='',
          encoding='utf-8') as file:
    w = csv.writer(file, delimiter="|")
    w.writerow([...])

...and need to know what to do next. Thanks in advance!

You can collect all the extracted fields into a list of dictionaries and use csv.DictWriter to write them to the CSV file:

import csv
import datetime

from bs4 import BeautifulSoup
import requests


website = 'http://spiegel.de/schlagzeilen'
r = requests.get(website)
soup = BeautifulSoup((r.content), "lxml")

articles = []
for a in soup.select(".schlagzeilen-content.schlagzeilen-overview a[title]"):
    category, published_at = a.find_next_sibling(class_="headline-date").get_text().split(",")

    articles.append({
        "Title": a.get_text(),
        "URL": a.get('href'),
        "Category": category.strip(" ()"),
        "PublishedAt": published_at.strip(" ()")
    })

filename = '%s_schlagzeilen.csv' % datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S.%f')
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(filename, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=["Title", "URL", "Category", "PublishedAt"])

    writer.writeheader()
    writer.writerows(articles)

Note how the category and the "published at" value are located: for each link we go to the next sibling element with the class headline-date, split its text on the comma, and strip the surrounding spaces and parentheses.
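To make the split-and-strip step easier to follow, here is a minimal sketch that runs without touching the website. It assumes the sibling text looks like "(Politik, 13:37)", which is inferred from the split and strip calls above rather than checked against the live page. The helper name parse_headline_date and the sample row are hypothetical, and the "|" delimiter is taken from the question, not from the answer's code.

```python
import csv
import io


def parse_headline_date(text):
    """Split text such as "(Politik, 13:37)" into (category, published_at).

    Hypothetical helper; the input format is an assumption.
    """
    # partition() splits on the first comma only, so extra commas in the
    # second part do not cause an unpacking error.
    category, _, published_at = text.partition(",")
    return category.strip(" ()"), published_at.strip(" ()")


# Made-up sample standing in for one scraped link and its sibling text.
samples = [
    ("Example headline", "/politik/example.html", "(Politik, 13:37)"),
]

rows = []
for title, url, meta in samples:
    category, published_at = parse_headline_date(meta)
    rows.append({
        "Title": title,
        "URL": url,
        "Category": category,
        "PublishedAt": published_at,
    })

# Write to an in-memory buffer instead of a file so the result can be inspected.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(
    buffer,
    fieldnames=["Title", "URL", "Category", "PublishedAt"],
    delimiter="|",
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Using partition instead of split(",") is a small design choice: if the text contains no comma, published_at simply comes back empty instead of raising a ValueError during unpacking.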
