Writing the news to a CSV file (Python 3, BeautifulSoup)
I want Python 3.6 to write the output of the following code to a CSV file. Ideally there would be one row per article (it's a news website) and four columns: "Title", "URL", "Category" (#Politik, etc.), and "PublishedAt".
from bs4 import BeautifulSoup
import requests
website = 'http://spiegel.de/schlagzeilen'
r = requests.get(website)
soup = BeautifulSoup((r.content), "lxml")
div = soup.find("div", {"class": "schlagzeilen-content schlagzeilen-overview"})
for a in div.find_all('a', title=True):
    print(a.text, a.find_next_sibling('span').text)
    print(a.get('href'))
For writing to a CSV file, I already have this:
import csv
import datetime

with open('%s_schlagzeilen.csv' % datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S.%f'), 'w', newline='',
          encoding='utf-8') as file:
    w = csv.writer(file, delimiter="|")
    w.writerow([...])
...and I need to know what to do next. Thanks in advance!
You can collect all the extracted fields into a list of dictionaries and use csv.DictWriter to write them to the CSV file:
import csv
import datetime
from bs4 import BeautifulSoup
import requests
website = 'http://spiegel.de/schlagzeilen'
r = requests.get(website)
soup = BeautifulSoup((r.content), "lxml")
articles = []
for a in soup.select(".schlagzeilen-content.schlagzeilen-overview a[title]"):
    category, published_at = a.find_next_sibling(class_="headline-date").get_text().split(",")
    articles.append({
        "Title": a.get_text(),
        "URL": a.get('href'),
        "Category": category.strip(" ()"),
        "PublishedAt": published_at.strip(" ()")
    })
filename = '%s_schlagzeilen.csv' % datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S.%f')
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(filename, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=["Title", "URL", "Category", "PublishedAt"])
    writer.writeheader()
    writer.writerows(articles)
Note how we locate the category and the "published at" value: we go to the next sibling element, split its text on the comma, and strip the surrounding parentheses.
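The split-and-strip step can be tried without touching the network. The sketch below is only an illustration: the helper names `split_meta` and `rows_to_csv` and the sample values are hypothetical, and it assumes the sibling text looks like "(Politik, 12:34)", which the real page may not match. It uses `str.partition` rather than unpacking the result of `split(",")`, since unpacking raises a ValueError when the text contains no comma or more than one. The "|" delimiter is taken from the attempt in the question.

```python
import csv
import io

FIELDNAMES = ["Title", "URL", "Category", "PublishedAt"]


def split_meta(text):
    """Split text such as '(Politik, 12:34)' into (category, published_at).

    Hypothetical helper. partition() splits on the first comma only and
    returns an empty second part when there is no comma at all.
    """
    category, _, published_at = text.strip().strip("()").partition(",")
    return category.strip(), published_at.strip()


def rows_to_csv(rows, delimiter="|"):
    """Write a list of dicts to CSV text held in memory and return it."""
    buffer = io.StringIO(newline="")  # leave line endings to the csv module
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, delimiter=delimiter)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample values standing in for the scraped title, href and
# sibling text of each article link.
samples = [
    ("First headline", "/politik/a-1.html", "(Politik, 12:34)"),
    ("Second headline", "/sport/a-2.html", "(Sport, 11:05)"),
]

rows = []
for title, url, meta in samples:
    category, published_at = split_meta(meta)
    rows.append({
        "Title": title,
        "URL": url,
        "Category": category,
        "PublishedAt": published_at,
    })

csv_text = rows_to_csv(rows)
print(csv_text)
```

To write to disk instead, the same DictWriter calls can be pointed at a file opened with newline='' and encoding='utf-8', as in the answer above.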