
How to export data from a BeautifulSoup scrape to a CSV file

I found this code online and would like to know how to export the collected data to a CSV file.

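from urllib.request import urlopen
from bs4 import BeautifulSoup

# url must be set to the address of the page being scraped
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")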

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.body.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("       "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

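The code you have simply extracts all of the text from the given URL. This loses any structure, which makes it hard to determine where the text you want starts and ends.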
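If a plain dump of the text is all that is needed, each cleaned line could still be written as its own row in a one-column CSV file. The following is only a minimal sketch under a few assumptions: `sample_text` stands in for the result of `soup.body.get_text()`, the helper names `text_to_rows` and `write_rows` and the `Text` column header are made up for illustration, and an in-memory buffer is used in place of a real file.

```python
import csv
import io


def text_to_rows(text):
    """Split scraped text into non-empty, stripped lines, one row per line."""
    lines = (line.strip() for line in text.splitlines())
    return [[line] for line in lines if line]


def write_rows(rows, f):
    """Write single-column rows to an open text file object."""
    writer = csv.writer(f)
    writer.writerow(["Text"])  # hypothetical column name
    writer.writerows(rows)


# Hypothetical sample standing in for soup.body.get_text()
sample_text = "  First headline  \n\n  A summary, with a comma  \n"
rows = text_to_rows(sample_text)

# Stands in for open("text.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
write_rows(rows, buffer)
csv_output = buffer.getvalue()
```

Every line ends up in the same column, so the result is no easier to work with than the printed text. That is why picking out specific elements is usually the better approach.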

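On the page given, for example, all of the headlines can be extracted by looking at the HTML source and noticing that each of the 5 stories has a unique HTML ID. Using these IDs, you can locate each headline with soup.find() and extract the text from it. The summary is then the first paragraph following each headline. You now have a headline and a summary for each article, which can be written to a CSV file. The following was tested using Python 3.5.2: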

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

html = urlopen("http://www.thestar.com.my/news/nation/")
soup = BeautifulSoup(html, "html.parser")

# IDs found by looking at the HTML source in a browser
ids = [
    "slcontent3_3_ileft_0_hlFirstStory", 
    "slcontent3_3_ileft_0_hlSecondStory",
    "slcontent3_3_ileft_0_lvStoriesRight_ctrl0_hlStoryRight",
    "slcontent3_3_ileft_0_lvStoriesRight_ctrl1_hlStoryRight",
    "slcontent3_3_ileft_0_lvStoriesRight_ctrl2_hlStoryRight"]

with open("news.csv", "w", newline="", encoding='utf-8') as f_news:
    csv_news = csv.writer(f_news)
    csv_news.writerow(["Headline", "Summary"])

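    for story_id in ids:
        # Each headline is an <a> tag carrying one of the unique IDs
        headline = soup.find("a", id=story_id)
        # The summary is the first <p> that follows the headline link
        summary = headline.find_next("p")
        csv_news.writerow([headline.text, summary.text])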

This would give you a CSV file looking like:

Headline,Summary
Many say convicted serial rapist Selva still considered ‘a person of high risk’,PETALING JAYA: Convicted serial rapist Selva Kumar Subbiah will be back in the country from Canada in three days and a policeman who knows him says there is no guarantee that he will not strike again.
Liow: Way too many road accidents,"PETALING JAYA: Road accidents took the lives of 7,152 and incurred a loss of about RM9.2bil in Malaysia last year, says Datuk Seri Liow Tiong Lai."
Ex-civil servant wins RM27.4mil jackpot,PETALING JAYA: It was the ang pow of his life.
"Despite latest regulation, many still puff away openly at parks and R&R","KUALA LUMPUR: It was another cloudy afternoon when office workers hung out at the popular KLCC park, puffing away at the end of lunch hour, oblivious to the smoking ban there."
Police warn groups not to cause disturbances on Thaipusam,GEORGE TOWN: Police have warned supporters of the golden and silver chariots against provoking each other during the Thaipusam celebration next week.
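
Note that some fields in the output above are wrapped in double quotes. The csv module does this automatically for any field containing a comma, and undoes it again when the file is read back. A small sketch of that round trip follows, using made-up rows and an in-memory buffer in place of news.csv (both are assumptions for illustration, not data from the page):

```python
import csv
import io

# Hypothetical rows in the same Headline/Summary shape as the output above
rows = [
    ["Headline", "Summary"],
    ["Plain headline", "Summary with a number like 7,152"],
    ["Headline, with a comma", "Plain summary"],
]

buffer = io.StringIO()  # stands in for the news.csv file
csv.writer(buffer).writerows(rows)

# Fields containing a comma are wrapped in double quotes when written
written_lines = buffer.getvalue().splitlines()

# Reading the data back restores the original, unquoted fields
buffer.seek(0)
stories = list(csv.DictReader(buffer))
```

Here `written_lines[1]` has the summary quoted while the headline is not, and `stories[0]["Summary"]` comes back as the original string with its comma intact, so no manual escaping is needed on either side.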

