How to export data from a beautifulsoup scrape to a csv file

I found this code online, and was wondering how to export the data collected to a csv file.

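import urllib                    # Python 2; in Python 3 use urllib.request.urlopen
from bs4 import BeautifulSoup

# `url` is assumed to hold the address of the page to scrape
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")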

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.body.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("       "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

The code you have simply extracts all text from the given URL. This loses any structure, making it very difficult to determine where the text you want starts and ends.
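If the flat text is all you need, it can still be exported as it is. The following is only a minimal sketch (Python 3), not a tested solution for your page: the sample `text` is made up and stands in for the `text` variable your code produces, and `text_to_csv` is a hypothetical helper name. It writes each non-blank line as a one-column row:

```python
import csv
import io

# Made-up stand-in for the `text` variable produced by the question's code
text = "Top stories\nFirst headline\n\nSome summary, with a comma"


def text_to_csv(text, f_out):
    """Write each non-blank line of `text` as a one-column CSV row."""
    writer = csv.writer(f_out)
    writer.writerow(["Line"])
    for line in text.splitlines():
        line = line.strip()
        if line:
            writer.writerow([line])


# An in-memory buffer keeps the sketch self-contained
buffer = io.StringIO()
text_to_csv(text, buffer)
print(buffer.getvalue())

# To write a real file instead:
# with open("text.csv", "w", newline="", encoding="utf-8") as f_out:
#     text_to_csv(text, f_out)
```

The `csv` module quotes any line that contains a comma, so each line stays in a single column. You still cannot tell headlines from summaries this way, which is why extracting specific elements is the better approach.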

On the page you have given, you could, for example, extract all of the headlines. Looking at the HTML source shows that the 5 stories each have a unique HTML id. With these ids you can use soup.find() to locate each headline link and extract its text, and find_next("p") to get the summary paragraph that follows it. Now you have a headline and a summary for each article, which can then be written to a CSV file. The following has been tested using Python 3.5.2:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

html = urlopen("http://www.thestar.com.my/news/nation/")
soup = BeautifulSoup(html, "html.parser")

# IDs found by looking at the HTML source in a browser
ids = [
    "slcontent3_3_ileft_0_hlFirstStory", 
    "slcontent3_3_ileft_0_hlSecondStory",
    "slcontent3_3_ileft_0_lvStoriesRight_ctrl0_hlStoryRight",
    "slcontent3_3_ileft_0_lvStoriesRight_ctrl1_hlStoryRight",
    "slcontent3_3_ileft_0_lvStoriesRight_ctrl2_hlStoryRight"]

with open("news.csv", "w", newline="", encoding='utf-8') as f_news:
    csv_news = csv.writer(f_news)
    csv_news.writerow(["Headline", "Summary"])

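    for story_id in ids:
        headline = soup.find("a", id=story_id)
        if headline is None:
            continue    # the page layout may have changed; skip missing IDs
        summary = headline.find_next("p")
        csv_news.writerow([headline.text, summary.text if summary else ""])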

This would give you a CSV file like the following:

Headline,Summary
Many say convicted serial rapist Selva still considered ‘a person of high risk’,PETALING JAYA: Convicted serial rapist Selva Kumar Subbiah will be back in the country from Canada in three days and a policeman who knows him says there is no guarantee that he will not strike again.
Liow: Way too many road accidents,"PETALING JAYA: Road accidents took the lives of 7,152 and incurred a loss of about RM9.2bil in Malaysia last year, says Datuk Seri Liow Tiong Lai."
Ex-civil servant wins RM27.4mil jackpot,PETALING JAYA: It was the ang pow of his life.
"Despite latest regulation, many still puff away openly at parks and R&R","KUALA LUMPUR: It was another cloudy afternoon when office workers hung out at the popular KLCC park, puffing away at the end of lunch hour, oblivious to the smoking ban there."
Police warn groups not to cause disturbances on Thaipusam,GEORGE TOWN: Police have warned supporters of the golden and silver chariots against provoking each other during the Thaipusam celebration next week.
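Hard-coded ids break as soon as the site changes its markup or the number of stories. One possible variation, shown here only as a sketch, is to match the ids by pattern instead. It assumes that every headline link has "Story" somewhere in its id, as the five ids listed earlier do. The HTML below is made up so the example runs without network access, and `extract_rows` and `STORY_ID` are hypothetical names:

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# Made-up markup that mimics the headline/summary layout; not the real page
SAMPLE_HTML = """
<div>
  <h2><a id="demo_hlFirstStory" href="#">First headline</a></h2>
  <p>First summary, with a comma.</p>
</div>
<div>
  <h2><a id="demo_hlSecondStory" href="#">Second headline</a></h2>
  <p>Second summary.</p>
</div>
<a id="demo_footer" href="#">Unrelated link</a>
"""

# Assumption: every headline link has "Story" somewhere in its id
STORY_ID = re.compile(r"Story")


def extract_rows(html):
    """Return [headline, summary] pairs for every link whose id matches STORY_ID."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for headline in soup.find_all("a", id=STORY_ID):
        summary = headline.find_next("p")
        rows.append([
            headline.get_text(strip=True),
            summary.get_text(strip=True) if summary else "",
        ])
    return rows


rows = extract_rows(SAMPLE_HTML)

# In-memory buffer for the demo; for a real file use
# open("news.csv", "w", newline="", encoding="utf-8") as before
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Headline", "Summary"])
writer.writerows(rows)
print(buffer.getvalue())
```

On the real page you would pass the downloaded HTML to `extract_rows` and check the pattern against the current source first, since a pattern that is too loose will pick up unrelated links.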
