Issues while writing special characters to csv file

I am writing the crawled output of a web page to CSV files. However, a few special characters, such as the hyphen, are not written correctly.

Original Text : Amazon Forecast - Now Generally Available

Result in csv : Amazon Forecast – Now Generally Available

I tried the code below:

from bs4 import BeautifulSoup
from datetime import date
import requests
import csv
source = requests.get('https://aws.amazon.com/blogs/aws/').text
soup = BeautifulSoup(source, 'lxml')
# csv_file = open('aitrendsresults.csv', 'w')
csv_file = open('aws_cloud_results.csv', 'w' , encoding = 'utf8' )
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['title','img','src','summary'])
match = soup.find_all('div',class_='lb-row lb-snap')
for n in match:
 imgsrc= n.div.img.get('src')
 titlesrc= n.find('div',{'class':'lb-col lb-mid-18 lb-tiny-24'})
 titletxt= titlesrc.h2.text
 anchortxt= titlesrc.a.get('href')
 sumtxt= titlesrc.section.p.text
 print(sumtxt)
 csv_writer.writerow([titletxt,imgsrc,anchortxt,sumtxt])
csv_file.close()

Can you please help me get the same text as the original shown above?

Create a function that keeps only ASCII characters (e.g. hyphen, semicolon) and drops everything else, then pass each string through it before writing:

def decode_ascii(string):
    return string.encode('ascii', 'ignore').decode('ascii')

input_text = 'Amazon Forecast - Now Generally Available'
output_text = decode_ascii(input_text)
print(output_text)

Output should be Amazon Forecast - Now Generally Available in the CSV.
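
Note that encode('ascii', 'ignore') silently drops any character it cannot represent, so a typographic dash would vanish rather than become a hyphen. Below is a minimal sketch of an alternative, assuming the page title contains an en dash (U+2013) or similar punctuation. The to_ascii_punctuation name and the mapping table are hypothetical and not part of the answer above.

```python
# Hypothetical helper: map common typographic punctuation to ASCII
# equivalents instead of dropping it. The table is illustrative only.
PUNCTUATION_MAP = str.maketrans({
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def decode_ascii(string):
    # Same as the answer above: non-ASCII characters are discarded.
    return string.encode("ascii", "ignore").decode("ascii")


def to_ascii_punctuation(string):
    # Replace known typographic characters; all others stay unchanged.
    return string.translate(PUNCTUATION_MAP)


title = "Amazon Forecast \u2013 Now Generally Available"
print(repr(decode_ascii(title)))
print(repr(to_ascii_punctuation(title)))
```

With an en dash in the input, the first call leaves two adjacent spaces where the dash was, while the second substitutes a plain hyphen. Either approach changes the text, so it only fits cases where losing the original characters is acceptable.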

I've been working with BeautifulSoup as well, and I think you've only made a minor mistake. On line 8, where you open the CSV file, try "UTF-8" as the encoding instead of "utf8". See if that helps.
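
Whether the spelling matters can be checked against Python's codec registry. Here is a quick sketch using the standard-library codecs.lookup. In CPython both names are expected to resolve to the same codec, in which case the spelling alone would not explain the garbled dash.

```python
import codecs

# Python normalizes encoding names, so these spellings are looked up
# in the same registry and report the canonical codec name.
for name in ("utf8", "UTF-8", "utf-8"):
    print(name, "->", codecs.lookup(name).name)

same_codec = codecs.lookup("utf8").name == codecs.lookup("UTF-8").name
print(same_codec)
```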

Using only the title as a test, the following works for me:

from bs4 import BeautifulSoup
import requests, csv

source = requests.get('https://aws.amazon.com/blogs/aws/').text
soup = BeautifulSoup(source, 'lxml')

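# utf-8-sig writes a byte order mark so spreadsheet programs such as Excel detect UTF-8;
# newline='' lets the csv module control line endings.
with open("aws_cloud_results.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ";", quoting=csv.QUOTE_MINIMAL)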
    w.writerow(['title'])
    match = soup.find_all('div',class_='lb-row lb-snap')
    for n in match:
        titlesrc= n.find('div',{'class':'lb-col lb-mid-18 lb-tiny-24'})
        titletxt= titlesrc.h2.text
        w.writerow([titletxt])
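
The likely reason utf-8-sig helps is that it prefixes the file with a byte order mark, which spreadsheet programs such as Excel use to detect UTF-8. Without it, the bytes of a typographic dash may be read as Windows-1252 and shown as three stray characters. Below is a self-contained sketch of that effect, assuming the title contains an en dash (U+2013) and using a temporary file instead of the live page.

```python
import csv
import os
import tempfile

# Assumed sample title with an en dash (U+2013).
title = "Amazon Forecast \u2013 Now Generally Available"

# Write one row with utf-8-sig, as in the answer above.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(path, "w", encoding="utf-8-sig", newline="") as csv_file:
    csv.writer(csv_file).writerow([title])

with open(path, "rb") as raw_file:
    raw = raw_file.read()
os.remove(path)

# The file starts with the UTF-8 byte order mark EF BB BF.
has_bom = raw.startswith(b"\xef\xbb\xbf")
print(has_bom)

# Reading the UTF-8 bytes of the dash as Windows-1252 gives mojibake.
misread = "\u2013".encode("utf-8").decode("cp1252")
print(misread)

# Decoding the file as utf-8-sig strips the mark and restores the title.
restored = raw.decode("utf-8-sig").strip()
print(restored == title)
```

If the CSV is consumed by another program rather than opened in a spreadsheet, plain utf-8 may be the better choice, since some readers treat the byte order mark as part of the first field.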
