How to write non-English text to a file using BeautifulSoup

I am learning web scraping with BeautifulSoup and Python. My first project is to extract certain recipes from cookpad.hu. I was able to extract them successfully, but now I'm having trouble actually writing them to a file (CSV is all I know how to do) because of this error:

Traceback (most recent call last):
  File "cookpad_scrape.py", line 24, in <module>
    f.writerow(about_clean)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

My code is below. I am using Python 2.7.14 on Ubuntu. A pastebin of the webpage is here, but the webpage itself is this.

I'm assuming it can't write the Hungarian letters? I'm sure there is a terribly simple solution I am overlooking.

import requests
from bs4 import BeautifulSoup 
import csv 

'''
Tree of page:
    <div id="recipe main">
        <div id="editor" class="editor">
            <div id="about">
            <section id="ingredients">
            <section id="steps">
'''
#text only: soup.get_text()

page = requests.get('https://cookpad.com/hu/receptek/5040119-parazson-sult-padlizsankrem')
soup = BeautifulSoup(page.text, 'lxml')

f = csv.writer(open('recipes.csv', 'w')) #create and open file in f variable, using 'w' mode
f.writerow(['Recipe 1']) #write top row headings

about = soup.find(id='about')
about_ext = about.p.extract()
about_clean = about_ext.get_text()
f.writerow(about_clean)

ingredients = soup.find(id='ingredients')
ingredients_ext = ingredients.ol.extract()
ingredients_clean = ingredients_ext.find_all(itemprop='ingredients')
#for ingredient in ingredients_clean:

steps = soup.find(id='steps')
steps_p = steps.find_all(itemprop='recipeInstructions')
for step in steps_p:
    extracted = step.p.extract()
    print(extracted.text)
    f.writerow([extracted])

Solution: Run the script with Python 3 instead of Python 2, e.g. python3 my_script.py
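For reference, here is a minimal sketch of what the Python 3 write can look like when the encoding is stated explicitly rather than left to the system locale. The sample text and the temporary output path are stand-ins for illustration, not values from the real scrape.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the scraped Hungarian text.
about_clean = "Parázson sült padlizsánkrém"

# Hypothetical output location; the script above writes to 'recipes.csv'.
path = os.path.join(tempfile.mkdtemp(), "recipes.csv")

# encoding='utf-8' keeps the result independent of the system locale;
# newline='' is what the csv documentation recommends for csv.writer files.
with open(path, "w", encoding="utf-8", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["Recipe 1"])
    writer.writerow([about_clean])

# Read the file back to confirm the accented letters survived.
with open(path, encoding="utf-8", newline="") as fh:
    rows = list(csv.reader(fh))
```

Using a with block also closes the file for you, which the one-line csv.writer(open(...)) form does not.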

New problem: the exported steps look fine, but in the ingredients and about sections every letter is separated by a comma.
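A likely cause, sketched here with a made-up sample string: writerow() expects a sequence of fields, and a plain string is itself a sequence of characters, so each letter becomes its own column. Wrapping the string in a list makes it a single field.

```python
import csv
import io

about_clean = "sült"  # hypothetical stand-in for the scraped text

buf = io.StringIO()  # in-memory file, used here only for the demonstration
writer = csv.writer(buf)

writer.writerow(about_clean)    # a string: one field per character
writer.writerow([about_clean])  # a one-element list: a single field

lines = buf.getvalue().splitlines()
print(lines)
```

This would also explain why the steps came out fine: that loop already passes a list, f.writerow([extracted]).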

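You're running Python 2. On the line flagged in the traceback you're writing out the contents of the about_clean variable, which is a unicode string. The Python 2 csv module converts it with the default 'ascii' codec, which fails on the first accented letter. You need to encode this value yourself, and pass it inside a list so it is written as a single field:

f.writerow([about_clean.encode("utf-8")])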
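To see why the encode step matters, here is a small sketch that reproduces the failure on the single character named in the traceback. It assumes nothing beyond the standard library and behaves the same on Python 2 and Python 3.

```python
text = u"\xe1"  # the character from the traceback: 'á'

# The 'ascii' codec only covers code points 0-127, so this raises.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent the character, here as two bytes.
utf8_bytes = text.encode("utf-8")
```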
