
Scrapy Python spider: Storing results in Latin-1, not in unicode

Currently my spider fetches the results I need, but it returns them as unicode (UTF-8, I believe). When I save these results to a CSV, I have a ton of cleaning to do, with all the [u' and other characters that Scrapy inserts.

How exactly would I store the results as Latin-1 characters and not unicode? Where exactly would I need to make the change?

Thanks. -TM

The item_extracted is of type unicode. You can encode it to Latin-1 where it's extracted (in the parse function), in an item pipeline, or in an output processor.

The easiest way is to add this line to your parse function:

item_to_be_stored = item_extracted.encode('latin-1','ignore')
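
To see what the 'ignore' argument does, here is a small sketch using a made-up sample value (not taken from the original spider). Characters that Latin-1 cannot represent are silently dropped with 'ignore', while 'replace' substitutes a question mark, which may be easier to spot later in the CSV.

```python
# Hypothetical sample value: u'\xe9' (e with acute accent) exists in Latin-1,
# u'\u2013' (en dash) does not.
item_extracted = u'caf\xe9 \u2013 menu'

# 'ignore' drops the en dash entirely, leaving two adjacent spaces.
ignored = item_extracted.encode('latin-1', 'ignore')

# 'replace' puts a '?' where the en dash was.
replaced = item_extracted.encode('latin-1', 'replace')
```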

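Or you could define a function and use it as an output processor on your item class (output processors run when the item is filled through an item loader).

from scrapy.item import Item, Field
# unicode_to_str ships with older, Python 2 era Scrapy releases.
from scrapy.utils.python import unicode_to_str

def u_to_str(text):
    # Return the encoded value; without the return the field would end up as None.
    return unicode_to_str(text, 'latin-1', 'ignore')

class YourItem(Item):
    # Pass the function itself, not the result of calling it.
    name = Field(output_processor=u_to_str)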
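
The item pipeline route mentioned at the start of this answer could look roughly like the sketch below. The class name Latin1Pipeline is hypothetical, and the sketch assumes each text field holds a single string rather than a list. It would be enabled through the ITEM_PIPELINES setting of the project.

```python
class Latin1Pipeline(object):
    """Hypothetical item pipeline: encode every text field as Latin-1 bytes."""

    def process_item(self, item, spider):
        # Copy the keys first so the item can be modified while looping.
        for key in list(item.keys()):
            value = item[key]
            # Only text values are encoded; bytes, numbers and lists are left alone.
            if hasattr(value, 'encode') and not isinstance(value, bytes):
                item[key] = value.encode('latin-1', 'ignore')
        return item
```

Doing the conversion in a pipeline keeps the parse function free of encoding details and applies the same rule to every field of every item.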

If your problem is what you say it is, the solution is as simple as casting to a string. In Python 2 this works as long as the text contains only ASCII characters.

>>> a = u'spam and eggs'
>>> a
u'spam and eggs'
>>> type(a)
<type 'unicode'>
>>> b = str(a)
>>> b
'spam and eggs'
>>> type(b)
<type 'str'>

EDIT: Since str() raises an exception on non-ASCII characters, it is a good idea to wrap this in a try/except:

try:
    b = str(a)
except UnicodeError:
    # %r prints an escaped form, so the message itself cannot fail to encode.
    print "Skipping string %r" % a
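
In Python 2, str(a) on a unicode value encodes with the ASCII codec, which is why accented characters trigger the exception. A sketch of the same try/except idea with an explicit codec follows; the helper name safe_encode is hypothetical, and returning None for unencodable text is just one possible policy.

```python
def safe_encode(text, encoding='latin-1'):
    """Hypothetical helper: return encoded bytes, or None if text cannot be encoded."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # The caller decides whether to skip, log, or clean the value.
        return None
```

Calling safe_encode(value) keeps accented Latin-1 characters, while safe_encode(value, 'ascii') mirrors the stricter behaviour of str() in Python 2.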
