Scrapy Python spider: Storing results in Latin-1, not in unicode

Question

Currently my spider fetches the results I need, but it stores them as unicode (UTF-8, I believe). When I save these results to a CSV, I have a ton of cleaning to do, with all the [u' and other characters that Scrapy inserts.

How exactly would I store the results as Latin-1 characters, and not unicode? Where exactly would I need to make the change?

Thanks. -TM

Answer 1

The item_extracted value is of type unicode. You can encode it to Latin-1 where it is extracted (in the parse function), in an item pipeline, or in an output processor.

The easiest way is to add this line to your parse function:

item_to_be_stored = item_extracted.encode('latin-1','ignore')
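To make the effect of the 'ignore' argument concrete, here is a minimal sketch that runs without Scrapy. The sample value is made up for illustration; in a spider it would come from a selector. It compares strict encoding with the 'ignore' and 'replace' error handlers on a string that contains characters outside Latin-1.

```python
# Hypothetical sample value: e-acute exists in Latin-1,
# but the en dash (U+2013) and the euro sign (U+20AC) do not.
item_extracted = u'Caf\xe9 \u2013 \u20ac5'

# Strict encoding (the default) raises as soon as a character cannot be mapped.
try:
    item_extracted.encode('latin-1')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' silently drops unmappable characters.
ignored = item_extracted.encode('latin-1', 'ignore')

# 'replace' substitutes a question mark, so the loss stays visible.
replaced = item_extracted.encode('latin-1', 'replace')
```

Note that 'ignore' loses data without any trace, so 'replace' may be the safer choice if you want to spot affected rows in the CSV later.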

Or you could define a function and use it as an output processor in your item class:

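from scrapy.item import Item, Field
from scrapy.utils.python import unicode_to_str

def u_to_str(text):
    # Return the encoded value; without `return` the result would always be None.
    return unicode_to_str(text, 'latin-1', 'ignore')

class YourItem(Item):
    # Pass the function itself, not the result of calling it.
    name = Field(output_processor=u_to_str)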
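The item pipeline option mentioned above could look roughly like the following sketch. The class name Latin1Pipeline is hypothetical, and the code assumes only that a pipeline is a plain class with a process_item(item, spider) method and that the item behaves like a dict. It also encodes the elements of list values, since a selector's extract() returns a list and the [u' in the CSV is the printed form of such a list.

```python
class Latin1Pipeline(object):
    """Hypothetical pipeline: encode every text field to Latin-1 bytes."""

    encoding = 'latin-1'

    def _encode(self, value):
        # type(u'') is `unicode` on Python 2 and `str` on Python 3.
        if isinstance(value, type(u'')):
            return value.encode(self.encoding, 'ignore')
        # Encode each element of a list or tuple of extracted values.
        if isinstance(value, (list, tuple)):
            return [self._encode(v) for v in value]
        # Leave numbers and other non-text values untouched.
        return value

    def process_item(self, item, spider):
        for key in list(item.keys()):
            item[key] = self._encode(item[key])
        return item


# Usage with a plain dict standing in for a scraped item.
sample = {'name': u'Caf\xe9', 'tags': [u'na\xefve', u'\u20ac'], 'price': 5}
result = Latin1Pipeline().process_item(sample, spider=None)
```

To take effect in a real project, the class would also have to be listed in the ITEM_PIPELINES setting.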

Answer 2

If your problem is what you say it is, the solution is as simple as casting to a string.

>>> a = u'spam and eggs'
>>> a
u'spam and eggs'
>>> type(a)
<type 'unicode'>
>>> b = str(a)
>>> b
'spam and eggs'
>>> type(b)
<type 'str'>

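EDIT: str() encodes with the default ASCII codec, so it raises a UnicodeError on non-ASCII characters. It might be a good idea to wrap this in a try/except:

try:
    b = str(a)
except UnicodeError:
    # %r prints the escaped repr, so the message itself cannot fail to encode.
    print "Skipping string %r" % a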
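The same skip-on-failure idea can be wrapped in a small helper with an explicit target encoding instead of the implicit ASCII used by str(). This is only a sketch; the name encode_or_skip is made up for illustration.

```python
def encode_or_skip(text, encoding='latin-1'):
    """Return text encoded strictly, or None if any character cannot be mapped."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


plain = encode_or_skip(u'spam and eggs')       # pure ASCII, always encodable
accented = encode_or_skip(u'caf\xe9')          # e-acute exists in Latin-1
ascii_only = encode_or_skip(u'caf\xe9', 'ascii')  # but not in ASCII
euro = encode_or_skip(u'\u20ac')               # euro sign is not in Latin-1
```

A caller can then test for None and skip or log the value, rather than letting the whole crawl stop on one bad string.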
