Python Scrapy: how to remove extra characters from parsed output

Question

While parsing a page with Scrapy I got this output:

[u'TARTARINI AUTO SPA (CENTRALINO SELEZIONE PASSANTE)'],"[u'VCBONAZZI\\xa043', u'40013', u'CASTEL MAGGIORE']",[u'0516322411'],[u'info@tartariniauto.it'],[u'CARS (LPG INSTALLERS)'],[u'track.aspx?id=0&url=http://www.tartariniauto.it']

As you can see, there are some extra characters, such as:

u' \\xa043 " ' [ ]

I don't want these. How can I remove them? Also, there are 5 items in this string, and I want the string to look like this:

item1, item2, item3, item4, item5

Here is my pipelines.py code:

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join
import re
import json
import csv

class InfobelPipeline(object):
    def __init__(self):
      self.file = csv.writer(open('items.csv','wb'))
    def process_item(self, item, spider):
      name = item['name']
      address = item['address']
      phone = item['phone']
      email = item['email']
      category = item['category']
      website = item['website']
      self.file.writerow((name,address,phone,email,category,website))
      return item

Thanks

Answer 1

The extra characters you're seeing come from unicode strings: the u'...' prefix marks a unicode string, and \xa0 is a non-ASCII character (the non-breaking space, U+00A0). You'll see them a lot if you're scraping the web. Common examples include the copyright symbol © (unicode code point U+00A9) and the trademark symbol ™ (unicode code point U+2122).

The quickest way to remove them is to encode the string to ASCII and discard any characters that are not ASCII (which none of these are):

>>> example = u"Xerox ™ printer"
>>> example
u'Xerox \u2122 printer'
>>> example.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 6: ordinal not in range(128)
>>> example.encode('ascii', errors='ignore')
'Xerox  printer'
>>>

As you can see, when you try to encode the symbol to ASCII it raises a UnicodeEncodeError, because the character can't be represented in ASCII. However, if you add the errors='ignore' keyword argument, encode simply drops the symbols it can't encode.
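Applied to the item fields from the question, the same idea might look like the sketch below. It is only an illustration under some assumptions: it is written for Python 3 (where every str is unicode and encode returns bytes, hence the decode back), the helper name clean_field is hypothetical, and replacing the non-breaking space with an ordinary space before encoding is a choice made here so that "VCBONAZZI\xa043" does not collapse into "VCBONAZZI43". Joining each list into one string is also what gets rid of the [ ], u' and quote characters, since those come from writing a whole list into a CSV cell.

```python
def clean_field(values):
    """Join a list of extracted strings into one plain ASCII string.

    Hypothetical helper, not part of Scrapy.
    """
    # Accept a single string as well as a list of strings.
    if isinstance(values, str):
        values = [values]
    # Turn the non-breaking space (U+00A0) into a normal space first,
    # so the words on either side of it stay separated.
    joined = " ".join(v.replace("\u00a0", " ").strip() for v in values)
    # Drop whatever is still not ASCII, as in the answer above.
    return joined.encode("ascii", errors="ignore").decode("ascii")


# Sample values copied from the output in the question.
item = {
    "name": ["TARTARINI AUTO SPA (CENTRALINO SELEZIONE PASSANTE)"],
    "address": ["VCBONAZZI\u00a043", "40013", "CASTEL MAGGIORE"],
    "phone": ["0516322411"],
}

row = [clean_field(item[key]) for key in ("name", "address", "phone")]
print(row[1])  # VCBONAZZI 43 40013 CASTEL MAGGIORE
```

In process_item, each field could be passed through a helper like this before the tuple is handed to writerow, so every CSV cell holds a plain string instead of the repr of a list.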

Solution 1
5 Accepted 2012-05-01 17:22:41