
Python Scrapy: how to remove extra parsed characters

While parsing with Scrapy I got this output:

[u'TARTARINI AUTO SPA (CENTRALINO SELEZIONE PASSANTE)'],"[u'VCBONAZZI\\xa043', u'40013', u'CASTEL MAGGIORE']",[u'0516322411'],[u'info@tartariniauto.it'],[u'CARS (LPG INSTALLERS)'],[u'track.aspx?id=0&url=http://www.tartariniauto.it']

As you can see, there are some extra characters, such as:

u' \\xa043 " ' [ ]

I don't want these. How can I remove them? Also, there are 5 items in this string. I want the string to look like this:

item1, item2, item3, item4, item5

Here is my pipelines.py code:

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join
import re
import json
import csv

class InfobelPipeline(object):
    def __init__(self):
        # Python 2: csv.writer expects a file opened in binary mode
        self.file = csv.writer(open('items.csv', 'wb'))

    def process_item(self, item, spider):
        name = item['name']
        address = item['address']
        phone = item['phone']
        email = item['email']
        category = item['category']
        website = item['website']
        self.file.writerow((name, address, phone, email, category, website))
        # Return the item so that any later pipeline still receives it
        return item

Thanks

The extra characters you're seeing come from unicode strings: the u prefix marks a unicode string, and \xa0 is an escaped non-ASCII character (a non-breaking space). You'll see them a lot if you're scraping the web. Common examples include the copyright symbol © (Unicode code point U+00A9) and the trademark symbol ™ (Unicode code point U+2122).

The quickest way to remove them is to try to encode the string to ASCII and throw away any characters that are not ASCII (which none of them are):

>>> example = u"Xerox ™ printer"
>>> example
u'Xerox \u2122 printer'
>>> example.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 6: ordinal not in range(128)
>>> example.encode('ascii', errors='ignore')
'Xerox  printer'
>>>

As you can see, when you try to encode the symbol to ASCII it raises a UnicodeEncodeError because the character can't be represented in ASCII. However, if you add the errors='ignore' keyword argument, it will simply skip the symbols it can't encode.
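To tie this back to the pipeline in the question, here is a rough sketch of how the same encode trick might be applied to each field before writing the row. It assumes Python 3 (where every str is already unicode), and the helper names to_ascii and clean_field are made up for illustration, not part of Scrapy. It also assumes you want the non-breaking space \xa0 turned into a normal space rather than dropped, and that each field is a list of strings as in the output above. The sample values are adapted from the question.

```python
import csv
import io


def to_ascii(text):
    """Drop every character that has no ASCII representation."""
    # Assumption: a non-breaking space (U+00A0) should become a plain space
    # instead of vanishing, so u"VCBONAZZI\xa043" keeps its word break.
    text = text.replace(u"\xa0", u" ")
    return text.encode("ascii", errors="ignore").decode("ascii")


def clean_field(values):
    """Join the list of strings extracted for one field into one clean string."""
    return u" ".join(to_ascii(value).strip() for value in values)


# Sample data adapted from the output shown in the question.
item = {
    "name": [u"TARTARINI AUTO SPA (CENTRALINO SELEZIONE PASSANTE)"],
    "address": [u"VCBONAZZI\xa043", u"40013", u"CASTEL MAGGIORE"],
    "phone": [u"0516322411"],
}

# One plain string per field, so the CSV row has no [u'...'] wrappers.
row = [clean_field(item[key]) for key in ("name", "address", "phone")]

# Write to an in-memory buffer here; a real pipeline would write to a file.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
print(buffer.getvalue())
```

The brackets and u'' prefixes in the original CSV appear because whole lists were passed to writerow, so joining each list into a single string first is what removes them. In the pipeline itself, the same idea would mean calling something like clean_field(item['name']) for each field inside process_item.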
