
Again: UnicodeEncodeError: ascii codec can't encode

I have a folder of XML files that I would like to parse. I need to get text out of the elements of these files. They will be collected and printed to a CSV file where the elements are listed in columns.

I can actually do this right now for some of my files. That is, for many of my XML files, the process goes fine, and I get the output I want. The code that does this is:

import os, re, csv, string, operator
import xml.etree.cElementTree as ET
import codecs
def parseEO(doc):
    #getting the basic structure
    tree = ET.ElementTree(file=doc)
    root = tree.getroot()
    agencycodes = []
    rins = []
    titles =[]
    elements = [agencycodes, rins, titles]
    #pulling in the text from the fields
    for elem in tree.iter():
        if elem.tag == "AGENCY_CODE":
            agencycodes.append(int(elem.text))
        elif elem.tag == "RIN":
            rins.append(elem.text)
        elif elem.tag == "TITLE":
            titles.append(elem.text)
    with open('parsetest.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerows(zip(*elements))


parseEO('EO_file.xml')     

However, on some versions of the input file, I get the infamous error:

'ascii' codec can't encode character u'\x97' in position 32: ordinal not in range(128)

The full traceback is:

    ---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-15-28d095d44f02> in <module>()
----> 1 execfile(r'/parsingtest.py') # PYTHON-MODE

/Users/ian/Desktop/parsingtest.py in <module>()
     91         writer.writerows(zip(*elements))
     92 
---> 93 parseEO('/EO_file.xml')
     94 
     95 

/parsingtest.py in parseEO(doc)
     89     with open('parsetest.csv', 'w') as f:
     90         writer = csv.writer(f)
---> 91         writer.writerows(zip(*elements))
     92 
     93 parseEO('/EO_file.xml')
UnicodeEncodeError: 'ascii' codec can't encode character u'\x97' in position 32: ordinal not in range(128)

I am fairly confident from reading the other threads that the problem is in the codec being used (and, you know, the error is pretty clear on that as well). However, the solutions I have read haven't helped me (emphasized because I understand I am the source of the problem, not the way people have answered in the past).

Several responses (such as: this one and this one and this one) don't deal directly with ElementTree, and I'm not sure how to translate the solutions into what I'm doing.

Other solutions that do deal with ElementTree (such as: this one and this one) are either using a short string (the first link here) or are using the .tostring/.fromstring methods in ElementTree, which I do not. (Though, of course, perhaps I should be.)

Things I have tried that didn't work:

  1. I have attempted to bring in the file via UTF-8 encoding:

     infile = codecs.open('/EO_file.xml', encoding="utf-8")
     parseEO(infile)

    but I think the ElementTree process already understands it to be UTF-8 (which is noted in the first line of all the XML files I have), and so this is not only not correct, but is actually redundantly bad all over again.

  2. I attempted to declare an encoding process within the loop, replacing:

     tree = ET.ElementTree(file=doc)

    with

     parser = ET.XMLParser(encoding="utf-8")
     tree = ET.parse(doc, parser=parser)

    in the loop above that does work. This didn't work for me either. The same files that worked before still worked, and the same files that created the error still created the error.

There have been a lot of other random attempts, but I won't belabor the point.

So, while I assume the code I have is both inefficient and offensive to good programming style, it does do what I want for several files. I am trying to understand whether there is simply an argument I'm missing, whether I should somehow pre-process the files (I have not identified where the offending character is, but do know that u'\x97' translates to a control character of some kind), or whether there is some other option.
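Since the question mentions not knowing where the offending character is, here is a minimal sketch of one way to locate it. The helper name find_non_ascii and the sample XML are hypothetical, made up for illustration; only the tag names come from the code above.

```python
import xml.etree.ElementTree as ET


def find_non_ascii(root):
    """Return (tag, index, code point) for each non-ASCII character in element text."""
    hits = []
    for elem in root.iter():
        text = elem.text or u""  # elem.text is None for empty elements
        for index, char in enumerate(text):
            if ord(char) > 127:
                hits.append((elem.tag, index, ord(char)))
    return hits


# Hypothetical sample: &#x97; is a character reference for U+0097,
# the character named in the traceback.
sample = "<ROOT><RIN>1234-AB56</RIN><TITLE>Rule &#x97; update</TITLE></ROOT>"
hits = find_non_ascii(ET.fromstring(sample))
```

For a real file, the same helper could be called on ET.parse(path).getroot() instead of the sample string.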

You are parsing XML; the XML API hands you unicode values. You are then attempting to write the unicode data to a CSV file without encoding it first. Python then attempts to encode it for you but fails. You can see this in your traceback: it is the .writerows() call that fails, and the error tells you that encoding is failing, not decoding (parsing the XML).

You need to choose an encoding, then encode your data before writing:
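The failure described above can be reproduced in isolation, without any XML or CSV involved. This is only a sketch: the title text is hypothetical, and the explicit .encode("ascii") call stands in for the implicit conversion that happens when the unicode value is written out.

```python
# Hypothetical title text containing U+0097, the character from the traceback.
title = u"Rule \x97 update"

try:
    title.encode("ascii")  # stand-in for the implicit ASCII conversion
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Index of the first character the ASCII codec could not encode.
failed_at = error.start if error else None
```

The position reported in the error message is the index of the offending character within the string being encoded, which is what failed_at holds here.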

for elem in tree.iter():
    if elem.tag == "AGENCY_CODE":
        agencycodes.append(int(elem.text))
    elif elem.tag == "RIN":
        rins.append(elem.text.encode('utf8'))
    elif elem.tag == "TITLE":
        titles.append(elem.text.encode('utf8'))

I used the UTF-8 encoding because it can handle any Unicode code point, but you need to make your own, explicit choice.
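To illustrate why the choice of encoding matters, here is a small sketch comparing UTF-8 with a narrower codec. The em dash is a hypothetical second example and does not come from the question's files; Latin-1 is just one example of an encoding that cannot represent every code point.

```python
control_char = u"\x97"  # the character from the traceback (U+0097)
em_dash = u"\u2014"     # hypothetical second example, outside Latin-1

# UTF-8 can encode both characters, using multi-byte sequences.
utf8_control = control_char.encode("utf-8")
utf8_dash = em_dash.encode("utf-8")

# Latin-1 covers U+0000 to U+00FF only, so U+0097 fits in one byte...
latin1_control = control_char.encode("latin-1")

# ...but the em dash cannot be encoded and raises the same kind of error.
try:
    em_dash.encode("latin-1")
    latin1_handles_dash = True
except UnicodeEncodeError:
    latin1_handles_dash = False
```

Whatever encoding is chosen, anything that later reads the CSV file has to decode it with the same one.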

It sounds like you have a non-ASCII character somewhere in your XML file. A unicode object is different from a byte string that is encoded as UTF-8.

The Python 2.7 csv library doesn't support unicode objects, so you'll have to run the data through a function that encodes them before you dump them into your CSV file.

def normalize(s):
    # Encode unicode objects to UTF-8 bytes; convert anything else with str().
    if isinstance(s, unicode):
        return s.encode('utf8', 'ignore')
    else:
        return str(s)

so your code would look like this:

for elem in tree.iter():
    if elem.tag == "AGENCY_CODE":
        agencycodes.append(int(elem.text))
    elif elem.tag == "RIN":
        rins.append(normalize(elem.text))
    elif elem.tag == "TITLE":
        titles.append(normalize(elem.text))
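For comparison, here is a rough sketch of the same pipeline under Python 3. That is an assumption, since the question is clearly running Python 2. In Python 3 the csv module works with text directly, so the encoding is chosen once when the output file is opened. The function names collect_rows and write_rows, the sample XML, and the temporary output path are all hypothetical.

```python
import csv
import os
import tempfile
import xml.etree.ElementTree as ET


def collect_rows(xml_text):
    """Collect AGENCY_CODE, RIN and TITLE text into rows of three columns."""
    agencycodes, rins, titles = [], [], []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag == "AGENCY_CODE":
            agencycodes.append(int(elem.text))
        elif elem.tag == "RIN":
            rins.append(elem.text)
        elif elem.tag == "TITLE":
            titles.append(elem.text)
    return list(zip(agencycodes, rins, titles))


def write_rows(path, rows):
    """Write rows as CSV; the file object does the UTF-8 encoding."""
    # newline="" is what the csv documentation asks for on file objects.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)


# Hypothetical sample document with a non-ASCII title.
sample = ("<ROOT><AGENCY_CODE>12</AGENCY_CODE><RIN>1234-AB56</RIN>"
          "<TITLE>Rule \u2014 update</TITLE></ROOT>")
out_path = os.path.join(tempfile.mkdtemp(), "parsetest.csv")
write_rows(out_path, collect_rows(sample))

# Read the file back to check what was written.
with open(out_path, encoding="utf-8", newline="") as f:
    written = f.read()
```

With this approach no per-field .encode() call is needed, because the strings stay as text until the file object encodes them.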
