'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte while reading XML files

I want to define a function that can be applied to each XML file in a directory, parsing it and collecting the content of its tags into a DataFrame.


import os
import pandas as pd
from bs4 import BeautifulSoup as bs

def func(path, filename):
    data = []  # rows collected across all files in the directory
    for filename in os.listdir(path):
        with open(os.path.join(path, filename)) as file:
            # Read each line in the file; readlines() returns a list of lines
            content = file.readlines()
            # Combine the lines in the list into a string
            content = "".join(content)
            bs_content = bs(content, "lxml")

            headline = bs_content.find_all("headline")
            eventtitle = bs_content.find_all("eventtitle")
            city = bs_content.find_all("city")
            companyname = bs_content.find_all("companyname")
            companyticker = bs_content.find_all("companyticker")
            startdate = bs_content.find_all("startdate")
            eventstory = bs_content.find_all("eventstory")

            # Build one row per <companyname> element found in this file
            for i in range(len(companyname)):
                rows = [companyname[i].get_text(), headline[i].get_text(),
                        city[i].get_text(), eventtitle[i].get_text(),
                        companyticker[i].get_text(), startdate[i].get_text(),
                        eventstory[i].get_text()]
                data.append(rows)

    # The columns hold text, so no numeric dtype is forced here
    df = pd.DataFrame(data, columns=['companyname', 'headline',
                                     'city', 'eventtitle', 'companyticker',
                                     'startdate', 'eventstory'])
    return df

When I call the function, I get this error. Unfortunately, none of the existing solutions worked for me.

func('./Calls/', '1000015_T.xml')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Input In [58], in <module>
----> 1 func('./Calls/', '1000015_T.xml')

Input In [57], in func(path, filename)
      7 for filename in os.listdir(path):
      8     with open(os.path.join(path, filename)) as file:
      9     # Read each line in the file, readlines() returns a list of lines
---> 10         content = file.readlines()
     11     # Combine the lines in the list into a string
     12         content = "".join(content)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

Maybe you can also help me optimize the code. My task is to extract the content of 2k XML files, and so far I have decided to define a function and then use pandarallel: parallel_apply(func)

The input file is not UTF-8; it is probably in some other code page. In UTF-8, byte 0x80 is only valid as a continuation byte, so it can never start a character.
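To make the error concrete, here is a minimal sketch that reproduces it on a made-up byte string. The sample bytes and the choice of cp1252 are assumptions for illustration only; nothing in the question confirms which code page the real files use.

```python
# Hypothetical sample: a byte string containing 0x80, like the file in the question.
raw = b"price: \x80 100"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x80 is a continuation byte in UTF-8, so it cannot start a character.
    utf8_error = str(exc)
    print(utf8_error)

# Assumption: cp1252 is just one candidate code page; there 0x80 is the euro sign.
as_cp1252 = raw.decode("cp1252")
print(as_cp1252)
```

The same bytes decode without complaint under several single-byte code pages, so a clean decode alone does not prove that the guess is right.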

Determine the correct encoding and change your program accordingly, for example by passing it as the encoding argument to open().
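One way to go about this, sketched with only the standard library: read the file as raw bytes, look at what surrounds the offset from the error message, and try a short list of candidate encodings. The helper names `read_text` and `show_bytes_near` and the candidate list are hypothetical; they are not part of the question's code or of any library.

```python
from pathlib import Path

# Assumption: candidate encodings to try, in order. Adjust to your data.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def read_text(path, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


def show_bytes_near(path, position, width=20):
    """Return the raw bytes around the offset reported in the error."""
    raw = Path(path).read_bytes()
    return raw[max(0, position - width): position + width]
```

For the file in the question, something like `show_bytes_near('./Calls/1000015_T.xml', 3131)` would show the bytes around the failing position, which often makes the real encoding easy to recognise. Once it is known, it can be passed directly, as in `open(os.path.join(path, filename), encoding="cp1252")`. If the XML files declare their encoding in the `<?xml ... ?>` header, opening them in binary mode and handing the bytes to the parser may also work, since XML parsers generally read that declaration.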
