
UnicodeDecodeError when reading data from DBF database

I need to write a script that connects an ERP program to a manufacturing program. The manufacturing side is straightforward: I send it data via HTTP requests. The ERP side is harder, because its data must be read from a DBF file.

I use the dbf library because (if I'm not mistaken) it's the only one that provides the ability to filter data in a fairly simple and fast way. I open the database this way:

import dbf

# path points to the ERP program's .dbf file
table = dbf.Table(path).open()
dbf_index = dbf.pql(table, "select * where ident == 'M'")

I then loop through each record that the query returned. I need to package the selected data from the DBF database into JSON and send it to the manufacturing program's API.

data = {
    "warehouse_id" : parseDbfData(record['SYMBOL']),
    "code" : parseDbfData(record['SYMBOL']),
    "name" : parseDbfData(record['NAZWA']),
    "main_warehouse" : False,
    "blocked" : False
}
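
For the JSON step itself, here is a minimal sketch using only the standard library. The record is a plain dict standing in for a DBF record whose text has already been decoded into Python strings, and the sample values are invented for illustration. It suggests that once the values are proper str objects, Polish characters serialize without trouble.

```python
import json

# Stand-in for a decoded DBF record (hypothetical sample values).
record = {"SYMBOL": " MAT01 ", "NAZWA": " Magazyn materiałów Podgórna "}

data = {
    "warehouse_id": record["SYMBOL"].strip(),
    "code": record["SYMBOL"].strip(),
    "name": record["NAZWA"].strip(),
    "main_warehouse": False,
    "blocked": False,
}

# ensure_ascii=False keeps "ł" and "ó" readable instead of \uXXXX escapes;
# encoding to UTF-8 yields bytes suitable for an HTTP request body.
body = json.dumps(data, ensure_ascii=False).encode("utf-8")
print(body.decode("utf-8"))
```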

The parseDbfData function looks like this. It isn't the cause of the problem, because the script failed the same way without it; I added it while trying to fix the problem.

def parseDbfData(data):
    return str(data.strip())

When run, if the function encounters any non-ASCII character from the DBF database (e.g. Polish characters such as ą, ę, ś, ć), the script terminates with an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 15: ordinal not in range(128)

The error points to this line (in the code that builds the JSON):

"name" : parseDbfData(record['NAZWA']),

The value the script is trying to read at this point is probably "Magazyn materiałów Podgórna". As you can see, this value contains the characters "ł" and "ó". I think this is what breaks the script, but I don't know how to fix it.
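
The failure can be reproduced without any DBF file. The sketch below assumes the field is stored in code page 852 (an assumption at this point; the raw bytes are reconstructed by encoding the expected text) and checks where the ASCII codec gives up.

```python
# Assumption: the NAZWA field holds this text encoded in cp852.
raw = "Magazyn materiałów Podgórna".encode("cp852")

try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte the codec could not decode.
    failing_position = exc.start
    failing_byte = raw[exc.start]
    print(f"ascii fails on byte {failing_byte:#04x} at position {failing_position}")

# Decoding with the code page the data was written in recovers the text.
decoded = raw.decode("cp852")
print(decoded)
```

Under that assumption the failing byte and position match the ones in the traceback, which points at the "ł" in "materiałów".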

I'll mention that I'm using Python 3.9. I know that there were character encoding issues in Python 2.x, but I thought Python 3.x had remedied that problem. I found out it hadn't :(

I came to the conclusion that I have to specify the encoding directly when reading the DBF database. However, I could not work out from the documentation exactly how to do this.

After a thorough analysis of the dbf module itself, I came to the conclusion that I need to use the codepage parameter when opening the database. After some experimenting, I determined that of all the encodings available in the module, cp852 suits me best.
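
One way to run that kind of comparison is to decode the same raw bytes with several candidate code pages and see which result reads as correct Polish. The sketch below uses only Python's built-in codecs; the candidate list and the sample bytes are assumptions chosen for illustration, not something taken from the dbf module.

```python
# Sample bytes: assumed to be what the DBF field contains.
raw = b"Magazyn materia\x88\xa2w Podg\xa2rna"

# Hypothetical shortlist of code pages to try.
candidates = ["cp852", "cp1250", "iso8859_2", "cp437"]

results = {}
for name in candidates:
    # errors="replace" keeps the loop going if a byte is undefined in a code page.
    results[name] = raw.decode(name, errors="replace")
    print(f"{name:10} {results[name]}")
```

Only the candidate that matches the encoding the ERP program actually wrote will show both "ł" and "ó" correctly; the others produce substituted or wrong characters.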

After the correction, the code to open a DBF database looks like this:

table = dbf.Table(path, codepage='cp852').open()

Python 3 did fix the unicode/bytes issue, but only for Python itself. The dbf format has a place inside the .dbf file to record which code page should be used; programs frequently leave it unset, and the ascii codec is then used.

To fix the dbf files (which may mess up the other programs using them, so test carefully):

table.open()
table.codepage = dbf.CodePage('cp852')
table.close()
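
To check whether a given file has a code page recorded at all, the header can be inspected directly. In the dBase/FoxPro family of formats, byte 29 of the file header holds the language driver ID (code page mark); 0x00 means none was set, and 0x64 is the mark for code page 852. The following is a minimal sketch using only the standard library. The helper name and the fake headers are made up for illustration.

```python
def read_codepage_mark(header: bytes) -> int:
    """Return the language driver ID stored at offset 29 of a DBF header."""
    if len(header) < 32:
        raise ValueError("a DBF header is at least 32 bytes long")
    return header[29]


# Fake 32-byte headers for illustration only (all other fields zeroed).
unset_header = bytes(32)
cp852_header = bytes(29) + b"\x64" + bytes(2)

print(hex(read_codepage_mark(unset_header)))
print(hex(read_codepage_mark(cp852_header)))
```

With a real file, the first 32 bytes read in binary mode would be passed to the helper instead of the fake headers.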
