UnicodeDecodeError when reading data from DBF database

Question

I need to write a script that connects an ERP program to a manufacturing program. The manufacturing side is straightforward: I send it data via HTTP requests. The ERP side is harder, because its data must be read from a DBF file.

I use the dbf library because (if I'm not mistaken) it's the only one that lets me filter data in a fairly simple and fast way. I open the database this way:

import dbf

table = dbf.Table(path).open()
dbf_index = dbf.pql(table, "select * where ident == 'M'")

I then loop through each record that the query returned. I need to package the selected data from the DBF database into JSON and send it to the manufacturing program's API.

data = {
    "warehouse_id" : parseDbfData(record['SYMBOL']),
    "code" : parseDbfData(record['SYMBOL']),
    "name" : parseDbfData(record['NAZWA']),
    "main_warehouse" : False,
    "blocked" : False
}

The parseDbfData function looks like this. It is not the cause of the problem, because the script failed in the same way before I added it; I wrote it while trying to fix the error.

def parseDbfData(data):
    return str(data.strip())

When the script runs and encounters any non-ASCII character in the DBF database (e.g. Polish characters such as ą, ę, ś, ć), it terminates with this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 15: ordinal not in range(128)

The error points to this line (where the JSON is built):

"name" : parseDbfData(record['NAZWA']),

The value the script is trying to read at this point is probably "Magazyn materiałów Podgórna". As you can see, this value contains the characters "ł" and "ó". I think this is what breaks the script, but I don't know how to fix it.

I'll mention that I'm using Python 3.9. I know that there were character encoding issues in Python 2.x, but I thought that Python 3.x had remedied that problem. I found out it hadn't :(

Answer 1

I came to the conclusion that I have to specify the encoding directly when reading the DBF database. However, I could not work out from the documentation exactly how to do this.

After reading through the dbf module itself, I found that I need to pass the codepage parameter when opening the database. After some trial and error I determined that, of all the encodings available in the module, cp852 suits my data best.

After the correction, the code to open a DBF database looks like this:

table = dbf.Table(path, codepage='cp852').open()
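
The trial and error mentioned above can be sketched without the dbf module at all. The snippet below is only an illustration: the sample bytes are hand-built to match the name from the question (assuming the field really is stored as cp852), and the shortlist of candidate code pages is a hypothetical one for Polish text, not something taken from the dbf documentation.

```python
# Assumed sample: the NAZWA value from the question as cp852 bytes.
raw = b"Magazyn materia\x88\xa2w Podg\xa2rna"

# Hypothetical shortlist of code pages often used for Polish text.
candidates = ["cp852", "cp1250", "iso8859_2"]


def decode_with(raw_bytes, encodings):
    """Return {encoding: decoded text, or None if decoding fails}."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw_bytes.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


results = decode_with(raw, candidates)
for enc, text in results.items():
    # Inspect the output by eye: only the right code page gives readable Polish.
    print(f"{enc:10} {text!r}")
```

The code page whose output reads as correct Polish is the one to pass as codepage when opening the table.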

Answer 2

Python 3 did fix the Unicode/bytes issue, but only for Python itself. The DBF format stores the code page that should be used inside the .dbf file itself. Frequently that value is not set, in which case an ascii codec is used.
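
The fallback to ascii can be reproduced without any DBF file. This is a minimal sketch that assumes the field bytes were written as cp852 (as Answer 1 concluded); the bytes are built by hand here rather than read from the real table.

```python
# Assumed raw field content: the name from the question encoded as cp852.
expected = "Magazyn materiałów Podgórna"
raw = expected.encode("cp852")

failure = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    # Any byte above 0x7F is outside ASCII, so decoding stops at the first one.
    failure = (hex(raw[exc.start]), exc.start)
    print(exc)

# With the matching code page the same bytes decode cleanly.
decoded = raw.decode("cp852")
print(decoded)
```

The offending byte and position match the traceback in the question, which suggests the "ł" in "materiałów" is the character that triggers the error.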

To fix the dbf files (which may mess up the other programs using them, so test carefully):

table.open()
table.codepage = dbf.CodePage('cp852')
table.close()
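
To check whether a file has a code page recorded before changing anything, the header can be inspected by hand. This sketch assumes the common dBASE/FoxPro layout, in which byte 29 of the 32-byte header holds the language driver ID; the mapping below is deliberately partial, and read_language_driver is a hypothetical helper, not part of the dbf module. The demo runs on a fake header rather than a real table.

```python
import os
import tempfile

# Partial mapping of language driver IDs to Python codec names (assumed layout).
LANGUAGE_DRIVERS = {
    0x00: None,      # no code page recorded in the file
    0x01: "cp437",
    0x02: "cp850",
    0x03: "cp1252",
    0x64: "cp852",
    0xC8: "cp1250",
}


def read_language_driver(path):
    """Return the language driver byte (offset 29) of a DBF header."""
    with open(path, "rb") as fh:
        header = fh.read(32)
    if len(header) < 32:
        raise ValueError("file too short to contain a DBF header")
    return header[29]


# Demo on a fake 32-byte header (not a valid table) marked as cp852.
fake_header = bytearray(32)
fake_header[29] = 0x64
with tempfile.NamedTemporaryFile(suffix=".dbf", delete=False) as tmp:
    tmp.write(fake_header)

driver = read_language_driver(tmp.name)
os.remove(tmp.name)
print(hex(driver), LANGUAGE_DRIVERS.get(driver, "unknown"))
```

A value of 0x00 would mean no code page is recorded, which is the situation described above.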

2 answers

solution1
2 ACCEPTED 2021-01-29 14:34:04

solution2
2 2021-01-29 16:40:35