OEM non-printable characters in Python strings

I'm trying to port some Delphi code that sends data to a Universe database. To make the text legible to the database, we need to encode it in OEM.

In Delphi it is done this way:

    procedure TForm1.GenerarTablasNLS;
    var
      i: integer;
    begin
      for i := 0 to 255 do
      begin
        TablaUV_NLS[i] := AnsiChar(i);
        TablaNLS_UV[i] := AnsiChar(i);   
      end;
      // Nulo final
      TablaUV_NLS[256] := #0;
      TablaNLS_UV[256] := #0;

      OemToCharA(@TablaUV_NLS[1], @TablaUV_NLS[1]);
      CharToOemA(@TablaNLS_UV[1], @TablaNLS_UV[1]);

We then translate our text like this:

    function StringToUniverse(const Value: string): AnsiString;
    var
      p: PChar;
      q: PAnsiChar;
    begin
      SetLength(Result, Length(Value));
      if Value = '' then Exit;

      p := Pointer(Value);
      q := Pointer(Result);
      while p^ <> #0 do
      begin
        q^ := TablaNLS_UV[Ord(AnsiChar(p^))];
        Inc(p);
        Inc(q);
      end;
    end;

I follow the same logic in Python, using a dictionary that stores the translation for each character:


class StringUniverseDict(dict):
    def __missing__(self, key):
        return key

TablaString2UV = StringUniverseDict()

def rellenar_tablas_codificacion():
    TablaString2UV['á'] = ' '       # chr(225) = chr(160)
    TablaString2UV['é'] = '‚'       # chr(233) = chr(130)
    TablaString2UV['í'] = '¡'       # chr(237) = chr(161)
    TablaString2UV['ó'] = '¢'       # chr(243) = chr(162)
    TablaString2UV['ú'] = '£'       # chr(250) = chr(163)
    TablaString2UV['ñ'] = '¤'       # chr(241) = chr(164)
    TablaString2UV['ç'] = '‡'       # chr(231) = chr(135)
    TablaString2UV['Á'] = 'µ'       # chr(193) = chr(181)
    TablaString2UV['É'] = chr(144)  # chr(201) = chr(144)     
    TablaString2UV['Í'] = 'Ö'       # chr(205) = chr(214)
    TablaString2UV['Ó'] = 'à'       # chr(211) = chr(224)
    TablaString2UV['Ñ'] = '¥'       # chr(209) = chr(165)
    TablaString2UV['Ç'] = '€'       # chr(199) = chr(128)
    TablaString2UV['ü'] = chr(129)  # chr(252) = chr(129)     

    TablaString2UV[chr(129)] = '_'  # chr(129) = chr(095)     
    TablaString2UV[chr(141)] = '_'  # chr(141) = chr(095)  
    TablaString2UV['•'] = chr(7)    # chr(149) = chr(007)
    TablaString2UV['Å'] = chr(143)  # chr(197) = chr(143)     
    TablaString2UV['Ø'] = chr(157)  # chr(216) = chr(157)     
    TablaString2UV['ì'] = chr(141)  # chr(236) = chr(141)    

This works "fine" as long as I translate using printable characters. For example, the string

"á é í ó ú ñ ç Á Í Ó Ú Ñ Ç"

is translated, in Delphi, to the following bytes:

0xa0 0x20 0x82 0x20 0xa1 0x20 0xa2 0x20 0xa3 0x20 0xa4 0x20 0x87 0x20 0xb5 0x20 0xd6 0x20 0xe0 0x20 0xe9 0x20 0xa5 0x20 0x80 0xfe 0x73 0x64 0x73

(á translates to ' ', which is chr(160), or 0xA0 in hex; é is '‚', chr(130), 0x82 in hex; í is '¡', chr(161), 0xA1 in hex; and so on.)
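As a cross-check of the values above, the following sketch compares them with Python's built-in cp850 codec. That the OEM code page in play is 850 is an assumption based only on the byte values listed here; nothing in the post confirms it, and characters that cp850 does not contain (such as '•') would still need separate handling.

```python
# Assumption: the OEM code page is 850 (not stated in the post).
texto = "á é í ó ú ñ ç Á Í Ó Ú Ñ Ç"
datos = texto.encode("cp850")

# Print the bytes in the same 0x.. style used for the Delphi output.
print(" ".join(f"0x{b:02x}" for b in datos))

# The two characters whose OEM values fall on non-printable positions:
problematicos = "Éü".encode("cp850")
print(problematicos)  # b'\x90\x81'
```

If the assumption holds, the first line of output should line up with the accented characters and spaces in the Delphi byte list.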

In Python, when I try to encode this to OEM I do the following:

def convertir_string_a_universe(cadena_python):
    resultado = ''
    for letra in cadena_python:
        resultado += TablaString2UV[letra]
    return resultado

And then, to get the bytes:

txt_registro = convertir_string_a_universe(txt_orig)
datos = bytes(txt_registro, 'cp1252')

With this I get the following bytes:

b'\xa0 \x82 \xa1 \xa2 \xa3 \xa4 \x87 \xb5 \xd6 \xe0 \xe9 \xa5 \x80 \x9a'

My problem is that this OEM encoding uses non-printable characters, such as 'É' = chr(144) (0x90 in hex). If I call bytes(txt_registro, 'cp1252') on a string in which 'É' has been translated to chr(0x90), I get this error:

caracteres_mal = 'Éü'
txt_registro = convertir_string_a_universe(caracteres_mal)
datos = bytes(txt_registro, 'cp1252')

  File "C:\Users\Hector\PyCharmProjects\pyuniverse\pyuniverse\UniverseRegister.py", line 138, in reconstruir_registro_universe
    datos = bytes(txt_registro, 'cp1252')
  File "C:\Users\Hector\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\x90' in position 0: character maps to <undefined>

How can I do this OEM encoding without raising this UnicodeEncodeError?

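This is because cp1252 has no character assigned to byte 0x90, so it cannot encode chr(0x90). If you try with utf-8 instead, no error is raised, although the result is two bytes rather than the single byte 0x90: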

>>> chr(0x90).encode("utf8")
b'\xc2\x90'

I don't understand why you are trying to convert to cp1252, though: you have already applied a custom conversion map, and then, with bytes(txt_registro, 'cp1252'), you are converting your result again, this time to cp1252.
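For the narrower problem of getting the already-translated string out as bytes, one option is the latin-1 codec, which maps each code point from 0 to 255 to the byte with the same value. The sketch below only illustrates that idea and is not part of the original post; the variable names are made up, and it assumes every character in the translated string is below chr(256).

```python
# Stand-in for the output of convertir_string_a_universe('Éü'):
# the question maps 'É' to chr(144) and 'ü' to chr(129).
txt_registro = chr(144) + chr(129)

# cp1252 leaves 0x90 and 0x81 unassigned, so encoding fails.
try:
    txt_registro.encode("cp1252")
    fallo_cp1252 = False
except UnicodeEncodeError:
    fallo_cp1252 = True

# latin-1 maps code points 0-255 one-to-one onto byte values.
datos = txt_registro.encode("latin-1")

print(fallo_cp1252)  # True
print(datos)         # b'\x90\x81'
```

Note that dictionary values written as cp1252 glyphs, such as '‚' (U+201A) or '€' (U+20AC), are above chr(255) in Unicode, so they would have to be written as chr(130), chr(128), and so on for this approach to work.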

I think what you are looking for is something like:

datos = bytes(txt_orig, 'uv')

where uv is your custom codec.

So you would have to write an encoder and a decoder for it (which is basically what you have done already). Take a look at https://docs.python.org/3/library/codecs.html#codecs.register to register a new codec. The function you register with it should return a CodecInfo object, which is described earlier in the same documentation page.

import codecs

def buscar_a_uv(codec):
    if codec == "uv":
        return codecs.CodecInfo(
            convertir_string_a_universe, convertir_universe_a_string, name="uv")
    else:
        return None

codecs.register(buscar_a_uv)
datos = bytes(txt_orig, 'uv')

EDIT

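The encoder must return bytes (more precisely, a tuple of the encoded bytes and the number of characters consumed), and the decoder must return the matching tuple of a str and the number of bytes consumed, so you would need to update convertir_string_a_universe a bit.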
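A minimal sketch of what such a codec could look like is shown below. It is an illustration under stated assumptions, not the poster's actual code: the names TABLA_STR_A_UV, codificar_uv, decodificar_uv and buscar_uv are hypothetical, the table holds only four of the mappings from the question, and unmapped characters are simply passed through by code point.

```python
import codecs

# Hypothetical, shortened table (Unicode character -> OEM byte value).
# The four values are copied from the question's dictionary.
TABLA_STR_A_UV = {"á": 0xA0, "é": 0x82, "É": 0x90, "ü": 0x81}
TABLA_UV_A_STR = {valor: letra for letra, valor in TABLA_STR_A_UV.items()}


def codificar_uv(texto, errors="strict"):
    """Encoder: return (bytes, number of characters consumed)."""
    # Unmapped characters fall back to their own code point, so ASCII
    # passes through; bytes() raises ValueError for anything above 255.
    datos = bytes(TABLA_STR_A_UV.get(letra, ord(letra)) for letra in texto)
    return datos, len(texto)


def decodificar_uv(datos, errors="strict"):
    """Decoder: return (str, number of bytes consumed)."""
    # The input may arrive as a memoryview, so normalise it to bytes first.
    texto = "".join(TABLA_UV_A_STR.get(b, chr(b)) for b in bytes(datos))
    return texto, len(datos)


def buscar_uv(nombre):
    # Search functions receive the normalised (lower-case) codec name.
    if nombre == "uv":
        return codecs.CodecInfo(codificar_uv, decodificar_uv, name="uv")
    return None


codecs.register(buscar_uv)

datos = bytes("Éü", "uv")
print(datos)               # b'\x90\x81'
print(datos.decode("uv"))  # Éü
```

The errors argument is accepted but ignored here; a fuller version would presumably honour 'replace' and 'ignore' as the built-in codecs do.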
