
Python 3 - Hebrew encoding problems

I have a server written in FastAPI (Python 3.8.13) that receives form data from an external service; the data may include Hebrew letters. Until recently, the data arrived through a proxy server written in Python 2.7, and everything worked well. The proxy code looked something like this:

from requests import post
from bottle import route, request
@route('/', method='POST')
def new_req():
    post('https://....', dict(request.forms))
    return ''

The backend server code looks something like this:

from fastapi import FastAPI, Form, Request
app = FastAPI()

@app.post('/....')
async def get_message(request: Request, message: str = Form(...)):
    print(message)

As long as the data arrived via the proxy, there were no problems with Hebrew letters. Once we asked the service to send the messages directly to the backend server, a string like 'שלום' showed up as 'שלו×'. The service states that the Hebrew data is sent as Unicode, so I tried something like this:

@app.post('/....')
async def get_message(request:Request, message:bytes=Form(...)):
    print(message)
    message = message.decode('utf-8')
    print(message)

The result (for: 'שלום'):

b'\xc3\x97\xc2\xa9\xc3\x97\xc2\x9c\xc3\x97\xc2\x95\xc3\x97\xc2\x9d'
ש×××

I tried replacing 'utf-8' with other encodings, both international and Hebrew-specific, and each time I got a new kind of gibberish. Does anyone have an idea what else to try?

This is a “double-UTF-8-encoded” string. That is, you start with the sequence of characters:

  1. U+05E9 HEBREW LETTER SHIN
  2. U+05DC HEBREW LETTER LAMED
  3. U+05D5 HEBREW LETTER VAV
  4. U+05DD HEBREW LETTER FINAL MEM

Encode them into UTF-8,

b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'

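Misinterpret this byte string as being ISO-8859-1 encoded, producing the nonsense string below (the bytes 0x9C, 0x95 and 0x9D become invisible C1 control characters, so only some of the characters show up when printed)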

ש×××

Then encode that string into UTF-8 a second time, which yields the observed byte sequence:

b'\xc3\x97\xc2\xa9\xc3\x97\xc2\x9c\xc3\x97\xc2\x95\xc3\x97\xc2\x9d'
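The three steps above can be reproduced in a few lines of Python 3. This is only a sketch to confirm the diagnosis; the variable names are made up for illustration, and the Hebrew word is written with escapes so the snippet does not depend on the file's own encoding.

```python
# 'שלום' written as code points: SHIN, LAMED, VAV, FINAL MEM
original = '\u05e9\u05dc\u05d5\u05dd'

# Step 1: the correct UTF-8 encoding of the text
utf8_bytes = original.encode('utf-8')

# Step 2: the bytes are wrongly decoded as ISO-8859-1 (one character per byte)
misread = utf8_bytes.decode('iso-8859-1')

# Step 3: the nonsense string is encoded into UTF-8 again
double_encoded = misread.encode('utf-8')

print(utf8_bytes)
print(double_encoded)
```

The second printed value should match the bytes object shown in the question.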

So, you need to figure out where this redundant encode operation is occurring, and find a way to tell your program “This bytes object is already UTF-8, so don't re-encode it.”
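Until the source of the extra encode is found, the damage can be reversed after the fact by running the mistaken step backwards: encode the mangled string as ISO-8859-1 to recover the original bytes, then decode those as UTF-8. The following is a minimal sketch of that stopgap, not a proper fix; the helper name `undo_double_utf8` is hypothetical, and the approach is a heuristic that could misfire on text that legitimately contains sequences such as 'Ã©'.

```python
def undo_double_utf8(text: str) -> str:
    """Reverse one round of UTF-8 -> ISO-8859-1 mis-decoding, if present.

    Hypothetical helper for illustration. Returns the input unchanged
    when it does not look double-encoded.
    """
    try:
        # Recover the original UTF-8 bytes, then decode them properly.
        return text.encode('iso-8859-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Correctly decoded Hebrew cannot be encoded as ISO-8859-1, and
        # ordinary Latin-1 text is usually not valid UTF-8: leave both alone.
        return text


# The bytes from the question, decoded once as the handler did
mangled = b'\xc3\x97\xc2\xa9\xc3\x97\xc2\x9c\xc3\x97\xc2\x95\xc3\x97\xc2\x9d'.decode('utf-8')
repaired = undo_double_utf8(mangled)
print(repaired)
```

In the FastAPI handler from the question, this would be applied to `message` after it has been decoded once. Because already-correct strings pass through unchanged, the call should be safe to leave in place while the sender is being fixed.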
