
Encode and decode - Hebrew string looks like gibberish - Python 3 from Teradata

I connected to my Teradata DB and extracted a simple query result into a dataframe. The dataframe (df_clients) has 1 column (FirstName) and 1 row containing a Hebrew string ('שלום לך'). When I print the dataframe I get gibberish instead of Hebrew: ùìåí ìê

I found a solution that encodes and then decodes the string:

strr = "ùìåí ìê"

print(strr)  # not good: prints the garbled text

print(strr.encode('cp1252').decode('cp1255', errors='replace'))  # good: prints the Hebrew text
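To see why this round trip works, it may help to reproduce the garbling itself. The sketch below assumes the database returned bytes in cp1255 (Windows Hebrew) that were decoded somewhere along the way as cp1252 (Windows Western European); that is an inference from the symptoms, not something confirmed about the Teradata connection. The variable names are illustrative only.

```python
# Assumption: the source bytes are cp1255 but were decoded as cp1252.
original = "שלום לך"

raw_bytes = original.encode("cp1255")   # the bytes the database presumably holds
garbled = raw_bytes.decode("cp1252")    # the wrong decoding produces the gibberish

# Reversing the wrong step recovers the bytes; decoding them correctly
# recovers the text.
repaired = garbled.encode("cp1252").decode("cp1255")

print(garbled)
print(repaired)
```

If that assumption holds, fixing the character set on the connection itself would avoid the repair step entirely.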

It worked, but when I tried the same solution on a Pandas DataFrame it didn't work (no error, but it has no effect):

df_clients.FirstName.apply(lambda x : x.encode('cp1252').decode('cp1255',errors='replace') )

The Pandas equivalent of your string encoding is .str.encode and .str.decode. I generated a similar example as a demo:

import pandas as pd

dataf = pd.DataFrame({
    'name':["ùìåí ìê", "ùìåí ìê"]
})

dataf["name"].str.encode('cp1252').str.decode('cp1255',errors='replace')

# result
# 0    שלום לך
# 1    שלום לך
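One likely reason the attempt in the question seemed to do nothing is that neither `.apply` nor the `.str` methods modify the DataFrame in place; they return a new Series that has to be assigned back. A minimal sketch, where `df_clients` is a hypothetical stand-in for the Teradata query result:

```python
import pandas as pd

# Hypothetical stand-in for the dataframe returned by the Teradata query.
df_clients = pd.DataFrame({"FirstName": ["ùìåí ìê"]})

# This builds a repaired Series but discards it; df_clients is untouched.
df_clients["FirstName"].str.encode("cp1252").str.decode("cp1255")
unchanged = df_clients["FirstName"].iloc[0]

# Assigning the result back is what actually updates the column.
df_clients["FirstName"] = (
    df_clients["FirstName"].str.encode("cp1252").str.decode("cp1255")
)
fixed = df_clients["FirstName"].iloc[0]

print(unchanged)
print(fixed)
```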

Update

Assign the result back, and use a solution that can handle all rows and columns.

import pandas as pd

dataf = pd.DataFrame({
    'first_name':["ùìåí ìê", "ùìåí ìê"], 'last_name': ["ùìåí ìê", "ùìåí ìê"]
})

# all columns [assuming all are text]

dataf = dataf.transform(lambda x: x.str.encode('cp1252', errors='replace').str.decode('cp1255', errors='replace'))

# or selecting subset of columns

dataf[["first_name","last_name"]] = dataf[["first_name","last_name"]].transform(lambda x: x.str.encode('cp1252', errors='replace' ).str.decode('cp1255',errors='replace'))

dataf

#    first_name last_name
# 0 שלום לך שלום לך
# 1 שלום לך שלום לך
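The "all columns" variant above assumes every column holds text. For a frame with mixed types, one possible approach is to repair only the string columns. The helper name `repair_text_columns` and the `age` column below are made up for illustration; `pd.api.types.is_string_dtype` is used to pick the columns.

```python
import pandas as pd


def repair_text_columns(df, wrong="cp1252", right="cp1255"):
    """Return a copy of df with every string column re-decoded."""
    out = df.copy()
    for col in out.columns:
        # Skip numeric, datetime, and other non-text columns.
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.encode(wrong).str.decode(right)
    return out


dataf = pd.DataFrame({
    "first_name": ["ùìåí ìê", "ùìåí ìê"],
    "age": [30, 41],  # hypothetical non-text column
})

repaired = repair_text_columns(dataf)
print(repaired)
```

Returning a copy keeps the original frame available for comparison while checking the result.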

If there are errors, try setting errors='ignore' and check which characters were fixed and which are still gibberish.
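Since `errors='ignore'` and `errors='replace'` silently drop or substitute characters, it can be useful to first find out which values fail a strict conversion. The sketch below uses a hypothetical helper, `try_repair`, that returns `None` for any value that cannot be round-tripped, for example a string that is already valid Hebrew.

```python
def try_repair(text, wrong="cp1252", right="cp1255"):
    """Strictly re-decode text; return None if either step fails."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


samples = ["ùìåí ìê", "שלום"]  # the second value is already correct Hebrew
results = [try_repair(s) for s in samples]

# Values that could not be repaired, worth inspecting by hand.
failed = [s for s, r in zip(samples, results) if r is None]

print(results)
print(failed)
```

With a DataFrame, the same helper could be passed to `Series.map` to flag the problem rows before choosing an error mode.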


 