Extra “hidden” characters messing with equals test in SQL
I am validating a database (Oracle) migration and writing scripts to make sure the target matches the source. My script is returning values that look equal when you read them. However, they are not.
For instance, the target has PREAPPLICANT and the source has PREAPPLICANT. When you look at them as text, they look fine. But when I converted them to hex, the target shows 50 52 45 41 50 50 4c 49 43 41 4e 54 and the source shows 50 52 45 96 41 50 50 4c 49 43 41 4e 54. So there is an extra 96 in the hex for the source.
So, my questions are:
What is the 96 char?

It looks like the source data is in the Windows-1252 character set. https://en.wikipedia.org/wiki/Windows-1252
In Windows-1252, byte 96 (hex) is an en dash. This makes sense, since a dash fits naturally between PRE and APPLICANT.
One user provided "PREAPPLICANT" and another provided "PRE-APPLICANT", and Windows helpfully converted the hyphen into an en dash.
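To double-check that reading outside the database, here is a small Python sketch (Python rather than SQL so it runs without an Oracle instance). It takes the two hex dumps from the question and decodes them as Windows-1252. The variable names are mine, and it assumes the source bytes really are Windows-1252.

```python
# Hex dumps copied from the question.
target_hex = "50 52 45 41 50 50 4c 49 43 41 4e 54"
source_hex = "50 52 45 96 41 50 50 4c 49 43 41 4e 54"

# Assumption: the source column holds Windows-1252 (cp1252) bytes.
target = bytes.fromhex(target_hex).decode("cp1252")
source = bytes.fromhex(source_hex).decode("cp1252")

# ascii() escapes non-ASCII characters, which makes the hidden one visible.
print(ascii(target))
print(ascii(source))

# Byte 0x96 in cp1252 maps to U+2013 (en dash).
print(source[3] == "\u2013")
```

If the data were in a different single-byte character set, byte 96 (hex) could map to something else, so it is worth confirming the source database character set first.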
As such, this doesn't appear to be an error in the data so much as a character set issue. You should be able to filter these characters out without too much effort, but then you are changing data. It's kind of like when one person enters "Mr Jones" and another enters "Mr. Jones": you have to decide how much data massaging you want to do.
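One way to avoid changing stored data is to normalize only at comparison time. The Python sketch below illustrates that idea. The helper name `comparable` and the choice of which characters to ignore are hypothetical; they stand in for whatever policy you settle on.

```python
# Hypothetical policy: ignore hyphen, en dash and em dash when comparing.
DASHES = str.maketrans("", "", "-\u2013\u2014")

def comparable(value: str) -> str:
    """Return a copy of value with dash-like characters removed."""
    return value.translate(DASHES)

source_value = "PRE\u2013APPLICANT"  # en dash, as decoded from the source hex
target_value = "PREAPPLICANT"

# The raw values differ, but they match under the chosen policy.
print(source_value == target_value)
print(comparable(source_value) == comparable(target_value))
```

The stored values stay untouched; only the validation script treats them as equivalent, and rows that match only after normalization can still be reported separately.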
As you have probably already done, use the DUMP function to get the byte representation of any data you wish to inspect for weirdness.
Here's some text with plain ASCII:
select dump('Dashes-and "smart quotes"') from dual;
Typ=96 Len=25: 68,97,115,104,101,115,45,97,110,100,32,34,115,109,97,114,116,32,113,117,111,116,101,115,34
Now introduce funny characters:
select dump('Dashes—and “smart quotes”') from dual;
Typ=96 Len=31: 68,97,115,104,101,115,226,128,148,97,110,100,32,226,128,156,115,109,97,114,116,32,113,117,111,116,101,115,226,128,157
In this case, the number of bytes increased because my DB is using UTF8. Numbers outside the valid range for ASCII (above 127) stand out and can be inspected further.
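The same check can be reproduced off-database. This Python sketch encodes the sample string as UTF-8 and lists every byte above 127 with its position. It is only an illustration of the idea, not a replacement for DUMP, and the variable names are mine.

```python
# Same sample text as the DUMP example, written with escapes for clarity.
text = "Dashes\u2014and \u201csmart quotes\u201d"

data = text.encode("utf-8")
print(len(data))   # byte length, comparable to Len= in the DUMP output
print(list(data))  # decimal byte values, comparable to the DUMP list

# Bytes above 127 are outside ASCII and mark the characters to inspect.
suspicious = [(index, byte) for index, byte in enumerate(data) if byte > 127]
print(suspicious)
```

Each non-ASCII character here takes three bytes in UTF-8, which accounts for the jump from 25 to 31 bytes.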
Here's another way to see the special characters:
select asciistr('Dashes—and “smart quotes”') from dual;
Dashes\2014and \201Csmart quotes\201D
This one converts non-ASCII characters into backslash-prefixed Unicode hex.
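For comparison, here is a rough Python approximation of that escaping. The function name `asciistr_like` is hypothetical, and it only covers the non-ASCII case shown above, writing each UTF-16 code unit as a backslash plus four hex digits. It does not claim to match every detail of Oracle's ASCIISTR.

```python
def asciistr_like(text: str) -> str:
    """Escape non-ASCII characters as backslash plus 4 hex digits per UTF-16 unit."""
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)
            continue
        units = ch.encode("utf-16-be")
        # Characters outside the BMP produce two code units (a surrogate pair).
        for i in range(0, len(units), 2):
            pieces.append("\\%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(pieces)

print(asciistr_like("Dashes\u2014and \u201csmart quotes\u201d"))
```

Running the escaped form of both source and target values through a diff makes a hidden character such as 2013 (en dash) easy to spot.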