Extra “hidden” characters messing with equals test in SQL

I am validating an Oracle database migration and writing scripts to make sure the target matches the source. My script is returning values that look equal when you read them. However, they are not.

For instance, the target has PREAPPLICANT and the source has PREAPPLICANT. As text they look identical. But when I convert them to hex, the target shows 50 52 45 41 50 50 4c 49 43 41 4e 54 and the source shows 50 52 45 96 41 50 50 4c 49 43 41 4e 54. So there is an extra 96 in the source's hex, between PRE and APPLICANT.

So, my questions are:

  1. What is the 96 char?
  2. Would you say that the target has incorrect data because it did not bring the char over? I realize this question may be a little subjective, but I'm asking it from the standpoint of "what is this character and how did it get here?"
  3. Is there a way to ignore this character in the SQL script so that the equality check passes? (Do I want the equality to pass or fail here?)

It looks like you have the Windows-1252 character set here: https://en.wikipedia.org/wiki/Windows-1252

In Windows-1252, character 96 (hex) is an en dash. This makes sense, as the source data was PRE–APPLICANT.
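
To check which character set is actually in play, a query along these lines could be run on both the source and the target. This is a minimal sketch: it assumes you can read the standard NLS_DATABASE_PARAMETERS view on each side, and the Windows-1252 reading of byte 96 only holds if the source turns out to be WE8MSWIN1252 or something compatible.

```sql
-- Run on both databases and compare the results.
-- NLS_CHARACTERSET applies to CHAR/VARCHAR2 data;
-- NLS_NCHAR_CHARACTERSET applies to NCHAR/NVARCHAR2 data.
SELECT parameter, value
FROM   nls_database_parameters
WHERE  parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');
```

If the two sides report different character sets, that difference is worth ruling out before blaming the migration logic.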

Most likely, one user provided "PREAPPLICANT" and another provided "PRE-APPLICANT", and Windows helpfully converted their plain hyphen into an en dash.

As such, this doesn't appear to be an error in the data, more an error in character sets. You should be able to filter these characters out without too much effort, but then you are changing data. It's kind of like when one person enters "Mr Jones" and another enters "Mr. Jones": you have to decide how much data massaging you want to do.
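
If you decide the dash is noise for the purposes of validation, one option is to strip it on the source side of the comparison only, leaving the stored data untouched. The sketch below is an illustration under stated assumptions: source_codes, target_codes, id and code_value are hypothetical names, and CHR(150) matches byte 96 (hex) only in a single-byte character set such as WE8MSWIN1252.

```sql
-- Hypothetical tables and columns; substitute your own.
-- Lists rows that still differ after the en dash (byte 0x96 = decimal 150)
-- is removed from the source value.
-- REPLACE with two arguments deletes every occurrence of the search string.
-- Note: rows where either value is NULL are not returned by <>.
SELECT s.id,
       s.code_value AS source_value,
       t.code_value AS target_value
FROM   source_codes s
JOIN   target_codes t ON t.id = s.id
WHERE  REPLACE(s.code_value, CHR(150)) <> t.code_value;
```

In a Unicode database (AL32UTF8), UNISTR('\2013') would be the way to name the en dash instead of CHR(150). Whether the check should pass at all is still your call: a stricter script would report these rows separately rather than hide them.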

As you have probably already done, use the DUMP function to get the byte representation of any data you wish to inspect for weirdness.

Here's some text with plain ASCII:

select dump('Dashes-and "smart quotes"') from dual;

Typ=96 Len=25: 68,97,115,104,101,115,45,97,110,100,32,34,115,109,97,114,116,32,113,117,111,116,101,115,34

Now introduce funny characters:

select dump('Dashes—and “smart quotes”') from dual;

Typ=96 Len=31: 68,97,115,104,101,115,226,128,148,97,110,100,32,226,128,156,115,109,97,114,116,32,113,117,111,116,101,115,226,128,157

In this case, the number of bytes increased from 25 to 31 because my DB is using UTF8, where the dash and each smart quote take three bytes instead of one. Numbers outside the valid ASCII range (above 127) stand out and can be inspected further.
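
Since the question already works in hex, it may help that DUMP can produce hex directly. As a sketch: the optional second argument selects the number format (16 for hexadecimal), and adding 1000 to it also prints the character set name of the value.

```sql
-- 16   = byte values in hexadecimal
-- 1016 = hexadecimal plus the character set name
SELECT DUMP('Dashes-and "smart quotes"', 16)   AS hex_only,
       DUMP('Dashes-and "smart quotes"', 1016) AS hex_with_charset
FROM   dual;
```

Pointing the same call at the suspect column on both source and target should show the extra 96 byte in context, along with the character set each side believes it is using.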

Here's another way to see the special characters:

select asciistr('Dashes—and “smart quotes”') from dual;

Dashes\2014and \201Csmart quotes\201D

This one converts non-ASCII characters into backslash-escaped Unicode hex.
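
ASCIISTR can also be turned into a filter that finds suspect rows in bulk. A minimal sketch, where source_codes and code_value are hypothetical names standing in for your own table and column:

```sql
-- Hypothetical table and column; substitute your own.
-- ASCIISTR rewrites every non-ASCII character as \XXXX, so a row whose
-- escaped form differs from the stored value contains at least one
-- character that ASCIISTR had to escape.
SELECT code_value,
       ASCIISTR(code_value) AS escaped_value
FROM   source_codes
WHERE  ASCIISTR(code_value) <> code_value;
```

One caveat, as far as I know: ASCIISTR also escapes a literal backslash, so values containing one will show up here too.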
