Hashing DWG files - empty models do not receive the same hash code

Question

I want to use a hash function (SHA1) to determine whether two DWG models from AutoCAD 2018 are identical or not. To test if this works, I created two empty DWG files and used the SHA1 function in C# (System.Security.Cryptography.SHA1) to calculate the hash code.

I was expecting the resulting codes to be identical, since I only started two new empty models and then saved them directly. The only data which should differ is the name and the date and time of creation (all metadata, which shouldn't play a role for the algorithm). Yet the result was two different hash codes.

So I looked into the byte arrays of the two files. The only thing I could get out of them (a lot of lines which I could not interpret in any way) was that some of the metadata, such as the date and time of creation, is written into the byte array. Therefore the two files are actually not identical, even though the models are. So there will always be a different hash code for every hashed DWG model.

Does anyone know of this problem or a workaround for my issue?

Code used to calculate the hash:

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static void GetFileHash(string filePath)
        {
            byte[] bArray = File.ReadAllBytes(filePath);
            string hashCode;

            // SHA1.Create() is used here because SHA1CryptoServiceProvider is marked obsolete in newer .NET versions
            using (var sha1 = SHA1.Create())
            {
                hashCode = BitConverter.ToString(sha1.ComputeHash(bArray));
            }

            // save file path, hash and byte array to text file
            string[] lines = { filePath, hashCode, Encoding.Default.GetString(bArray) };
            File.WriteAllLines(@".\HashTest.txt", lines);
        }

Snippet of the byte array in which the saving time and date are captured:

A pp I nfo D ata L ist �H��

M��ρ��P� 2 2. 0. 4 9. 0. 0 ��%�דI��(o��r A utodesk DW G. T hisfileisa T r usted DWG lastsavedbyan A utodeskappli c ationo r A utodeskli c ensedappli c atio n. ��Oh�� +'��J < st r ing > M ax < / st r ing > < datetime > 2 0 2 0 - 0 7 - 1 4 T 0 9: 5 1: 3 3 < / datetime > < st r ing > A uto C AD 2 0 1 8 < / st r ing > < st r ing > O. 4 9. 0. 0 < / st r ing > < datetime > 2 0 2 0 - 0 7 - 1 4 T 0 9: 5 1: 2 9 < / datetime > ��Q�βD��;��D� " < P r odu c t I nfo r mationname = \ " A uto C AD \ " build _ ve r sion = \ " O. 4 9. 0. 0 ( x 6 4 ) \ " r egist r y _ ve r sion = \ " 2 2. 0 \ " install _ id _ st r ing = \ " A C AD - 1 0 0 1: 4 0 9 \ " r egist r y _ lo c ale ID = \ " 1 0 3 3 \ " > "

Answer 1

When a new empty model is created, AutoCAD adds some things for you, such as layer "0", the model space and the paper space. These are added to the CAD database table records and are assigned IDs, which will differ no matter how many new models you create.

For example, when you look at the data assigned to layer "0" of two new empty models, it looks like this:

// Layer "0" from Model 1
"((-1,(2956860066048))(0,LAYER)(5,10)(102,{ACAD_XDICTIONARY)(360,(2956924193344))(102,})(330,(2956860065824))(100,AcDbSymbolTableRecord)(100,AcDbLayerTableRecord)
(2,0)(70,0)(62,7)(6,Continuous)(290,1)(370,-3)(390,(2956860066032))(347,(2956860067552))(348,(0)))"

// Layer "0" from Model 2
"((-1,(2991422834944))(0,LAYER)(5,10)(102,{ACAD_XDICTIONARY)(360,(2991422843456))(102,})(330,(2991422834720))(100,AcDbSymbolTableRecord)(100,AcDbLayerTableRecord)
(2,0)(70,0)(62,7)(6,Continuous)(290,1)(370,-3)(390,(2991422834928))(347,(2991422836448))(348,(0)))"

Those are quite different, and they end up in the .dwg binary content whose hash you are comparing; what exactly the file stores inside is not known. For the hashes to be the same, you would need to copy and paste the file.

A workaround could be to save the file in DXF format instead, which is human readable, so you can ignore some of the data when deciding whether the files are identical.
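A minimal sketch of that idea, assuming the drawings were saved as ASCII DXF (alternating lines of group code and value). The class name DxfHashSketch, the method name and the list of header variables to skip are assumptions made for illustration; the list is probably not complete, and other data (handles, for instance) may also need to be ignored before two empty drawings hash the same.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static class DxfHashSketch
{
    // Assumption: header variables that change on every save. Extend as needed.
    static readonly HashSet<string> VolatileHeaderVars = new HashSet<string>
    {
        "$TDCREATE", "$TDUCREATE", "$TDUPDATE", "$TDUUPDATE",
        "$TDINDWG", "$TDUSRTIMER", "$FINGERPRINTGUID", "$VERSIONGUID"
    };

    // Hashes an ASCII DXF given as lines (group code, value, group code, value, ...),
    // leaving out the header variables listed above together with their values.
    public static string HashIgnoringVolatileData(IReadOnlyList<string> lines)
    {
        var kept = new StringBuilder();
        bool skipping = false;

        for (int i = 0; i + 1 < lines.Count; i += 2)
        {
            string code = lines[i].Trim();
            string value = lines[i + 1];

            if (code == "9")
            {
                // Group code 9 names a header variable; its values follow.
                skipping = VolatileHeaderVars.Contains(value.Trim());
            }
            else if (code == "0")
            {
                // Group code 0 starts a new entity or section marker.
                skipping = false;
            }

            if (!skipping)
            {
                kept.Append(code).Append('\n').Append(value).Append('\n');
            }
        }

        using (var sha1 = SHA1.Create())
        {
            byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(kept.ToString()));
            return BitConverter.ToString(hash);
        }
    }
}
```

It could be called as DxfHashSketch.HashIgnoringVolatileData(File.ReadAllLines(path)) for each file, comparing the two returned strings.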

Answer 2

Nowhere does any sort of a hash enter the picture, so to speak. You don't care what's in the DWG file, you care that it looks the same!

In general this can't work unless the file you're hashing is in a canonical form. That would also mean that such a canonical form has to exist, which is not a given at all. I don't know offhand whether there is a canonical form of DWG, or whether it's even an easy problem to create one such that visually identical results would have the same DWG form. In general, the answer to this is a solid no.

The summary of the problem is that files that appear visually identical when rendered at a particular resolution may have an extremely large number of possible permutations in their DWG representation - in practice you could consider it infinite.

So, you cannot hash the DWG file itself, not even if you had a DWG library that could read the primitives - you'd need to use a computational geometry library to attempt to process the data into some canonical form.

The best you can do with limited resources is to rasterize both DWGs, and then run a visual comparison between the bitmaps thus obtained, at a reasonable printout resolution (say 600-1200 dpi). The comparison algorithm should be insensitive to common pitfalls such as different "hardness" of line edges, but it should be sensitive to changes that alter the visual meaning of the document - this may not be straightforward either, and you'd need to look for a library/package that can do such comparisons for you, specifically targeted at line art and not photographs. The files can also have different absolute coordinates for the primitives (due to transformations yielding the same output), and slightly different scaling.

Thus, in all likelihood, you'd need to first visually align the bitmaps (there are algorithms for this that can have e.g. translation, scaling and rotation as degrees of freedom), then re-rasterize the second image in the pair taking the alignment into account, and then run a visual comparison and produce some sort of a "score". You could also highlight the differences and display them on screen.
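As a rough illustration of the final scoring step only, here is a sketch that assumes both drawings have already been rasterized to grayscale bitmaps of the same size and aligned by some other tool. The class BitmapCompareSketch, the method DifferenceScore and the tolerance parameter are hypothetical names, and a per-pixel tolerance is a far simpler stand-in for a comparison that is properly tuned for line art.

```csharp
using System;

static class BitmapCompareSketch
{
    // Returns the fraction (0.0 to 1.0) of pixels whose gray values differ
    // by more than the given tolerance. The tolerance absorbs small
    // differences in edge "hardness" between the two renderings.
    public static double DifferenceScore(byte[,] a, byte[,] b, int tolerance)
    {
        if (a.GetLength(0) != b.GetLength(0) || a.GetLength(1) != b.GetLength(1))
        {
            throw new ArgumentException("Bitmaps must have the same dimensions.");
        }

        int rows = a.GetLength(0);
        int cols = a.GetLength(1);
        if (rows == 0 || cols == 0)
        {
            return 0.0;
        }

        long differing = 0;
        for (int y = 0; y < rows; y++)
        {
            for (int x = 0; x < cols; x++)
            {
                if (Math.Abs(a[y, x] - b[y, x]) > tolerance)
                {
                    differing++;
                }
            }
        }

        // long arithmetic avoids overflow for large high-resolution rasters.
        return (double)differing / ((long)rows * cols);
    }
}
```

A score of 0.0 means no pixel differs beyond the tolerance; what threshold counts as "the same drawing" is a judgment call that depends on the resolution and the renderer.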
