
Will these optimizations to my Ruby implementation of diff improve performance in a Rails app?

Original: 2010-05-25 11:26:56 5 1 ruby-on-rails/ ruby/ version-control/ optimization/ levenshtein-distance

<tl;dr>
When generating diff patches for source version control, would it be worth using the optimizations listed at the very bottom of this post (see <optimizations>) in my Ruby implementation of diff?
</tl;dr>

<introduction>
I am programming something I have never done before. There may already be tools that do exactly what I am building, but at this point I am having too much fun to care, so I am going to do it from scratch anyway.

So anyway, I am working on a Ruby on Rails app and need a certain feature. Basically, I want each entry in one of my tables (say, a table of video games) to have a stored chunk of text that represents a review or something similar for that entry. However, I want this text to be editable by any registered user, and I want the different submissions to be tracked in a version control system. The simplest solution I could think of is to keep the text body and the diff patch history of its versions as Ruby objects and then serialize them, preferably in human-readable form (so I'll most likely use YAML), so they can be edited by hand if a software bug corrupts them or an admin makes a mistake while editing versions.

At first I dove head first into this feature, only to find that generating a diff patch efficiently is more difficult than I thought. So I did some research and came across some ideas. Some I have implemented already and some I have not. However, it all pretty much revolves around the longest common subsequence problem, as you will know if you have ever worked on diff or diff-like features, and around optimizing the function that solves it.

Currently my implementation truncates the two compared versions of the text body from the beginning and from the end until non-matching lines are found. It then solves the problem using a comparison matrix, but instead of incrementing the value stored in a cell when it finds a matching line, as most longest common subsequence examples I have seen do, it increments when it finds a non-matching line, so that it calculates edit distance instead of longest common subsequence. As far as I can tell, the two approaches are essentially two sides of the same coin, so either could be used to derive an answer. It then back-traces through the comparison matrix, notes where an increment occurred and from which adjacent cell (West, Northwest, or North) to determine that line's diff entry, and assumes all other lines are unchanged.
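The approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual code from the app: the helper names `trim_common` and `diff_lines` are hypothetical, and it assumes only insertions and deletions cost 1 (a Northwest step is taken only for matching lines), which may differ from how the real implementation scores substitutions.

```ruby
# Count how many leading and trailing lines the two versions share.
def trim_common(a, b)
  prefix = 0
  prefix += 1 while prefix < a.size && prefix < b.size && a[prefix] == b[prefix]
  suffix = 0
  suffix += 1 while suffix < a.size - prefix && suffix < b.size - prefix &&
                    a[-1 - suffix] == b[-1 - suffix]
  [prefix, suffix]
end

# Returns a list of [:delete | :insert, line_index, line] entries.
# Lines not mentioned in the result are assumed unchanged.
def diff_lines(old_lines, new_lines)
  prefix, suffix = trim_common(old_lines, new_lines)
  a = old_lines[prefix, old_lines.size - prefix - suffix]
  b = new_lines[prefix, new_lines.size - prefix - suffix]

  # dist[i][j] = edit distance between the first i lines of a
  # and the first j lines of b.
  dist = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  (0..a.size).each { |i| dist[i][0] = i }
  (0..b.size).each { |j| dist[0][j] = j }
  (1..a.size).each do |i|
    (1..b.size).each do |j|
      dist[i][j] = if a[i - 1] == b[j - 1]
                     dist[i - 1][j - 1]                       # match: no increment
                   else
                     1 + [dist[i - 1][j], dist[i][j - 1]].min # non-match: increment
                   end
    end
  end

  # Back-trace from the bottom-right corner to recover the edits.
  ops = []
  i = a.size
  j = b.size
  while i > 0 || j > 0
    if i > 0 && j > 0 && a[i - 1] == b[j - 1]
      i -= 1
      j -= 1                                   # Northwest: unchanged line
    elsif i > 0 && (j.zero? || dist[i - 1][j] <= dist[i][j - 1])
      i -= 1
      ops.unshift([:delete, prefix + i, a[i]]) # North: line removed from old
    else
      j -= 1
      ops.unshift([:insert, prefix + j, b[j]]) # West: line added in new
    end
  end
  ops
end
```

The matrix is only built for the region between the trimmed prefix and suffix, so an edit in the middle of a long text costs time proportional to the square of the changed region rather than of the whole text.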

Normally I would leave it at that, but since this is going into a Rails environment and not just some stand-alone Ruby script, I started worrying that I need to optimize at least enough that a spammer who somehow knew how I implemented the version control system, and knew its worst-case input, still wouldn't be able to hit the server that badly. After searching and reading research papers and articles on the internet, I've come across several optimizations that seem decent, but they all have pros and cons, and I am having a hard time deciding how those pros and cons balance out in this situation. So are the ones listed here worth it? I have listed them with their known pros and cons.
</introduction>

<optimizations>

1. Chop the compared sequences into multiple subsequences by splitting where lines are unchanged, and then truncating the unchanged lines at the beginning and end of each section. Then solve the edit distance of each subsequence.
- Pro : Changes how the running time grows with the size of the changed area, from quadratic to something closer to linear.
- Con : Figuring out where to split already seems to require solving edit distance, except that now you don't care how the text changed. This would be fine if it could be solved by something closer to Hamming distance, but a single insertion would throw that off.
2. Use a cryptographic hash function to both convert all sequence elements into integers and ensure uniqueness. Then solve the edit distance by comparing the hash integers instead of the sequence elements themselves.
- Pro : Comparing two integers is faster than comparing two strings, so there is a slight performance gain on every comparison, which can add up to a lot overall.
- Con : Converting all the sequence elements with a cryptographic hash function takes time and may end up costing more than you gain back from the integer comparisons. You could use the built-in hash function for strings, but that does not guarantee uniqueness.
3. Use lazy evaluation to calculate only the three center-most diagonals of the comparison matrix, and then calculate additional diagonals only as needed. Then also use this approach to possibly remove the need, on some comparisons, to compare all three adjacent cells, as described here.
- Pro : Can turn an algorithm that always takes O(n * m) time into one where only the worst case takes that long; the best case becomes practically linear and the average case falls somewhere between the two.
- Con : It is an algorithm I've only seen implemented in functional programming languages, and I am having a difficult time working out how to convert it into Ruby based on how it is described at the site linked above.
4. Make a C module that does the hard work at the native level, and write a Ruby wrapper for it so Ruby can make all the calls to it that it needs.
- Pro : I have to imagine that evaluating something like this in C could be a LOT faster.
- Con : I have no idea how Rails handles apps whose Ruby code has C extensions, and it hurts the portability of the app.
5. This is an optimization for after the edit distance has been solved. The idea is to store additional combined diffs alongside the ones produced by each version, forming a delta-tree data structure with the most recently made diff as the root node, so that getting to any version takes worst-case time of O(log n) instead of O(n).
- Pro : Would make going back to an old version a lot faster.
- Con : Every new commit would give the delta-tree a new root node, and reorganizing the tree costs time on an operation (committing) that is carried out far more often than going back a version, not to mention how unlikely it is that the version being requested is an old one.
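As a rough illustration of how the integer-conversion and diagonal ideas above might look in plain (strict) Ruby, here is a hedged sketch. Two substitutions are made and should be read as assumptions: instead of a cryptographic hash, each distinct line is given a small integer ID through an ordinary `Hash` lookup, which also guarantees uniqueness; and instead of lazy evaluation, the band of diagonals around the center is widened by doubling until the answer fits inside it. All method names (`intern_lines`, `banded_distance`, `edit_distance`) are hypothetical, and only insertions and deletions are counted.

```ruby
INF = Float::INFINITY

# Replace every line with a small integer; equal lines get equal IDs.
def intern_lines(*sequences)
  ids = {}
  sequences.map { |seq| seq.map { |line| ids[line] ||= ids.size } }
end

# Edit distance restricted to cells with |i - j| <= k.
# Each row is stored as 2k + 1 slots; slot d holds column j = i + d - k.
def banded_distance(a, b, k)
  n = a.size
  m = b.size
  return INF if (n - m).abs > k

  width = 2 * k + 1
  prev = Array.new(width) do |d|
    j = d - k
    j.between?(0, m) ? j : INF             # row 0: j insertions
  end

  (1..n).each do |i|
    cur = Array.new(width, INF)
    width.times do |d|
      j = i + d - k
      next if j < 0 || j > m

      cur[d] =
        if j.zero?
          i                                # column 0: i deletions
        elsif a[i - 1] == b[j - 1]
          prev[d]                          # Northwest cell, same diagonal
        else
          north = d + 1 < width ? prev[d + 1] : INF
          west  = d > 0 ? cur[d - 1] : INF
          1 + [north, west].min
        end
    end
    prev = cur
  end
  prev[m - n + k]
end

# Start with a narrow band and double it until the result fits inside.
# A result no larger than k cannot have needed cells outside the band.
def edit_distance(a, b)
  k = [1, (a.size - b.size).abs].max
  loop do
    d = banded_distance(a, b, k)
    return d if d <= k
    k *= 2
  end
end

old_ids, new_ids = intern_lines(%w[a b c d e], %w[a c d e f])
puts edit_distance(old_ids, new_ids)
```

When the two versions differ by only a few lines the band stays narrow and the work is close to linear in the text length; when they are completely different the band grows until it covers the whole matrix, which is the O(n * m) worst case mentioned above.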

</optimizations>

So are these things worth the effort?

1 answer

With regard to item 4 in your list, this seems to be (from what I can tell) how most gems work when the code has any heavy lifting to do. Rails plays nicely with the gem system, so if you need to incorporate this, probably alongside the other optimisations you have suggested here, it should be fine, although you may need to recompile for different platforms.