How to measure similarity between two Python code blocks?

Question

Many people measure code similarity to catch plagiarism. My goal is different: I want to cluster a set of Python code blocks (say, answers to the same programming question) into categories, so that I can distinguish the different approaches students took.

If you have any idea how this could be achieved, I would appreciate it if you shared it here.

Answer 1

One approach would be to count the number of functions, objects, and keywords, possibly grouped into categories such as branching, creating, manipulating, etc., along with the number of variables of each type. This does not rely on methods and variables being given the same names.

For a given problem, similar approaches will tend to produce similar scores on these counts. For example, a student who used a decision tree would have a high number of branch statements, while one who used a decision table would have far fewer.

This approach would be much quicker to implement than parsing the code structure and comparing the results.
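As a rough illustration of the counting idea, here is a minimal sketch that uses the standard `tokenize` module to count keywords by category and then compares two submissions with cosine similarity. The category names, the `CATEGORIES` grouping, and the choice of cosine similarity are assumptions made for this example, not something the answer prescribes.

```python
import io
import keyword
import math
import tokenize
from collections import Counter

# Hypothetical keyword categories; adjust them to the assignment at hand.
CATEGORIES = {
    "branching": {"if", "elif", "else"},
    "looping": {"for", "while", "break", "continue"},
    "defining": {"def", "class", "lambda"},
    "error_handling": {"try", "except", "finally", "raise"},
}

def feature_counts(source):
    """Count keyword categories, operator tokens and literals; names are ignored."""
    counts = Counter()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            for category, words in CATEGORIES.items():
                if tok.string in words:
                    counts[category] += 1
        elif tok.type == tokenize.OP:
            # OP covers operators and punctuation such as parentheses.
            counts["operators"] += 1
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            counts["literals"] += 1
    return counts

def cosine_similarity(a, b):
    """Cosine similarity of two Counters treated as sparse vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

tree_style = (
    "def grade(s):\n"
    "    if s >= 90:\n"
    "        return 'A'\n"
    "    elif s >= 80:\n"
    "        return 'B'\n"
    "    else:\n"
    "        return 'C'\n"
)
table_style = (
    "def grade(s):\n"
    "    table = [(90, 'A'), (80, 'B'), (0, 'C')]\n"
    "    for limit, letter in table:\n"
    "        if s >= limit:\n"
    "            return letter\n"
)

tree_features = feature_counts(tree_style)
table_features = feature_counts(table_style)
print(tree_features["branching"], table_features["branching"])
print(cosine_similarity(tree_features, table_features))
```

The resulting vectors could then be fed to any off-the-shelf clustering method; which one works best for student answers is an open question.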

Answer 2

You can choose any scheme you like that essentially hashes the contents of the code blocks, and place code blocks with identical hashes into the same category.

Of course, what turns out to be similar will then depend heavily on how you define the hashing function. For instance, a truly stupid hashing function H(code) == 0 will put everything in the same bin.

A hard problem is finding a hashing function that classifies code blocks in a way that matches a natural sense of similarity. Despite lots of research, nobody has yet found a better criterion for judging this than "I'll know they are similar when I see them".

You surely do not want the hash to depend on layout, indentation, whitespace, or comments; otherwise slight changes to these will classify blocks differently even if their semantic content is identical.
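One simple way to get a hash that ignores layout and comments in Python is to hash a dump of the parsed syntax tree instead of the raw text. The following is a minimal sketch under that assumption; `structural_hash` is a hypothetical helper name. Note that this hash still depends on identifier names and literal values, so it only groups blocks that are identical up to formatting.

```python
import ast
import hashlib

def structural_hash(source):
    """Hash the parsed structure of the code, ignoring layout and comments."""
    tree = ast.parse(source)
    # ast.dump omits line/column attributes by default, and the parser
    # has already discarded comments and insignificant whitespace.
    canonical = ast.dump(tree, annotate_fields=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

original = "x = 1  # set x\nif x:\n    print(x)\n"
reformatted = "x=1\nif x:\n        print( x )\n"
renamed = "y = 1\nif y:\n    print(y)\n"

print(structural_hash(original) == structural_hash(reformatted))
print(structural_hash(original) == structural_hash(renamed))
```

The second comparison differs because renaming `x` to `y` changes the dumped tree; the schemes listed below address that by treating identifiers as undifferentiated.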

There are three major schemes people have commonly used to find duplicated (or similar) code:

Metrics-based schemes, which compute the hash from a metric obtained by counting various types of operators and operands. (Note: this uses lexical tokens.) These often operate only at the function level. I know of no practical tools based on this.
Lexically based schemes, which break the input stream into lexemes, convert identifiers and literals into fixed special constants (i.e., treat them as undifferentiated), and then essentially hash N-grams (sequences of N tokens) over these sequences. There are many clone detectors based on essentially this idea; they work tolerably well, but also find stupid matches because nothing forces alignment with program structure boundaries. The sequence
```
  return ID; } void ID ( int ID ) { 
```

is an 11-gram which occurs frequently in C-like languages but clearly isn't a useful clone. The result is that false positives tend to occur, i.e., you get claimed matches where there isn't one.

Abstract syntax tree based matching (hashing over subtrees), which automatically aligns clones to language boundaries by virtue of using the ASTs, which represent the language structures directly. (I'm the author of the original paper on this, and built a commercial product, CloneDR, based on the idea; see my bio.) These tools have the advantage that they can match code that contains token sequences of different lengths in the middle of a match, e.g., one statement (of arbitrary size) replaced by another.
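To make the second, lexically based scheme concrete for Python, here is a minimal sketch: it tokenizes the source, replaces identifiers with `ID` and literals with `LIT`, collects N-grams into a (hash) set, and compares two sets with Jaccard similarity. The helper names, the default `n=5`, and the use of Jaccard similarity are assumptions for illustration; real clone detectors differ in many details.

```python
import io
import keyword
import tokenize

def normalized_tokens(source):
    """Token stream with identifiers and literals made undifferentiated."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank-line breaks carry no structure
        if tok.type == tokenize.NAME:
            tokens.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            tokens.append("LIT")
        elif tok.type == tokenize.OP:
            tokens.append(tok.string)
        else:
            # NEWLINE, INDENT, DEDENT, etc. become fixed symbols.
            tokens.append(tokenize.tok_name[tok.type])
    return tokens

def ngrams(tokens, n=5):
    """Set of all N-grams; Python's set hashes each tuple internally."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Share of N-grams the two sets have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0

loop_a = "total = 0\nfor x in data:\n    total += x\n"
loop_b = "s = 0  # running sum\nfor item in values:\n    s += item\n"
other = "print('hi')\n"

print(jaccard(ngrams(normalized_tokens(loop_a)), ngrams(normalized_tokens(loop_b))))
print(jaccard(ngrams(normalized_tokens(loop_a)), ngrams(normalized_tokens(other))))
```

Because the N-grams are cut without regard to statement boundaries, this sketch inherits the false-positive problem described above.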

This paper provides a survey of the various techniques: http://www.cs.usask.ca/~croy/papers/2009/RCK_SCP_Clones.pdf. It shows that AST-based clone detection tools appear to be the most effective at finding clones that people agree are similar blocks of code, which seems key to the OP's particular interest; see Table 14.
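For Python specifically, the standard `ast` module makes a toy version of subtree hashing easy to try. The sketch below hashes every subtree by node type only (so identifier names and literal values are ignored) and compares two programs by the overlap of their subtree hashes. It is an illustration under those assumptions, not the algorithm used by CloneDR or any tool in the survey; in particular it only finds exact subtree matches up to renaming, not near-miss clones. The names `subtree_hashes`, `similarity`, and `min_size` are hypothetical.

```python
import ast
import hashlib

def subtree_hashes(source, min_size=3):
    """Hashes of all AST subtrees with at least min_size nodes."""
    hashes = set()

    def visit(node):
        results = [visit(child) for child in ast.iter_child_nodes(node)]
        size = 1 + sum(child_size for _, child_size in results)
        # Label uses only the node type and the child hashes, so names
        # and constant values do not influence the result.
        label = type(node).__name__ + "(" + ",".join(h for h, _ in results) + ")"
        digest = hashlib.sha1(label.encode("utf-8")).hexdigest()
        if size >= min_size:
            hashes.add(digest)
        return digest, size

    visit(ast.parse(source))
    return hashes

def similarity(source_a, source_b):
    """Jaccard overlap of the two programs' subtree hash sets."""
    a, b = subtree_hashes(source_a), subtree_hashes(source_b)
    return len(a & b) / len(a | b) if a | b else 0.0

loop_1 = "def total(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t\n"
loop_2 = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
builtin = "def total(xs):\n    return sum(xs)\n"

print(similarity(loop_1, loop_2))
print(similarity(loop_1, builtin))
```

A pairwise similarity matrix built this way could serve as input to a clustering step; tuning `min_size` trades sensitivity to small shared fragments against noise.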

[There are graph-based schemes that match control and data flow graphs. They should arguably produce even better matches, but apparently do not do much better in practice.]
