Get AST from .NET assembly without source code (IL code)

I'd like to analyze .NET assemblies so that the analysis is independent of the source language, whether that is C#, VB.NET, or anything else. I know about Roslyn and NRefactory, but they seem to work only at the C# source code level. There is also the "Common Compiler Infrastructure: Code Model and AST API" project on CodePlex, which claims that it "supports a hierarchical object model that represents code blocks in a language-independent structured form", which sounds like exactly what I am looking for. However, I am unable to find any useful documentation or code that actually does this.

Any advice on how to achieve this? Could Mono.Cecil perhaps do something here?

You can do this, and there is one (admittedly tiny) example of it in the ILSpy source code:

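// Namespaces (ILSpy 2.x era): Mono.Cecil, ICSharpCode.Decompiler, ICSharpCode.Decompiler.Ast
var assembly = AssemblyDefinition.ReadAssembly("path/to/assembly.dll");
var astBuilder = new AstBuilder(new DecompilerContext(assembly.MainModule));
// Decompile the whole assembly into the builder's syntax tree.
astBuilder.AddAssembly(assembly);
astBuilder.SyntaxTree...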

The CCI Code Model sits somewhere between an IL disassembler and a full C# decompiler: it gives your code some structure (e.g. if statements and expressions), but it also contains some low-level stack operations like push and pop.

CCI contains a sample that shows this: PeToText.

For example, to get the Code Model for the first method of the Program type (in the global namespace), you could use code like this:

using System.Linq;
using Microsoft.Cci;
using Microsoft.Cci.ILToCodeModel;

string fileName = "whatever.exe";

using (var host = new PeReader.DefaultHost())
{
    var module = (IModule)host.LoadUnitFrom(fileName);
    var type = (ITypeDefinition)module.UnitNamespaceRoot.Members
        .Single(m => m.Name.Value == "Program");
    var method = (IMethodDefinition)type.Members.First();
    var methodBody = new SourceMethodBody(method.Body, host, null, null);
}

To demonstrate, if you decompile the above code and print it using PeToText, you get:

Microsoft.Cci.ITypeDefinition local_3;
Microsoft.Cci.ILToCodeModel.SourceMethodBody local_5;
string local_0 = "C:\\code\\tmp\\nuget tmp 2015\\bin\\Debug\\nuget tmp 2015.exe";
Microsoft.Cci.PeReader.DefaultHost local_1 = new Microsoft.Cci.PeReader.DefaultHost();
try
{
    push (Microsoft.Cci.IModule)local_1.LoadUnitFrom(local_0).UnitNamespaceRoot.Members;
    push Program.<>c.<>9__0_0;
    if (dup == default(System.Func<Microsoft.Cci.INamespaceMember, bool>))
    {
        pop;
        push Program.<>c.<>9.<Main0>b__0_0;
        Program.<>c.<>9__0_0 = dup;
    }
    local_3 = (Microsoft.Cci.ITypeDefinition)System.Linq.Enumerable.Single<Microsoft.Cci.INamespaceMember>(pop, pop);
    local_5 = new Microsoft.Cci.ILToCodeModel.SourceMethodBody((Microsoft.Cci.IMethodDefinition)System.Linq.Enumerable.First<Microsoft.Cci.ITypeDefinitionMember>(local_3.Members).Body, local_1, (Microsoft.Cci.ISourceLocationProvider)null, (Microsoft.Cci.ILocalScopeProvider)null, 0);
}
finally
{
    if (local_1 != default(Microsoft.Cci.PeReader.DefaultHost))
    {
        local_1.Dispose();
    }
}

Of note are all those push, pop and dup statements, as well as the lambda caching condition.
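
Those leftovers exist because IL is a stack machine, and the Code Model folds only part of the stack traffic into expressions. As a rough illustration of the folding step that a fuller decompiler performs, here is a minimal sketch that simulates the evaluation stack over a made-up mini language with three instructions (`ldc`, `add`, `mul`). The `StackFolder` class and its textual instruction format are hypothetical; they are not part of CCI or ILSpy, and real IL needs many more cases (branches, calls, `dup`, locals).

```csharp
using System;
using System.Collections.Generic;

public static class StackFolder
{
    // Hypothetical mini instruction set: "ldc <n>", "add", "mul".
    // Each instruction is simulated against a stack of expression strings,
    // so stack pushes and pops turn into nested expressions.
    public static string Fold(IEnumerable<string> instructions)
    {
        var stack = new Stack<string>();
        foreach (var ins in instructions)
        {
            var parts = ins.Split(' ');
            switch (parts[0])
            {
                case "ldc":
                    // A constant load pushes a leaf expression.
                    stack.Push(parts[1]);
                    break;
                case "add":
                case "mul":
                    // Binary operators pop two operands (right first) and push the combined node.
                    var right = stack.Pop();
                    var left = stack.Pop();
                    var op = parts[0] == "add" ? "+" : "*";
                    stack.Push("(" + left + " " + op + " " + right + ")");
                    break;
                default:
                    throw new NotSupportedException(ins);
            }
        }
        // For a well-formed sequence, one expression remains on the stack.
        return stack.Pop();
    }
}
```

For example, `StackFolder.Fold(new[] { "ldc 2", "ldc 3", "add", "ldc 4", "mul" })` returns `((2 + 3) * 4)`. When a value is still on the stack at a point where control flow merges, as in the lambda caching branch above, this simple folding no longer applies, which is presumably why explicit push and pop statements survive in the output.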

As far as I know, it's not possible to build an AST from a binary (without sources), since the AST is generated by the parser as part of compiling the sources. Mono.Cecil won't help, because it only lets you read and modify opcodes and metadata; it does not analyze the assembly for you.

But since this is .NET, you can dump the IL code from a DLL with the help of ildasm. You can then pass the generated sources to any parser with a CIL grammar hooked up and get an AST from the parser. The problem is that, as far as I know, there is only one publicly available CIL grammar for a parser generator, so you don't really have a choice. And ECMA-335 is large enough that writing your own grammar is a bad idea. So I can suggest only one solution:

  1. Pass the assembly to ildasm.exe to get CIL.
  2. Then pass the CIL to an ANTLR v3 parser with this CIL grammar wired up (note that it is a little outdated: the grammar was created in 2004 and the latest CIL specification is from 2006, but CIL doesn't really change much).
  3. After that you can freely access the AST generated by ANTLR.

Note that you will need ANTLR v3, not v4, since the grammar was written for version 3, and it is hardly possible to port it to v4 without good knowledge of ANTLR syntax.

You can also try looking into the sources of Microsoft's new RyuJIT compiler on GitHub (part of CoreCLR). I'm not sure it helps, but in theory it must contain CIL grammar and parser implementations, since it works with CIL code. However, it is written in C++, has an enormous code base, and lacks documentation because it is under active development, so it may be easier to stick with ANTLR.

If you treat the .NET binary file as a stream of bytes, you ought to be able to "parse" it just fine.

You simply write a grammar whose tokens are essentially bytes. You can certainly build a classical lexer/parser with almost any set of lexer/parser tools by defining the lexer to read single bytes as tokens.
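
The byte-level lexing step might look roughly like the following sketch. It covers only the IL bytes of a single method body, not the PE headers or metadata tables of the whole file, and the `IlLexer` class name is an assumption made for illustration. The opcode table comes from the standard `System.Reflection.Emit.OpCodes` fields, and the operand sizes follow each opcode's `OperandType`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Reflection.Emit;

public static class IlLexer
{
    // Opcode table keyed by encoded value, built by reflecting over OpCodes' static fields.
    private static readonly Dictionary<ushort, OpCode> Table =
        typeof(OpCodes).GetFields(BindingFlags.Public | BindingFlags.Static)
            .Where(f => f.FieldType == typeof(OpCode))
            .Select(f => (OpCode)f.GetValue(null))
            .ToDictionary(op => unchecked((ushort)op.Value));

    // Turns raw method-body bytes into a list of opcode names, skipping operands.
    public static List<string> Tokenize(byte[] il)
    {
        var names = new List<string>();
        int pos = 0;
        while (pos < il.Length)
        {
            ushort code = il[pos++];
            // Two-byte opcodes start with the 0xFE prefix byte.
            if (code == 0xFE) code = (ushort)(0xFE00 | il[pos++]);
            OpCode op = Table[code];
            names.Add(op.Name);
            pos += OperandSize(op, il, pos);
        }
        return names;
    }

    // Number of operand bytes that follow the opcode.
    private static int OperandSize(OpCode op, byte[] il, int pos)
    {
        switch (op.OperandType)
        {
            case OperandType.InlineNone: return 0;
            case OperandType.ShortInlineBrTarget:
            case OperandType.ShortInlineI:
            case OperandType.ShortInlineVar: return 1;
            case OperandType.InlineVar: return 2;
            case OperandType.InlineI8:
            case OperandType.InlineR: return 8;
            case OperandType.InlineSwitch:
                // A 4-byte case count followed by that many 4-byte jump targets.
                return 4 + 4 * BitConverter.ToInt32(il, pos);
            default:
                return 4; // metadata tokens, int32, float32 and long branch targets
        }
    }
}
```

For instance, `IlLexer.Tokenize(new byte[] { 0x00, 0x17, 0x2A })` yields `nop`, `ldc.i4.1`, `ret`. For a method that is already loaded, `MethodBase.GetMethodBody().GetILAsByteArray()` can supply the input bytes; locating method bodies inside a file on disk would additionally require parsing the metadata tables.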

You can then build the AST using the standard AST-building machinery of the parsing engine (on your own for YACC, automatically with ANTLR4).

What you will discover, of course, is that "parsing" isn't enough; you'll still need to build symbol tables and carry out control and data flow analyses if you are going to do serious analysis of the corresponding code. See my essay on LifeAfterParsing.

You will also likely have to take into account "distinguished" functions that provide key runtime facilities to the particular programming languages that actually generated the CIL code. These will make your analyzers language-dependent. Yes, you still get to share the part of the analysis that works on generic CIL.
