Embed a function's assembly code in a struct

I have a rather special question: is it possible in C/C++ (both, because I am sure the question is the same in both languages) to specify a function's location? Why? I have a very large list of function pointers, and I want to eliminate them.

Currently it looks like this (repeated something like a million times, stored in the user's RAM):

struct {
    int i;
    void(* funptr)();
} test;

Because I know that in most assembly languages, functions are just "goto" directives, I had the following idea. Is it possible to optimize the above construct so that it looks like this?

struct {
    int i;
    // embed the assembler of the function here
    // so that all the functions
    // instructions are located here
    // like this: mov rax, rbx
    // jmp _start ; just demo code
} test2;

In the end, it should look like this in memory: an int holding any value, followed by the function's assembly code, referenced by test2. I should be able to call these functions like this: ((void(*)()) (&pointerToTheStruct + sizeof(int)))();

You might think that I'm insane to optimize the app that way, and I cannot disclose any more details on its function, but if anyone has some pointers on how to solve this problem, I would appreciate it. I do not think that there is a standard way to do this, so any hacky way to do it via inline assembler or other crazy things is also appreciated!

The only thing you really have to do is make the compiler aware of the (constant) value of the function pointer you want in the struct. The compiler will then (presumably/hopefully) inline that function call wherever it sees it called through that function pointer:

template<void(*FPtr)()>
struct function_struct {
    int i;
    static constexpr auto funptr = FPtr;
};

void testFunc()
{
    volatile int x = 0;
}

using test = function_struct<testFunc>;

int main()
{
    test::funptr();
}

Demo - no call or jmp after optimization.

It remains unclear what the point of the int i is. Note that the code is not technically "directly after the i" here, but it is even more unclear what you'd expect instances of the struct to look like (is the code in them, or is it "static" in a way? I feel there is some misunderstanding on your part about what compilers actually produce...). But consider the ways that compiler inlining can help you and you might find the solution you need. If you're worried about executable size after inlining, tell the compiler and it will compromise between speed and size.

This sounds like a terrible idea for a lot of reasons: it probably won't save memory, and it will hurt performance by diluting L1I-cache with data and L1D-cache with code. It gets worse if you ever modify or copy objects: self-modifying code stalls.

But yes, this would be possible in C99/C11 with a flexible array member at the end of the struct, which you cast to a function pointer.

struct int_with_code {
    int i;
    char code[];   // C99 flexible array member.  GNU extension in C++
                   // Store machine code here
                   // you can't get the compiler to do this for you.  Good Luck!
};

void foo(struct int_with_code *p) {
    // explicit C-style cast compiles as both C and C++
    void (*funcp)(void) = ( void (*)(void) ) p->code;
    funcp();
}

Compiler output from clang7.0 on the Godbolt compiler explorer is the same when compiled as either C or C++. This is targeting the x86-64 System V ABI, where the first function arg is passed in RDI.

# this is the code that *uses* such an object, not the code that goes in its code[]
# This proves that it compiles,
#  without showing any way to get compiler-generated code into code[]
foo:                                    # @foo
    add     rdi, 4         # move the pointer 4 bytes forward, to point at code[]
    jmp     rdi                     # TAILCALL

(If you leave out the (void) arg-type declaration in C, the compiler will zero AL first in the x86-64 SysV calling convention, in case it's actually a variadic function, because it's passing no FP args in registers.)


You'd have to allocate your objects in memory that is executable (normally not done unless they're const with static storage), e.g. compile with gcc -zexecstack. Or use a custom mmap/mprotect on POSIX, or VirtualAlloc/VirtualProtect on Windows.
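A minimal sketch of the mmap/mprotect route, assuming Linux on x86-64 and a GNU-compatible compiler. The helper name run_embedded_code and the six hand-assembled bytes (mov eax, 42 / ret) are illustrative assumptions, not something from the question. On any other platform the function just returns -1.

```cpp
#include <cstddef>
#include <cstring>
#if defined(__linux__) && defined(__x86_64__)
#include <sys/mman.h>
#endif

// Hypothetical helper: builds one "int followed by machine code" object in
// freshly mapped memory, runs the code, and returns its result.
// Returns -1 if the platform is unsupported or a mapping call fails.
int run_embedded_code() {
#if defined(__linux__) && defined(__x86_64__)
    // Hand-assembled x86-64 (assumption for illustration): mov eax, 42 ; ret
    static const unsigned char machine_code[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};
    const std::size_t len = sizeof(int) + sizeof machine_code;

    // Map the memory writable first; flip it to executable after filling it.
    void* mem = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;

    unsigned char* bytes = static_cast<unsigned char*>(mem);
    const int i = 7;  // the object's int member
    std::memcpy(bytes, &i, sizeof i);
    std::memcpy(bytes + sizeof(int), machine_code, sizeof machine_code);

    // GNU builtin: keeps the compiler from treating the stores as dead.
    __builtin___clear_cache(reinterpret_cast<char*>(bytes),
                            reinterpret_cast<char*>(bytes + len));

    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return -1;
    }

    // Object-pointer to function-pointer cast: not guaranteed by ISO C++,
    // but accepted on GNU targets.
    int (*fn)() = reinterpret_cast<int (*)()>(bytes + sizeof(int));
    const int result = fn();
    munmap(mem, len);
    return result;
#else
    return -1;
#endif
}
```

Mapping a whole page for a 10-byte object is obviously wasteful; a real version would pack many objects into one mapping.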

Or if your objects are all statically allocated, it might be possible to massage compiler output to turn functions in the .text section into objects by adding an int member right before each one. Maybe with some .section and linker tricks, and maybe a linker script, you could even somehow automate it.

But unless they're all the same length (e.g. with padding like char code[60]), that won't form an array you can index, so you'll need some way of referencing all these variable-length objects.

There are potentially huge performance downsides if you ever modify an object before calling its function: on x86 you'll get a self-modifying-code pipeline nuke for executing code near a just-written memory location.

Or if you copied an object before calling its function: on x86 that's a pipeline flush, and on other ISAs you need to manually flush caches to get the I-cache in sync with D-cache (so the newly-written bytes can be executed). But you can't copy such objects because their size isn't stored anywhere. You can't search the machine code for a ret instruction, because a 0xc3 byte might appear somewhere that's not the start of an x86 instruction. Or on any ISA, the function might have multiple ret instructions (tail duplication optimization). Or end with a jmp instead of a ret (tailcall). Storing a size would start to defeat the purpose of saving size, eating up at least an extra byte in each object.
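To make the 0xc3 problem concrete, here is a small sketch. The byte sequence is a made-up function body (mov eax, 0xC3 / ret) and naive_length is a hypothetical helper; both are assumptions for illustration.

```cpp
#include <cstddef>

// Made-up function body: mov eax, 0xC3 ; ret
// 0xB8 is "mov eax, imm32", and the immediate's low byte happens to be 0xC3.
static const unsigned char body_with_c3_immediate[] = {
    0xB8, 0xC3, 0x00, 0x00, 0x00,  // mov eax, 0x000000C3
    0xC3                           // ret (the real end of the function)
};

// Naive length guess: stop right after the first byte equal to 0xC3.
// Returns 0 if no such byte is found within max_len bytes.
std::size_t naive_length(const unsigned char* code, std::size_t max_len) {
    for (std::size_t n = 0; n < max_len; ++n) {
        if (code[n] == 0xC3) return n + 1;
    }
    return 0;
}
```

The scan stops inside the mov's immediate and reports 2 bytes, while the function is really 6 bytes long, so a copy based on it would truncate the code.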

Writing code to an object at runtime, then casting to a function pointer, is undefined behaviour in ISO C and C++. On GNU C/C++, make sure you call __builtin___clear_cache on it to sync caches or whatever else is necessary. Yes, this is needed even on x86, to disable dead-store elimination optimizations: see this test case. On x86 it's just a compile-time thing, no extra asm. It doesn't actually clear any caches.

If you do copy at runtime startup, maybe allocate a big chunk of memory and carve out variable-length chunks of it while copying. If you malloc each one separately, you're paying memory-management overhead for every object.
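A rough sketch of the carving idea, assuming a simple bump allocator. The CodeArena type and its add function are hypothetical names, and a std::vector stands in for the big chunk, so this memory is not executable. Only the layout logic is shown.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical arena: one big allocation, handed out front to back.
struct CodeArena {
    std::vector<unsigned char> storage;
    std::size_t used = 0;

    explicit CodeArena(std::size_t capacity) : storage(capacity) {}

    // Copies `i` followed by `len` code bytes into the arena and returns the
    // record's byte offset, or (size_t)-1 if it does not fit. Each record
    // starts at a multiple of alignof(int) so the int can be read in place.
    std::size_t add(int i, const unsigned char* code, std::size_t len) {
        const std::size_t align = alignof(int);
        const std::size_t start = (used + align - 1) / align * align;
        if (start + sizeof(int) + len > storage.size()) {
            return static_cast<std::size_t>(-1);
        }
        std::memcpy(storage.data() + start, &i, sizeof i);
        std::memcpy(storage.data() + start + sizeof(int), code, len);
        used = start + sizeof(int) + len;
        return start;
    }
};
```

Note that keeping each int aligned costs up to 3 bytes of padding between records, and you still need a separate list of offsets (or pointers) to find the records again.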


This idea will not save you memory unless you have about as many functions as you have objects

Normally you have a fairly limited number of actual functions, with many objects having copies of the same function pointer. (You've kind of hand-rolled C++ virtual functions, but with only one function you just have a function pointer directly instead of a vtable pointer to a table of pointers for that class type. One fewer level of indirection, and apparently you're not passing the object's own address to the function.)

One of the several benefits of this level of indirection is that one pointer is usually significantly smaller than the entire code for a function. For that to not be the case, your functions would have to be tiny.

Example: with 10 different functions of 32 bytes each, and 1000 objects with function pointers, you have a total of 320 bytes of code (which will stay hot in I-cache), and 8000 bytes of function pointers. (And in your objects, another 4 bytes per object wasted on padding to align the pointer, making the total size 16 instead of 12 bytes per object.) Anyway, that's 16320 bytes total for entire structs + code. If you allocated each object separately, there's per-object bookkeeping on top of that.

With machine code inlined into each object, and no padding, that's 1000 * (4+32) = 36000 bytes, over twice the total size.
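The same arithmetic, written out so the compiler checks it. The constant names are made up for this sketch; the numbers are the ones from the example (x86-64, 8-byte pointers, 4-byte int).

```cpp
#include <cstddef>

constexpr std::size_t kFunctions = 10;
constexpr std::size_t kCodeBytes = 32;   // machine code per function
constexpr std::size_t kObjects   = 1000;

// Pointer layout: int (4) + padding (4) + pointer (8) = 16 bytes per object,
// plus one shared copy of each function's code.
constexpr std::size_t kWithPointers = kObjects * 16 + kFunctions * kCodeBytes;

// Embedded layout: int (4) + a private copy of the code, no padding.
constexpr std::size_t kWithEmbeddedCode = kObjects * (4 + kCodeBytes);

static_assert(kWithPointers == 16320, "structs + shared code");
static_assert(kWithEmbeddedCode == 36000, "structs with code inlined");
static_assert(kWithEmbeddedCode > 2 * kWithPointers, "over twice the size");
```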

x86-64 is probably a best-case scenario, where a pointer is 8 bytes and x86-64 machine code uses a (famously complex) variable-length instruction encoding which allows for high code density in some cases, especially when optimizing for code-size. (e.g. code-golfing. https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code ). But unless your functions are mostly something trivial like lea eax, [rdi + rdi*2] (3 bytes = opcode + ModRM + SIB) / ret (1 byte), they're still going to take more than 8 bytes. (That's return x*3; for a function that takes a 32-bit integer x arg, in the x86-64 System V ABI.)

If they're wrappers for larger functions, a normal call rel32 instruction is 5 bytes. A load of static data is at least 6 bytes: opcode + modrm + rel32 for a RIP-relative addressing mode. (Loading EAX specifically can use the special no-modrm encoding for an absolute address, but in x86-64 that's a 64-bit absolute unless you use an address-size prefix too, potentially causing an LCP stall in the decoders on Intel. mov eax, [32 bit absolute address] = addr32 (0x67) + opcode + abs32 = 6 bytes again, so this is worse for no benefit.)

Your function-pointer type doesn't have any args (assuming this is C++ where foo() means foo(void) in a declaration, not like old C where an empty arg list is somewhat similar to (...)). Thus we can assume you're not passing args, so to do anything useful the functions are probably accessing some static data or making another call.


Ideas that make more sense:

  • Use an ILP32 ABI like Linux x32, where the CPU runs in 64-bit mode but your code uses 32-bit pointers. This would make each of your objects only 8 bytes instead of 16. Avoiding pointer-bloat is a classic use-case for x32 or ILP32 ABIs in general.

    Or (yuck) compile your code as 32-bit. But then you have obsolete 32-bit calling conventions that pass args on the stack instead of in registers, fewer than half the registers, and much higher overhead for position-independent code. (No EIP/RIP-relative addressing.)

  • Store an unsigned int table index to a table of function pointers. If you have 100 functions but 10k objects, the table is only 100 pointers long. In asm you could index an array of code directly (computed goto style) if all the functions were padded to the same length, but in C++ you can't do that. An extra level of indirection with a table of function pointers is probably your best bet.

e.g.

void (*const fptrs[])(void) = {
    func1, func2, func3, ...
};

struct int_with_func {
    int i;
    unsigned f;
};

void bar(struct int_with_func *p) {
    fptrs[p->f] ();
}

clang/gcc -O3 output:

 bar(int_with_func*):
    mov     eax, dword ptr [rdi + 4]            # load p->f
    jmp     qword ptr [8*rax + fptrs] # TAILCALL    # index the global table with it for a memory-indirect jmp

If you were compiling a shared library, PIE executable, or not targeting Linux, the compiler couldn't use a 32-bit absolute address to index a static array with one instruction. So there'd be a RIP-relative LEA in there and something like jmp [rcx+rax*8].

This is an extra level of indirection vs. storing a function pointer in each object, but it lets you shrink each object to 8 bytes, down from 16, like using 32-bit pointers. Or to 5 or 6 bytes, if you use an unsigned short or uint8_t and pack the structs with __attribute__((packed)) in GNU C.
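A sketch of the 5-byte variant, assuming GCC or clang (the packed attribute is a GNU extension). The names func_a, func_b, small_table, and call_packed are made up for illustration, and the counters exist only so the effect of a call is observable.

```cpp
#include <cstdint>

static int calls_a = 0;
static int calls_b = 0;
static void func_a() { ++calls_a; }
static void func_b() { ++calls_b; }

// Shared table: one pointer per function, not one per object.
static void (*const small_table[])() = { func_a, func_b };

// No padding after the 1-byte index, so the size is sizeof(int) + 1.
struct __attribute__((packed)) int_with_small_index {
    int i;
    std::uint8_t f;   // index into small_table; allows up to 256 functions
};
static_assert(sizeof(int_with_small_index) == sizeof(int) + 1,
              "packed to int + 1 byte");

void call_packed(const int_with_small_index* p) {
    small_table[p->f]();
}
```

In an array of these, most of the int members end up misaligned, so you trade some access cost for the smaller size.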

No, not really.

The way to specify a function's location is to use a function pointer, which you're already doing.

You could make different types which have their own different member functions, but then you're back to the original problem.

I have in the past experimented with auto-generating (as a pre-build step, using Python) a function with a long switch statement that does the work of mapping int i to a normal function call. This gets rid of the function pointers, at the expense of branching. I don't remember whether it ended up being worthwhile in my case and, even if I did, that wouldn't tell us whether it's worthwhile in yours.
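For illustration, the generated function might look roughly like this. The handler names and return values are invented, and the handlers return int (rather than void, as in the question) only so the result of a call is visible.

```cpp
// Hypothetical targets; a generator script would emit one case per function.
static int handle_zero() { return 10; }
static int handle_one()  { return 11; }
static int handle_two()  { return 12; }

// Generated dispatcher: plain direct calls that the compiler can see through
// and inline, with no function pointer stored anywhere.
int dispatch_by_switch(int i) {
    switch (i) {
        case 0:  return handle_zero();
        case 1:  return handle_one();
        case 2:  return handle_two();
        default: return -1;   // unknown index
    }
}
```

With dense case values the compiler is free to turn the switch into a jump table anyway, so the branching cost depends on what it decides to emit.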

Because I know that in most assembly languages, functions are just "goto" directives

Well, it's perhaps a little more complicated than that…

You might think that I'm insane to optimize the app that way

Perhaps. Trying to eliminate indirection is not, in itself, a bad thing, so I don't think you're wrong to try to improve this. I just don't think that you necessarily can.

but if anyone has some pointers

lol

I don't understand the goal of this "optimization". Is it about saving memory?

I might be misunderstanding the question, but if you just replace your function pointer with a regular member function, then your struct contains only the int as data, and the function's address is supplied by the compiler wherever you call it or take its address, instead of being stored in memory.

So just do

struct {
    int i;
    void func();
} test;  

Then sizeof(test)==sizeof(int) should hold true if you set alignment/packing to be tight.
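A quick way to check that claim, as a sketch. The struct is given a name here (test_with_member, an assumption) so the member function can have a body.

```cpp
struct test_with_member {
    int i;
    void func() { ++i; }   // the code exists once in the program, not per object
};

// A non-virtual member function adds no per-object storage.
static_assert(sizeof(test_with_member) == sizeof(int),
              "only the int is stored in each object");
```

This only covers the case where every object calls the same function. If different objects need different functions, you are back to storing something per object to tell them apart.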
