
How to compile and dump assembly for a c library (string.h)?

For a school project I have to do a large amount of string manipulation in assembly. Since this is a pain to do, I was trying to come up with innovative ways to reuse string operations that have already been written. My idea is to compile and dump the assembly from the string.h library in C. Then I would copy and paste the dumped assembly into my program. After figuring out the memory location of each function and its parameters, I figure I would essentially be able to call the function.

To dump the assembly I first wrote a program that included the libraries I wanted:

#include <stdio.h>
#include <string.h>

int main() {

  return 0;
}

Then I compiled and dumped the assembly using

gcc -o lib lib.c
objdump -d lib

When I looked at the output I noticed that it didn't include any of the assembly for the libraries. My guess is that either a compiler optimization drops unused functions, or the library output is hidden when I use objdump:

lib:    file format Mach-O 64-bit x86-64

Disassembly of section __TEXT,__text:
__text:
100000fa0:  55  pushq   %rbp
100000fa1:  48 89 e5    movq    %rsp, %rbp
100000fa4:  31 c0   xorl    %eax, %eax
100000fa6:  c7 45 fc 00 00 00 00    movl    $0, -4(%rbp)
100000fad:  5d  popq    %rbp
100000fae:  c3  retq

_main:
100000fa0:  55  pushq   %rbp
100000fa1:  48 89 e5    movq    %rsp, %rbp
100000fa4:  31 c0   xorl    %eax, %eax
100000fa6:  c7 45 fc 00 00 00 00    movl    $0, -4(%rbp)
100000fad:  5d  popq    %rbp
100000fae:  c3  retq

As a side note I am running OSX Catalina, but I can switch to Ubuntu or a different OS if it would be easier.

How can I go about dumping the asm for the string.h library?

First, let me start by saying that this is really an XY problem.

My idea is to compile and dump the assembly from the string.h library in C. Then I would copy paste the dumped assembly into my program.

You should not do that. The standard library functions are meticulously optimized, need to be treated with care, and are very, very complicated. In other words, they are basically useless for educational purposes if you're learning assembly.

You should really just write your favorite implementation in C and then compile it.
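As a rough sketch of what that could look like (my_strlen and my_strcpy are names made up here, and these are plain teaching versions, not what libc does), two string.h functions fit in a few lines of C:

```c
#include <stddef.h>

/* Hypothetical teaching versions; NOT the libc implementations. */

/* Count bytes up to, but not including, the terminating '\0'. */
size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Copy src, including its terminating '\0', into dst and return dst.
   The caller must make sure dst is large enough. */
char *my_strcpy(char *dst, const char *src)
{
    char *d = dst;
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}
```

Compiling a file like this with gcc -S and a low optimization level such as -O1 writes the generated assembly to a .s file, which should be far easier to read than the library's version.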


A header file (such as string.h) usually does not contain function definitions. It only contains their declarations. The real functions are already compiled into a dynamic library object installed on your system (that is the library itself).
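To make the declaration/definition split concrete, here is a small sketch (length_via_libc is a made-up name) that never includes string.h at all. It declares strlen by hand with the standard prototype, which is essentially what the header does for you, and the definition is still found in the C library at link time:

```c
#include <stddef.h>   /* only for size_t */

/* Hand-written declaration with the standard prototype.
   No definition of strlen appears anywhere in this file. */
size_t strlen(const char *s);

/* Hypothetical helper: the call below is resolved by the linker
   against the already-compiled C library. */
size_t length_via_libc(const char *s)
{
    return strlen(s);
}
```

This is also why disassembling a program that merely includes the header shows none of the library's code: the program only ever contained the declaration.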

When you compile a program, the compiler automatically links it to the standard C library. According to this answer, in OS X the standard library should be located at /usr/lib/libSystem.B.dylib. On Ubuntu, it's usually /lib/x86_64-linux-gnu/libc.so.6. The following applies to both platforms without a problem.

If you want to take a look at the disassembly of a particular library function, you can run objdump on the library, pipe it into less, and then search for the function name:

$ objdump -d /usr/lib/libSystem.B.dylib | less

When inside less, you can search by typing / followed by the name of the function, then hit Enter and use n or N to navigate through matches.

Alternatively, you could dump the output of objdump to a file and inspect it with a text editor:

$ objdump -d /usr/lib/libSystem.B.dylib > libSystem.disasm

The problem with this approach is that the standard library often uses different, more complicated internal symbol names for standard functions than the ones you see in the headers. For instance, on Linux the symbol in libc that corresponds to printf is actually __printf. See here for example.

You can find the real symbol name of a standard library function by compiling a program that uses it and looking at the disassembled code, for example:

#include <string.h>
#include <stdio.h>

int main(void) {
    char s[100];

    scanf("%99s", s);
    size_t len = strlen(s);

    return 0;
}

Then run:

$ gcc prog.c
$ objdump -d a.out
...
0000000000000720 <main>:
 720:   55                      push   %rbp
 721:   48 89 e5                mov    %rsp,%rbp
 724:   48 83 ec 70             sub    $0x70,%rsp
 728:   48 8d 45 90             lea    -0x70(%rbp),%rax
 72c:   48 89 c6                mov    %rax,%rsi
 72f:   48 8d 3d ae 00 00 00    lea    0xae(%rip),%rdi        # 7e4 <_IO_stdin_used+0x4>
 736:   b8 00 00 00 00          mov    $0x0,%eax
 73b:   e8 90 fe ff ff          callq  5d0 <__isoc99_scanf@plt>
 740:   48 8d 45 90             lea    -0x70(%rbp),%rax
 744:   48 89 c7                mov    %rax,%rdi
 747:   e8 74 fe ff ff          callq  5c0 <strlen@plt>
 74c:   48 89 45 f8             mov    %rax,-0x8(%rbp)
 750:   b8 00 00 00 00          mov    $0x0,%eax
 755:   c9                      leaveq
 756:   c3                      retq
 757:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
 75e:   00 00

And you can see that in my case scanf is actually __isoc99_scanf, while strlen is unchanged.

I can then look up the disassembly of strlen, which on my system (Ubuntu) is the following:

$ objdump -d /lib/x86_64-linux-gnu/libc.so.6 | less
...
0000000000080650 <strlen@@GLIBC_2.2.5>:
   80650:       66 0f ef c0             pxor   %xmm0,%xmm0
   80654:       66 0f ef c9             pxor   %xmm1,%xmm1
   80658:       66 0f ef d2             pxor   %xmm2,%xmm2
   8065c:       66 0f ef db             pxor   %xmm3,%xmm3
   80660:       48 89 f8                mov    %rdi,%rax
   80663:       48 89 f9                mov    %rdi,%rcx
   80666:       48 81 e1 ff 0f 00 00    and    $0xfff,%rcx
   8066d:       48 81 f9 cf 0f 00 00    cmp    $0xfcf,%rcx
   80674:       77 6a                   ja     806e0 <strlen@@GLIBC_2.2.5+0x90>
   80676:       f3 0f 6f 20             movdqu (%rax),%xmm4
   8067a:       66 0f 74 e0             pcmpeqb %xmm0,%xmm4
   8067e:       66 0f d7 d4             pmovmskb %xmm4,%edx
   80682:       85 d2                   test   %edx,%edx
   80684:       74 04                   je     8068a <strlen@@GLIBC_2.2.5+0x3a>
   80686:       0f bc c2                bsf    %edx,%eax
   80689:       c3                      retq
   8068a:       48 83 e0 f0             and    $0xfffffffffffffff0,%rax
   8068e:       66 0f 74 48 10          pcmpeqb 0x10(%rax),%xmm1
   80693:       66 0f 74 50 20          pcmpeqb 0x20(%rax),%xmm2
   80698:       66 0f 74 58 30          pcmpeqb 0x30(%rax),%xmm3
   8069d:       66 0f d7 d1             pmovmskb %xmm1,%edx
   806a1:       66 44 0f d7 c2          pmovmskb %xmm2,%r8d
   806a6:       66 0f d7 cb             pmovmskb %xmm3,%ecx
   806aa:       48 c1 e2 10             shl    $0x10,%rdx
   ...
   ...

As you can see, even such a simple function is a jungle of complicated instructions that seems almost impossible to follow, because of the numerous optimizations and manual tuning applied by the glibc authors over the years.
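For what it's worth, a few of those lines can be decoded by hand. The and $0xfff / cmp $0xfcf / ja sequence near the top only tests where the pointer sits inside a 4 KiB page. In C it would read roughly like the sketch below. The name offset_past_0xfcf is made up, and the idea that this guards the wide loads that follow against running into the next page is my reading of the listing, not something taken from the glibc source:

```c
#include <stdint.h>

/* Hypothetical C rendering of:
       mov %rdi,%rcx
       and $0xfff,%rcx
       cmp $0xfcf,%rcx
       ja  ...
   Returns 1 when the branch would be taken, 0 otherwise. */
int offset_past_0xfcf(const void *p)
{
    uintptr_t offset = (uintptr_t)p & 0xfff;  /* offset within a 4 KiB page */
    return offset > 0xfcf;                    /* ja is an unsigned "above" */
}
```

Even this tiny piece shows why the listing is hard going: three instructions encode an assumption about memory layout that the C-level description of strlen never mentions.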

What you're proposing is a bad idea because production library code is compiled with optimization turned way up. Optimized code is not impossible to understand, but it can be complicated. Why? gcc, for example, will often choose vector instructions that you probably don't want or need to learn about. It will unroll simple loops into long stretches of repetitive code. It will rearrange instructions in unintuitive orders to keep the processor pipeline full. When you're learning, these are sources of confusion.
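To give a feel for the unrolling point, here is a hand-written sketch, not compiler output, of the same byte-counting loop written plainly and then unrolled four times (both function names are made up). An optimizer may produce something shaped like the second one, only in assembly and often with vector instructions on top:

```c
#include <stddef.h>

/* Plain version: one test and one increment per byte. */
size_t strlen_plain(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Hand-unrolled version: four tests per loop iteration.
   The bytes are checked in order, so nothing past the
   terminating '\0' is ever read. */
size_t strlen_unrolled4(const char *s)
{
    const char *p = s;
    for (;;) {
        if (p[0] == '\0') return (size_t)(p - s);
        if (p[1] == '\0') return (size_t)(p - s) + 1;
        if (p[2] == '\0') return (size_t)(p - s) + 2;
        if (p[3] == '\0') return (size_t)(p - s) + 3;
        p += 4;
    }
}
```

Both functions compute the same result, but the second already takes noticeably longer to read, and that is before any instruction reordering.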

What you can productively do to learn is compile C with light optimization.

The Godbolt Compiler Explorer is nice for this. Give it little code fragments and see what different compilers do at different optimization levels. The link above shows a strlen. Here's a strcpy. Here's one that's not in the standard library at all. It advances a pointer to char to either the end of the string or the first appearance of a separator character, i.e. it's a simple string parser.
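The linked snippets don't come through here, so the following is only a guess at the kind of fragment meant by that last example, not the original code (skip_to_separator is a made-up name):

```c
/* Advance p until it points at either the terminating '\0'
   or the first occurrence of sep, and return that position. */
const char *skip_to_separator(const char *p, char sep)
{
    while (*p != '\0' && *p != sep)
        p++;
    return p;
}
```

For example, calling it on "key=value" with '=' as the separator returns a pointer to the '=' character. A fragment this small is a reasonable size to paste into Compiler Explorer and read instruction by instruction.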

Generic method:

  1. Find the package. On Debian-like Linux:
$ apt-file search string.h

Of course, this is glibc.

  2. Get the source (e.g. glibc 2.31) and compile it:
$ ./configure --prefix=/usr --enable-kernel=4.0.0 --disable-profile --with-gnu-ld --enable-stack-protector=strong
$ make
  3. Implementations of functions usually have the same name as the function, so find the source file:
$ find . -type f -name "strstr.c"

This is string/strstr.c.

  4. Disassemble the default compiled version:
$ objdump -d string/strstr.o > strstr1.asm
  5. Make a new version with customized optimization. Delete the object file:
$ rm string/strstr.o

Rebuild, capturing the commands that make runs:

$ make V=1 &>strstrr.txt
  • The &> redirection is for bash; in other shells, use the script command.

Get the gcc command from strstrr.txt, modify it as you want (optimization level, processor type, ...), e.g. change -O2 to -O10 (GCC accepts this but, as far as I know, treats it the same as -O3), and run:

$ cd string
$ gcc ../sysdeps/x86_64/multiarch/strstr.c -c -std=gnu11 -fgnu89-inline  -g -O10 -Wall -Wwrite-strings -Wundef -Werror -fmerge-all-constants -frounding-math -fstack-protector-strong -Wstrict-prototypes -Wold-style-definition -fmath-errno      -ftls-model=initial-exec      -I../include -I/home/yury/LFSC/cross1/src/bglibcn/string  -I/home/yury/LFSC/cross1/src/bglibcn  -I../sysdeps/unix/sysv/linux/x86_64/64  -I../sysdeps/unix/sysv/linux/x86_64  -I../sysdeps/unix/sysv/linux/x86/include -I../sysdeps/unix/sysv/linux/x86  -I../sysdeps/x86/nptl  -I../sysdeps/unix/sysv/linux/wordsize-64  -I../sysdeps/x86_64/nptl  -I../sysdeps/unix/sysv/linux/include -I../sysdeps/unix/sysv/linux  -I../sysdeps/nptl  -I../sysdeps/pthread  -I../sysdeps/gnu  -I../sysdeps/unix/inet  -I../sysdeps/unix/sysv  -I../sysdeps/unix/x86_64  -I../sysdeps/unix  -I../sysdeps/posix  -I../sysdeps/x86_64/64  -I../sysdeps/x86_64/fpu/multiarch  -I../sysdeps/x86_64/fpu  -I../sysdeps/x86/fpu/include -I../sysdeps/x86/fpu  -I../sysdeps/x86_64/multiarch  -I../sysdeps/x86_64  -I../sysdeps/x86  -I../sysdeps/ieee754/float128  -I../sysdeps/ieee754/ldbl-96/include -I../sysdeps/ieee754/ldbl-96  -I../sysdeps/ieee754/dbl-64/wordsize-64  -I../sysdeps/ieee754/dbl-64  -I../sysdeps/ieee754/flt-32  -I../sysdeps/wordsize-64  -I../sysdeps/ieee754  -I../sysdeps/generic  -I.. -I../libio -I.   -D_LIBC_REENTRANT -include /home/yury/LFSC/cross1/src/bglibcn/libc-modules.h -DMODULE_NAME=libc -include ../include/libc-symbols.h       -DTOP_NAMESPACE=glibc -o /home/yury/LFSC/cross1/src/bglibcn/string/strstr.o -MD -MP -MF /home/yury/LFSC/cross1/src/bglibcn/string/strstr.o.dt -MT /home/yury/LFSC/cross1/src/bglibcn/string/strstr.o
$ cd ..
$ objdump -d string/strstr.o > strstr2.asm

The code will be different:

$ diff strstr1.asm strstr2.asm

So you can copy and paste the code you want into your assembly program to save time.


 