
Why is it that we can write outside of bounds in C?

I recently finished reading about virtual memory and I have a question about how malloc works within the Virtual address space and Physical Memory.

For example (code copied from another SO post)

#include <stdio.h>
#include <stdlib.h>

int main(void){
    int *p;
    p = malloc(sizeof(int));
    p[500] = 999999;
    printf("p[500]=%d\n", p[500]); //works just fine.
    return 0;
}

Why is this allowed to happen? Or like why is that address at p[500] even writable?

Here is my guess.

When malloc is called, perhaps the OS decides to give the process an entire page. I will just assume that each page is 4 KB. Is that entire page marked as writable? That would explain why you can go as far as 500*sizeof(int) into the page (assuming a 32-bit system where int is 4 bytes).
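
As a side note, on glibc you can ask the allocator how much it actually handed out with malloc_usable_size() (a GNU extension, not standard C). A minimal sketch, assuming glibc; it shows the request being rounded up, although only the requested bytes may legally be used:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>   /* malloc_usable_size() is a glibc extension */

int main(void)
{
    int *p = malloc(sizeof(int));
    if (p == NULL)
        return 1;
    /* Typically prints a usable size larger than 4: the allocator rounds
       requests up, but only the requested sizeof(int) bytes may legally be used. */
    printf("requested %zu, usable %zu\n", sizeof(int), malloc_usable_size(p));
    free(p);
    return 0;
}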

I see that when I try to edit at a larger value...

   p[500000]=999999; // EXC_BAD_ACCESS according to Xcode

Seg fault.

If so, then does that mean that there are pages dedicated to your code/instructions/text segments and marked as unwritable, completely separate from the pages where your stack/variables live (where things do change) and which are marked as writable? Of course, the process thinks they're next to each other in the 4 GB address space on a 32-bit system.

"Why is this allowed to happen?" (write outside of bounds)

C does not require the additional CPU instructions that would typically be needed to prevent this out-of-range access.

That is the speed of C - it trusts the programmer, giving the coder all the rope needed to perform the task - including enough rope to hang oneself.

Consider the following code for Linux:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int staticvar;
const int constvar = 0;

int main(void)
{
        int stackvar;
        char buf[200];
        int *p;

        p = malloc(sizeof(int));
        sprintf(buf, "cat /proc/%d/maps", getpid());
        system(buf);

        printf("&staticvar=%p\n", &staticvar);
        printf("&constvar=%p\n", &constvar);
        printf("&stackvar=%p\n", &stackvar);
        printf("p=%p\n", p);
        printf("undefined behaviour: &p[500]=%p\n", &p[500]);
        printf("undefined behaviour: &p[50000000]=%p\n", &p[50000000]);

        p[500] = 999999; //undefined behaviour
        printf("undefined behaviour: p[500]=%d\n", p[500]);
        return 0;
}

It prints the memory map of the process and the addresses of a few different types of memory.

[osboxes@osboxes ~]$ gcc tmp.c -g -static -Wall -Wextra -m32
[osboxes@osboxes ~]$ ./a.out
08048000-080ef000 r-xp 00000000 fd:00 919429                /home/osboxes/a.out
080ef000-080f2000 rw-p 000a6000 fd:00 919429                /home/osboxes/a.out
080f2000-080f3000 rw-p 00000000 00:00 0
0824d000-0826f000 rw-p 00000000 00:00 0                     [heap]
f779c000-f779e000 r--p 00000000 00:00 0                     [vvar]
f779e000-f779f000 r-xp 00000000 00:00 0                     [vdso]
ffe4a000-ffe6b000 rw-p 00000000 00:00 0                     [stack]
&staticvar=0x80f23a0
&constvar=0x80c2fcc
&stackvar=0xffe69b88
p=0x824e2a0
undefined behaviour: &p[500]=0x824ea70
undefined behaviour: &p[50000000]=0x1410a4a0
undefined behaviour: p[500]=999999

Or like why is that address at p[500] even writable?

The heap runs from 0824d000-0826f000 and &p[500] happens to be 0x824ea70, so that memory is readable and writable; but this memory region may contain real data, which the write would silently alter! In this sample program the location is most likely unused, so the write does not disturb the rest of the process.
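
To make the "may contain real data" point concrete, here is a small sketch (not part of the original example) that writes past one allocation and, with typical allocator layouts, clobbers a neighbouring one. Everything about the out-of-bounds store is undefined behaviour and the outcome depends entirely on the allocator:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void)
{
    int *a = malloc(sizeof(int));
    int *b = malloc(sizeof(int));   /* often placed a few dozen bytes after a */
    if (a == NULL || b == NULL)
        return 1;

    *b = 42;
    printf("a=%p b=%p, b before: %d\n", (void *)a, (void *)b, *b);

    /* Undefined behaviour: index past the end of a.  If the index happens
       to land on b's storage, b is silently corrupted. */
    intptr_t off = (intptr_t)b - (intptr_t)a;        /* byte distance, illustrative only */
    if (off > 0 && off % (intptr_t)sizeof(int) == 0)
        a[off / (intptr_t)sizeof(int)] = 999999;     /* out-of-bounds write */

    printf("b after: %d\n", *b);
    free(a);
    free(b);
    return 0;
}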

&p[50000000] happens to be 0x1410a4a0, which is not in any page the kernel has mapped into the process, so it is neither readable nor writable, hence the segfault.
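
On Linux you can probe whether an address lies inside a mapped page, for example with mincore(), which fails with ENOMEM for an unmapped range. A Linux-specific sketch (note that even forming &p[50000000] is undefined behaviour; it is done here only for illustration, just like the printf calls in the program above):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>

/* Returns nonzero if the page containing addr is mapped into the process. */
static int is_mapped(const void *addr)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)addr & ~(uintptr_t)(pagesize - 1));
    unsigned char vec;
    return mincore(page, 1, &vec) == 0 || errno != ENOMEM;
}

int main(void)
{
    int *p = malloc(sizeof(int));
    if (p == NULL)
        return 1;
    printf("&p[500]      mapped? %d\n", is_mapped(&p[500]));
    printf("&p[50000000] mapped? %d\n", is_mapped(&p[50000000]));
    free(p);
    return 0;
}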

If you compile it with -fsanitize=address, memory accesses are checked and many (but not all) illegal memory accesses are reported by AddressSanitizer. The slowdown is roughly a factor of two compared to a build without AddressSanitizer.

[osboxes@osboxes ~]$ gcc tmp.c -g -Wall -Wextra -m32 -fsanitize=address
[osboxes@osboxes ~]$ ./a.out
[...]
undefined behaviour: &p[500]=0xf5c00fc0
undefined behaviour: &p[50000000]=0x1abc9f0
=================================================================
==2845==ERROR: AddressSanitizer: heap-buffer-overflow on address 0xf5c00fc0 at pc 0x8048972 bp 0xfff44568 sp 0xfff44558
WRITE of size 4 at 0xf5c00fc0 thread T0
    #0 0x8048971 in main /home/osboxes/tmp.c:24
    #1 0xf70a4e7d in __libc_start_main (/lib/libc.so.6+0x17e7d)
    #2 0x80486f0 (/home/osboxes/a.out+0x80486f0)

AddressSanitizer can not describe address in more detail (wild memory access suspected).
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/osboxes/tmp.c:24 main
[...]
==2845==ABORTING

If so, then does that mean that there are pages that are dedicated to your code/instructions/text segments and marked as unwrite-able completely separate from your pages where your stack/variables are in (where things do change) and marked as writable?

Yes, see the output of the process' memory map above. r-xp means readable and executable, rw-p means readable and writeable.
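
A small sketch (not part of the answer above) showing those protection bits in action: a const object at file scope is typically placed in a read-only mapping, so writing to it usually raises SIGSEGV, in contrast to the stray heap write earlier:

#include <stdio.h>

const int constvar = 0;          /* usually ends up in a read-only (r--p or r-xp) mapping */

int main(void)
{
    int *p = (int *)&constvar;   /* casting away const; writing through p is undefined behaviour */
    printf("about to write to %p\n", (void *)p);
    *p = 1;                      /* typically SIGSEGV: the page is mapped without the 'w' bit */
    printf("not normally reached on systems with memory protection\n");
    return 0;
}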

Why is this allowed to happen?

One of the primary design goals of the C (and C++) languages is to be as run-time efficient as possible. The designers of C (or C++) could have decided to include a rule in the language specification that said "writing outside the bounds of an array must cause X to happen" (where X is some well-defined behavior, such as a crash or thrown exception)... but had they done so, every C compiler would have been required to generate bounds-checking code for every array access the C program does. Depending on the target hardware and cleverness of the compiler, enforcing a rule like that could easily make every C (or C++) program 5-10 times slower than it currently can be.
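
For illustration, here is roughly what such a mandated check could look like if written out by hand. The helper check_index() is hypothetical; the point is that a compiler enforcing such a rule would have to emit an equivalent compare-and-branch around every indexed access:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical bounds-checked access: the extra compare and branch per access
   is exactly the cost the C standard chose not to impose. */
static int *check_index(int *base, size_t n, size_t i)
{
    if (i >= n) {
        fprintf(stderr, "index %zu out of bounds (size %zu)\n", i, n);
        abort();
    }
    return &base[i];
}

int main(void)
{
    int arr[10] = {0};
    *check_index(arr, 10, 3) = 42;                     /* fine */
    printf("arr[3]=%d\n", *check_index(arr, 10, 3));
    *check_index(arr, 10, 500) = 1;                    /* aborts instead of corrupting memory */
    return 0;
}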

So instead of requiring the compiler to enforce array bounds, they simply indicated that writing outside the bounds of the array is undefined behavior -- which is to say, you shouldn't do it, but if you do do it, then there are no guarantees about what will happen, and anything that happens that you don't like is your problem, not theirs.

Real-world implementations are then free to do whatever they want -- for example, on an OS with memory protection you will likely see page-based behavior like you described, or in an embedded device (or on older OS's like MacOS 9, MS-DOS, or AmigaDOS) the computer may just happily let you write to anywhere in memory, because to do otherwise would make the computer too slow.

As a low-level (by modern standards) language, C (C++) expects the programmer to follow the rules, and will only mechanically enforce those rules if/when it can do so without incurring runtime overhead.

Undefined behavior.

That's what it is. You can try to write out of bounds but it's not guaranteed to work. It might work, it might not. What happens is completely undefined.

Why is this allowed to happen?

Because the C and C++ standards allow it. The languages are designed to be fast. Having to check for out of bounds accesses would require a run-time operation, which would slow the program down.

why is that address at p[500] even writable?

It just happened to be. Undefined behavior.

I see that when I try to edit at a larger value...

See? Again, it just happened to segfault.

When malloc is called, perhaps the OS decides to give the process an entire page.

Maybe, but the C and C++ standards don't require such behavior. They only require that the OS make at least the requested amount of memory available for use by the program. (If there's memory available.)

It's undefined behaviour...

  • if you try to access outside bounds anything may happen, including SIGSEGV or corruption elsewhere in the heap that causes your program to produce wrong results, hang, crash later etc.

  • the memory may be writable without obvious failure on some given run for some compiler/flags/OS/day-of-the-week etc. because:

    • malloc() might actually allocate a larger block in which [500] can be written (though on another run of the program it might not), or
    • [500] might be after the allocated block, but still memory accessible to the program
      • it's likely that [500], being a relatively small increment, would still be inside the heap, which may extend further than the addresses that malloc calls have so far handed out, due to an earlier reservation of heap memory (e.g. using sbrk()) in preparation for anticipated use
      • it's also possible that [500] is "off the end of" the heap, and you end up writing to some other memory area, e.g. over static data or thread-specific data (including the stack)

Why is this allowed to happen?

There are two aspects to this:

  • checking indices on every access would bloat the program (add extra machine code instructions) and slow down its execution; generally the programmer can do some minimal validation of indices (e.g. validating once when a function is entered, then using the index however many times), or generate the indices in a way that guarantees their validity (e.g. looping from 0 to the array size)

  • managing the memory extremely precisely, such that an out-of-bounds access is reported by a CPU fault, is highly hardware-dependent and in general only possible at page boundaries (e.g. granularity in the 1k to 4k range); it also takes extra instructions (whether inside some enhanced malloc function or in malloc-wrapping code) and time to orchestrate (a guard-page sketch of this approach follows the list)
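
Here is the guard-page sketch mentioned above (Linux/POSIX, using mmap() and mprotect()). It is roughly the trick tools such as Electric Fence use: the object is placed at the very end of an accessible page with an inaccessible page right after it, so overrunning the object faults immediately, but it costs two pages and extra system calls per small allocation and only works at page granularity:

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);

    /* Map two pages: the first usable, the second turned into a guard page. */
    char *base = mmap(NULL, 2 * pagesize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return 1;
    if (mprotect(base + pagesize, pagesize, PROT_NONE) != 0)
        return 1;

    /* Place one int right at the end of the accessible page. */
    int *p = (int *)(base + pagesize - sizeof(int));
    p[0] = 123;             /* fine */
    printf("p[0]=%d\n", p[0]);
    p[1] = 456;             /* lands on the guard page: faults at the exact overrun */
    return 0;
}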

It's simply that in C the concept of an array is rather basic.

The assignment to p[] is in C the same as:

*(p+500)=999999;

and all the compiler does to implement that is:

fetch p;
calculate the offset: multiply 500 by sizeof(*p) -- e.g. 4 for an int;
add p and the offset to get the memory address;
write to that address.

In many architectures this is implementable in one or two instructions.
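
To spell that out in C itself, the three stores below are just different spellings of the same unchecked address computation (the allocation here is made large enough that they are actually valid):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int *p = malloc(1000 * sizeof *p);   /* large enough, so index 500 is in bounds */
    if (p == NULL)
        return 1;

    /* Three equivalent spellings of the same store; none of them is checked. */
    p[500] = 999999;
    *(p + 500) = 999999;
    *(int *)((char *)p + 500 * sizeof *p) = 999999;

    printf("p[500]=%d\n", p[500]);
    free(p);
    return 0;
}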

Note that not only does the compiler not know that the value 500 is not within the array, it doesn't actually know the array size to begin with!

In C99 and later, some work has been done to make arrays safer, but fundamentally C is a language designed to be fast to compile and fast to run, not safe.

Put another way. In Pascal, the compiler will prevent you from shooting your foot. In C++ the compiler provides ways to make it more difficult to shoot your foot, while in C the compiler doesn't even know you have a foot.

In the language described by the 1974 C Reference Manual, the meaning of int arr[10]; at file scope was "reserve a region of consecutive storage locations large enough to hold 10 values of type int, and bind the name arr to the address at the start of that region". The meaning of the expression arr[someInt] was then "multiply someInt by the size of an int, add that number of bytes to the base address of arr, and access whatever int happens to be stored at the resulting address". If someInt is in the range 0..9, the resulting address falls within the space that was reserved when arr was declared, but the language was agnostic about whether the value actually falls in that range. If, on a platform where int was two bytes, a programmer happened to know that the address of some object x was 200 bytes past the starting address of arr, then an access to arr[100] would be an access to x. As to how a programmer would happen to know that x was 200 bytes past the start of arr, or why the programmer would want to write arr[100] rather than x to access x, the design of the language was completely agnostic.

The C Standard allows, but does not require, implementations to behave as described above unconditionally, even in cases where the address would fall outside the bounds of the array object being indexed. Code which relies upon such behavior will often be non-portable, but on some platforms may be able to accomplish some tasks more efficiently than would otherwise be possible.
