WebAssembly + concurrency: trying to set a thread-local stack from C/C++

Question

I have a re-entrant C++ function whose wasm output is not "thread-safe" when using imported shared memory, because the function uses an aliased stack that lives in the shared linear memory at a hard-coded position.

I'm aware that multithreading is not fully supported yet, and that if I want to run multiple instances of the same module concurrently, avoiding crashes and data races is my responsibility, but I accept the challenge.

My X problem is: My code is not thread-safe, and I need it to be by having non-overlapping stacks.

My Y problem is: I'm trying to modify __stack_pointer so I can implement the stack separation, but it doesn't link. I have tried declaring extern unsigned char __stack_pointer; but that gives the following error:

wasm-ld: error: symbol type mismatch: __stack_pointer
>>> defined as WASM_SYMBOL_TYPE_GLOBAL in <internal>
>>> defined as WASM_SYMBOL_TYPE_DATA in senseless.o

Since I'm probably not supposed to touch that pointer directly, I'm thinking of other solutions as well (see below).

Working example:

#define WASM_EXPORT __attribute__((visibility("default"))) extern "C"
#define WASM_IMPORT extern "C"

struct senseless
{
    unsigned c[1024];
    unsigned __attribute__((noinline)) compute(unsigned seed) { return c[seed % 1024]; }
};

WASM_EXPORT unsigned compute_senseless(unsigned seed)
{
    senseless h;
    h.c[5] = seed;
    return h.compute(seed);
}

After compilation, the resultant implementation in WAST is:

(module
  (type (;0;) (func (param i32) (result i32)))
  (import "env" "memory" (memory (;0;) 2 2 shared))
  
  (func (;0;) (type 0) (param i32) (result i32)
    (local i32)                          ;; int l; // address of h
    (global.set 0                        ;; SP = l (4096 = sizeof(senseless))
      (local.tee 1                       ;; where l = SP - 4096
        (i32.sub (global.get 0)
                 (i32.const 4096))))
    (i32.store offset=20                 ;; linear_memory[l + 5 * 4] = seed
      (local.get 1) (local.get 0))
    (i32.load                            ;; return_value = linear_memory[l + (seed & 1023) * 4]
      (i32.add (local.get 1)
               (i32.shl
                  (i32.and (local.get 0) (i32.const 1023))
                  (i32.const 2))))
    (global.set 0                        ;; SP = l + 4096
      (i32.add (local.get 1) (i32.const 4096))))

  (global (;0;) (mut i32) (i32.const 66576)) ;; SP = 66576 (stack pointer)
  (export "compute_senseless" (func 0)))

So the function first decreases SP by 4096 bytes to allocate h, executes h.c[5] = seed, pushes h.c[seed % 1024] onto the (wasm) stack, and then restores SP.
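
The prologue/epilogue pattern described above can be modelled in plain C++. This is only a rough sketch for illustration: linear_memory, sp, and compute_senseless_model are hypothetical names standing in for the module's memory, its global 0, and the exported function, and the model zero-fills memory, which real linear memory shared between instances would not guarantee.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for the module state: two 64 KiB pages of linear
// memory and the mutable global 0 (the stack pointer), initialised to 66576.
static std::vector<std::uint8_t> linear_memory(2 * 65536);
static std::uint32_t sp = 66576;

// Mirrors the generated function: reserve 4096 bytes below SP for `h`,
// store the seed into h.c[5], load h.c[seed % 1024], then restore SP.
std::uint32_t compute_senseless_model(std::uint32_t seed) {
    const std::uint32_t l = sp - 4096;  // prologue: l = SP - sizeof(senseless)
    sp = l;

    // h.c[5] = seed  ->  linear_memory[l + 5 * 4] = seed
    std::memcpy(&linear_memory[l + 20], &seed, sizeof seed);

    // return h.c[seed % 1024]  ->  load from l + (seed & 1023) * 4
    std::uint32_t result = 0;
    std::memcpy(&result, &linear_memory[l + ((seed & 1023u) << 2)], sizeof result);

    sp = l + 4096;  // epilogue: give the frame back
    return result;
}
```

Every instance starts from the same initial sp, so two instances sharing one memory compute the same l and write to the same bytes.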

If two instances of this module are working in parallel on two different WebWorkers, h.c[5] = seed could cause a data race, because they use the same hard-coded SP as stack base.

To force each instance to have its own non-overlapping region for the stack, I need to modify the stack pointer, but I don't know how to do it. I have some ideas though (using wasi-libc or emscripten is overkill for me):

As I said at the beginning: modifying __stack_pointer directly, but I can't make it work (the error is shown above). Besides, I don't know whether there is anything else to consider apart from keeping it 16-byte aligned.
Clang already defines the function __wasm_init_tls, but its body is empty and nothing calls it (the symbol shows up when using the --export-all flag). I don't know whether that function is usable at all.
I'm aware that clang supports relocation sections, dynamic linking, etc., but I don't understand these options well enough to know how to combine them for my purpose.
Or... doing it by hand with a WAST function that sets the pointer, which I would call manually right after instantiation:

// C++
WASM_IMPORT void place_SP(int addr); // Implemented in WAST; extern "C" avoids C++ name mangling.

/* To be called from JavaScript; I'll make sure that this call never happens concurrently, that SP is a multiple of 16 (the stack base is 16-byte aligned), and that there is enough space to avoid overlapping stacks. */
WASM_EXPORT void init_stack(int SP) { place_SP(SP); }

// place_SP.wast
(func $place_SP (param $addr i32)
      (global.set 0 (local.get $addr)))

but I'm unsure about what else I need to write in place_SP.wast (is it a different module? do I need to specify a (type) entry?), and how I should modify my makefile to compile place_SP.wast and link it with senseless.o to produce a valid senseless.wasm.
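
Whichever mechanism ends up setting the pointer, each instance needs a stack top that is 16-byte aligned and whose region does not overlap any other. A minimal sketch of that arithmetic follows. The names align_down and stack_top_for are hypothetical, and the layout (instance i owns the range from base + i * size up to base + (i + 1) * size, with SP starting at the upper end because the stack grows downwards) is an assumption, not something the toolchain prescribes.

```cpp
#include <cstdint>

constexpr std::uint64_t kStackAlign = 16;

// Rounds addr down to a multiple of 16.
constexpr std::uint64_t align_down(std::uint64_t addr) {
    return addr & ~(kStackAlign - 1);
}

// Returns the initial SP (stack top) for instance `index`, or 0 when the
// region would not fit below `memory_size`. All values are byte offsets
// into the shared linear memory.
constexpr std::uint32_t stack_top_for(std::uint32_t region_base,
                                      std::uint32_t stack_size,
                                      std::uint32_t index,
                                      std::uint32_t memory_size) {
    // Align the start of the region up and the per-instance size down, so
    // that every computed top is a multiple of 16.
    const std::uint64_t base = align_down(std::uint64_t{region_base} + kStackAlign - 1);
    const std::uint64_t size = align_down(stack_size);
    const std::uint64_t top = base + (std::uint64_t{index} + 1) * size;
    return (size == 0 || top > memory_size) ? 0u : static_cast<std::uint32_t>(top);
}

// Example with hypothetical numbers: regions start at 66576, each stack is
// one page, and the shared memory is four pages (262144 bytes).
static_assert(stack_top_for(66576, 65536, 0, 262144) == 132112, "instance 0");
static_assert(stack_top_for(66576, 65536, 1, 262144) == 197648, "instance 1");
static_assert(stack_top_for(66576, 65536, 2, 262144) == 0, "does not fit");
```

The region must also stay clear of static data and of anything else placed in the same memory (a heap, for instance), which this sketch does not check.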

ADDITIONAL INFO:

Makefile:

.SUFFIXES: .wasm

WASI_SDK := <my-wasi-sdk-path> # I'm using wasi-sdk 12.
CXX := $(WASI_SDK)/bin/clang++
LD := $(WASI_SDK)/bin/wasm-ld

CPPFLAGS := --sysroot=$(WASI_SDK)/share/wasi-sysroot
CXXFLAGS := -O3 -flto -fno-exceptions
LDFLAGS := --lto-O3 -E --no-entry --import-memory --max-memory=131072 \
       --features=atomics,bulk-memory --shared-memory

senseless.wasm: senseless.o
    $(LD) --verbose $(LDFLAGS) $< -o $@
    wasm-opt $@ -Oz -o $@
    wasm-strip $@
    wasm2wat -f --enable-threads --enable-bulk-memory $@ > $*.wast

senseless.o: senseless.cpp
    $(CXX) -v -c $(CPPFLAGS) $(CXXFLAGS) $< -o $@

clean:
    $(RM) senseless.wasm senseless.o

Compiler logging output (only what I consider to be the relevant stuff):

# Internal compiler call (I have omitted options regarding paths):
 "<wasi-sdk>/bin/clang-11" -cc1 -triple wasm32-unknown-wasi -emit-llvm-bc -flto -flto-unit -disable-free -disable-llvm-verifier -discard-value-names -main-file-name senseless.cpp -mrelocation-model static -mframe-pointer=none -fno-rounding-math -mconstructor-aliases -target-cpu generic -fvisibility hidden -debugger-tuning=gdb -v -O3 -fdeprecated-macro -ferror-limit 19 -fgnuc-version=4.2.1 -fcolor-diagnostics -vectorize-loops -vectorize-slp -o senseless.o -x c++ senseless.cpp

# Memory layout (linker output)
wasm-ld: mem: global base = 1024 # I wonder what are the first 1024 bytes used for.
wasm-ld: mem: __wasm_init_memory_flag offset=1024     size=4        align=4
wasm-ld: mem: static data = 4 # No .rodata/.bss in this example, only the (unused) i32 flag.
wasm-ld: mem: stack size  = 65536 # One page (64KiB) for the stack, from 66576 to 1040.
wasm-ld: mem: stack base  = 1040 # I wonder what are the bytes from 1028 to 1039 used for.
wasm-ld: mem: stack top   = 66576
wasm-ld: mem: heap base   = 66576 # Heap from 66576 up to 131072.
wasm-ld: mem: total pages = 2
wasm-ld: mem: max pages   = 2

NOTE: I have tried to customize the stack size by adding -z,stack-size=131072 to LDFLAGS (so that the stack is two pages long instead of one, with --max-memory increased to three pages), but none of stack base/size/top change at all.

Answer 1

Firstly, if you are doing multi-threading with emscripten then each thread will already have its own stack and its own value for __stack_pointer. That is part of what defines a thread.

If you still want to manipulate the stack yourself (perhaps to have many stacks within a single thread) then you can use the emscripten helper functions stackSave (to get the SP of the current thread) and stackRestore (to set the SP of the current thread).

If you are not using emscripten at all, then you are in uncharted territory (what runtime are you using? how are you starting new threads?), but the simplest way to do stack pointer manipulation would be with assembly code. See how emscripten implements these functions:

https://github.com/emscripten-core/emscripten/blob/main/system/lib/compiler-rt/stack_ops.S

So you could do something like this:

# Make the symbol visible to the linker so other objects can call it.
.globl place_SP
.globaltype __stack_pointer, i32

place_SP:
  .functype place_SP(i32) -> ()
  local.get 0
  global.set __stack_pointer
  end_function

Then compile that code with clang -c place_sp.s -o place_sp.o and link place_sp.o together with your other object files.
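
On the C++ side, the declaration has to use C linkage so that its name matches the assembly label. A minimal sketch, assuming the assembly routine is linked in as place_SP: the alignment check and the bool return value are additions for illustration, and the non-wasm branch is a hypothetical stand-in (g_fake_sp) that only exists so the wrapper can be compiled and exercised on a host machine.

```cpp
#if defined(__wasm__)
// Provided by place_sp.o; extern "C" keeps the symbol name unmangled.
extern "C" void place_SP(int addr);
#else
// Host-only stand-in (hypothetical): records the requested SP instead of
// touching a real stack pointer.
static int g_fake_sp = 0;
extern "C" void place_SP(int addr) { g_fake_sp = addr; }
#endif

// Exported wrapper to be called once per instance, right after
// instantiation and before any other export. Returns false and leaves SP
// untouched when the address is not a positive multiple of 16.
extern "C" __attribute__((visibility("default")))
bool init_stack(int sp) {
    if (sp <= 0 || sp % 16 != 0) {
        return false;
    }
    place_SP(sp);
    return true;
}
```

One caveat: if the compiler gave init_stack a stack frame of its own, its epilogue would restore the old SP and undo the change. A wrapper this small should not need a frame when optimised, but exporting place_SP directly and calling it from the host avoids the question entirely.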
