
DirectX 11 - SoA to AoS conversion using AVX causing corrupt vertex buffer at remapping

Hi!
I'm implementing a particle system in DirectX 11 and use Intel AVX intrinsics to update the particle data and to convert it from SoA (Structure of Arrays) to AoS (Array of Structures) before passing it to the IA stage.

It seems that when I use AVX intrinsics in the remapping phase, my vertex buffer containing the particle vertices gets corrupted, which results in a crash!

I've structured my particle data in a SoA fashion:

float*      mXPosition;
float*      mYPosition;
float*      mZPosition;

I allocate aligned memory for each component:

mXPosition = (float*) _aligned_malloc( NUM_PARTICLES * sizeof(float), 32 );
mYPosition = (float*) _aligned_malloc( NUM_PARTICLES * sizeof(float), 32 );
mZPosition = (float*) _aligned_malloc( NUM_PARTICLES * sizeof(float), 32 );

I create the vertex buffer with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE so that I can modify the particle data on the CPU.

D3D11_BUFFER_DESC desc;
ZeroMemory( &desc, sizeof( desc ) );

desc.BindFlags              = D3D11_BIND_VERTEX_BUFFER;
desc.Usage                  = D3D11_USAGE_DYNAMIC;
desc.ByteWidth              = sizeof(ParticleVertex12) * NUM_PARTICLES;
desc.StructureByteStride    = sizeof(ParticleVertex12);
desc.CPUAccessFlags         = D3D11_CPU_ACCESS_WRITE;

//Allocating aligned memory for the array used for mapping vertices to the buffer
mVertices = (float*) _aligned_malloc( ( NUM_PARTICLES * 3 ) * sizeof(float), 32 );


if( FAILED( device->CreateBuffer( &desc, &subData, &mVertexBuffer ) ) )
    return E_FAIL;

The vertex buffer is created successfully.

Remapping phase

D3D11_MAPPED_SUBRESOURCE mappedResource;
HRESULT hr = deviceContext->Map( mVertexBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResource );

if( SUCCEEDED( hr ) )
{
    size_t counter  = 0;
    for (int baseIndex = 0; baseIndex < NUM_PARTICLES / 8; baseIndex++)
    {
        //   Mapping from SOA-pattern to AOS-pattern 

        //Load
        __m256 xReg = _mm256_load_ps( &mXPosition[baseIndex * 8] );
        __m256 yReg = _mm256_load_ps( &mYPosition[baseIndex * 8] );
        __m256 zReg = _mm256_load_ps( &mZPosition[baseIndex * 8] );

        //Set test values
        xReg = _mm256_set_ps( 11.0f, 12.0f, 13.0f, 14.0f, 15.0f, 16.0f, 17.0f, 18.0f );
        yReg = _mm256_set_ps( 21.0f, 22.0f, 23.0f, 24.0f, 25.0f, 26.0f, 27.0f, 28.0f );
        zReg = _mm256_set_ps( 31.0f, 32.0f, 33.0f, 34.0f, 35.0f, 36.0f, 37.0f, 38.0f );

        //Shuffle
        __m256 xyReg = _mm256_shuffle_ps( xReg, yReg, _MM_SHUFFLE( 2,0,2,0 ) );
        __m256 yzReg = _mm256_shuffle_ps( yReg, zReg, _MM_SHUFFLE( 3,1,3,1 ) );
        __m256 zxReg = _mm256_shuffle_ps( zReg, xReg, _MM_SHUFFLE( 3,1,2,0 ) );

        __m256 reg03 = _mm256_shuffle_ps( xyReg, zxReg, _MM_SHUFFLE( 2, 0, 2, 0 ) );
        __m256 reg14 = _mm256_shuffle_ps( yzReg, xyReg, _MM_SHUFFLE( 3, 1, 2, 0 ) );
        __m256 reg25 = _mm256_shuffle_ps( zxReg, yzReg, _MM_SHUFFLE( 3, 1, 3, 1 ) );


        //Map, xyz
        __m128* vertexRegAOS = (__m128*)mTempPtr;

        vertexRegAOS[0] = _mm256_castps256_ps128( reg03 );  // x8,y8,z8,x7
        vertexRegAOS[1] = _mm256_castps256_ps128( reg14 );  // y7,z7,x6,y6
        vertexRegAOS[2] = _mm256_castps256_ps128( reg25 );  // z6,x5,y5,z5

        vertexRegAOS[3] = _mm256_extractf128_ps( reg03, 1 );    // x4,y4,z4,x3
        vertexRegAOS[4] = _mm256_extractf128_ps( reg14, 1 );    // y3,z3,x2,y2
        vertexRegAOS[5] = _mm256_extractf128_ps( reg25, 1 );    // z2,x1,y1,z1

        for ( int index = 0, subIndex = 0 ; index < 6; index++ )
        {
            mVertices[counter++] = vertexRegAOS[index].m128_f32[(subIndex++) % 4];
            mVertices[counter++] = vertexRegAOS[index].m128_f32[(subIndex++) % 4];
            mVertices[counter++] = vertexRegAOS[index].m128_f32[(subIndex++) % 4];
            mVertices[counter++] = vertexRegAOS[index].m128_f32[(subIndex++) % 4];
        }
    }

    memcpy( mappedResource.pData, mVertices, sizeof( ParticleVertex12 ) * NUM_PARTICLES );
    deviceContext->Unmap( mVertexBuffer, 0 );
}
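
A scalar version of the same SoA-to-AoS remapping can serve as a reference when checking what the shuffles produce. This is only a sketch: the names `soaToAosScalar` and `referenceForTestValues` are hypothetical and not part of the code above, and the arrays are plain locals rather than the aligned member buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: interleaves separate x/y/z arrays into
// x0,y0,z0,x1,y1,z1,... (the layout the vertex buffer expects).
void soaToAosScalar(const float* x, const float* y, const float* z,
                    float* out, std::size_t count)
{
    assert(x && y && z && out);
    for (std::size_t i = 0; i < count; ++i)
    {
        out[3 * i + 0] = x[i];
        out[3 * i + 1] = y[i];
        out[3 * i + 2] = z[i];
    }
}

// Expected AoS output for one batch of 8 particles using the test values
// from the remapping code. _mm256_set_ps lists its arguments from the
// highest element down to the lowest, so element 0 of xReg is 18.0f.
std::vector<float> referenceForTestValues()
{
    const float x[8] = { 18.0f, 17.0f, 16.0f, 15.0f, 14.0f, 13.0f, 12.0f, 11.0f };
    const float y[8] = { 28.0f, 27.0f, 26.0f, 25.0f, 24.0f, 23.0f, 22.0f, 21.0f };
    const float z[8] = { 38.0f, 37.0f, 36.0f, 35.0f, 34.0f, 33.0f, 32.0f, 31.0f };

    std::vector<float> out(8 * 3);
    soaToAosScalar(x, y, z, out.data(), 8);
    return out;
}
```

Comparing the 24 floats written per batch against this reference makes it easy to see whether the shuffle stage or the store stage is at fault.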

The application crashes when it hits this line

deviceContext->Unmap( mVertexBuffer, 0 );

and displays the message

D3D11 CORRUPTION: ID3D11DeviceContext::Unmap: First parameter is corrupt or NULL. [ MISCELLANEOUS CORRUPTION #13: CORRUPTED_PARAMETER1]

I may have located where the problem is, but since I'm fairly new to AVX I haven't managed to solve it.

If I comment out this section:

        //Map, xyz
        __m128* vertexRegAOS = (__m128*)mTempPtr;

        vertexRegAOS[0] = _mm256_castps256_ps128( reg03 );  // x8,y8,z8,x7
        vertexRegAOS[1] = _mm256_castps256_ps128( reg14 );  // y7,z7,x6,y6
        vertexRegAOS[2] = _mm256_castps256_ps128( reg25 );  // z6,x5,y5,z5

        vertexRegAOS[3] = _mm256_extractf128_ps( reg03, 1 );    // x4,y4,z4,x3
        vertexRegAOS[4] = _mm256_extractf128_ps( reg14, 1 );    // y3,z3,x2,y2
        vertexRegAOS[5] = _mm256_extractf128_ps( reg25, 1 );    // z2,x1,y1,z1

        for ( int index = 0, subIndex = 0 ; index < 6; index++ )
        {
            mVertices[counter++] = vertexRegAOS[index].m128_f32[(subIndex++) % 4];
            mVertices[counter++] = vertexRegAOS[index].m128_f32[(subIndex++) % 4];
            mVertices[counter++] = vertexRegAOS[index].m128_f32[(subIndex++) % 4];
            mVertices[counter++] = vertexRegAOS[index].m128_f32[(subIndex++) % 4];
        }

Then it does NOT crash. The mTempPtr used in the cast is defined like this:

mTempPtr = new float[6];

Any AVX experts out there who may have a clue on what I'm doing wrong? I'm thankful for any suggestions!

Thank you!

I think your bug is allocating space for six 32-bit floats (24 bytes) and then storing six 128-bit vectors of floats (96 bytes) into it. You're probably stepping on the bookkeeping data for the next allocation, leading to errors when you try to free().

mTempPtr = new float[6];
__m128* vertexRegAOS = (__m128*)mTempPtr;
vertexRegAOS[0] = _mm_setzero_ps();
vertexRegAOS[1] = _mm_setzero_ps();  // buffer overrun here: you only had room for 2 more floats, but you store 4.
vertexRegAOS[2] = ...;  // step on more stuff
... // corrupt even more memory :P
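
One way to size the scratch space so that all six vectors fit is sketched below. It assumes an x86 target with SSE available; `BatchScratch` and `fillScratch` are hypothetical names used only for illustration, and the stored values are placeholders for the shuffled registers.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE: __m128, _mm_set1_ps, _mm_store_ps

// Six __m128 values hold 6 * 4 = 24 floats (96 bytes), not 6 floats.
constexpr std::size_t kVectorsPerBatch = 6;
constexpr std::size_t kFloatsPerVector = sizeof(__m128) / sizeof(float);
constexpr std::size_t kFloatsPerBatch  = kVectorsPerBatch * kFloatsPerVector;
static_assert(kFloatsPerBatch == 24, "8 particles * 3 components");

// Hypothetical scratch type: 16-byte aligned so that the aligned
// store _mm_store_ps is legal on every slot.
struct alignas(16) BatchScratch
{
    float data[kFloatsPerBatch];
};

// Stores six vectors into the scratch buffer without writing past its end.
// _mm_set1_ps(i) is a placeholder for the real shuffled registers.
void fillScratch(BatchScratch& scratch)
{
    for (std::size_t i = 0; i < kVectorsPerBatch; ++i)
    {
        const __m128 value = _mm_set1_ps(static_cast<float>(i));
        _mm_store_ps(&scratch.data[i * kFloatsPerVector], value);
    }
}
```

If the buffer has to live on the heap, the equivalent with the allocator already used in the question would be `_aligned_malloc( 6 * sizeof(__m128), 16 )`. Plain `new float[...]` is not guaranteed to return 16-byte aligned memory on every platform, so the size is not the only thing to watch when casting the result to `__m128*`.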

You could save a uop or two by using a VPERM2F128 and then a single 256b store, instead of 2x VEXTRACTF128 (which apparently can't micro-fuse its store and store-data uops).

    vertexRegAOS[0] = _mm256_castps256_ps128( reg03 );  // x8,y8,z8,x7
    vertexRegAOS[1] = _mm256_castps256_ps128( reg14 );  // y7,z7,x6,y6
    vertexRegAOS[2] = _mm256_castps256_ps128( reg25 );  // z6,x5,y5,z5

    vertexRegAOS[3] = _mm256_extractf128_ps( reg03, 1 );    // x4,y4,z4,x3
    // vertexRegAOS[4] = _mm256_extractf128_ps( reg14, 1 );    // y3,z3,x2,y2
    // vertexRegAOS[5] = _mm256_extractf128_ps( reg25, 1 );    // z2,x1,y1,z1
    __m256 reg45 = _mm256_permute2f128_ps (reg14, reg25, 1|(3<<4) );
    _mm256_storeu_ps( (float*)(vertexRegAOS + 4), reg45);

Don't use 256b stores if your code has to perform decently on AMD Piledriver, though. It has a bad performance bug that makes 256b stores WAY slower than two 128b stores.

Also, isn't the loop where you copy from vertexRegAOS to mVertices[counter++] just a memcpy? I don't understand why you don't just store into mVertices directly, with unaligned stores if needed. The loop has no comments, so maybe I didn't spend enough time staring at it, in case it doesn't actually copy every float in order.
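
To check that reading of the loop, here is a small sketch that runs the question's indexing next to a memcpy on the same data. `Vec4` is a hypothetical stand-in for `__m128` (its `f` array plays the role of `m128_f32`), and the function names are made up for the comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for __m128: four packed floats.
struct Vec4
{
    float f[4];
};

// Same indexing as the copy loop in the question. subIndex takes the
// values 0,1,2,3 within each iteration, so every float is copied in order.
void copyLikeQuestion(const Vec4* regs, float* dst, std::size_t& counter)
{
    for (int index = 0, subIndex = 0; index < 6; index++)
    {
        dst[counter++] = regs[index].f[(subIndex++) % 4];
        dst[counter++] = regs[index].f[(subIndex++) % 4];
        dst[counter++] = regs[index].f[(subIndex++) % 4];
        dst[counter++] = regs[index].f[(subIndex++) % 4];
    }
}

// One memcpy of 6 * 16 = 96 bytes, advancing the counter by 24 floats.
void copyWithMemcpy(const Vec4* regs, float* dst, std::size_t& counter)
{
    std::memcpy(dst + counter, regs, 6 * sizeof(Vec4));
    counter += 24;
}

// Returns true when both variants write identical output.
bool copiesMatch()
{
    Vec4 regs[6];
    for (int i = 0; i < 6; ++i)
        for (int j = 0; j < 4; ++j)
            regs[i].f[j] = static_cast<float>(i * 4 + j);

    float a[24] = {};
    float b[24] = {};
    std::size_t counterA = 0;
    std::size_t counterB = 0;
    copyLikeQuestion(regs, a, counterA);
    copyWithMemcpy(regs, b, counterB);

    return counterA == 24 && counterB == 24 &&
           std::memcmp(a, b, sizeof(a)) == 0;
}
```

Since the two variants agree, the scratch buffer looks unnecessary: each batch could be written straight to `mVertices + counter` (24 floats per batch) with `_mm_storeu_ps`, or `_mm256_storeu_ps` for the combined halves.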
