
Using memcpy on mmap'ed region crashes, a for loop does not

I have an NVIDIA Tegra TK1 processor module on a carrier board with a PCI-e slot connecting to it. In that PCIe slot is an FPGA board which exposes some registers and a 64K memory area via PCIe.

On the ARM CPU of the Tegra board, a minimal Linux installation is running.

I am using /dev/mem and the mmap function to obtain user-space pointers to the register structs and the 64K memory area. The distinct register files and the memory block are all assigned addresses that are aligned to, and do not overlap across, 4 KB memory pages. I explicitly map whole pages with mmap, using the result of getpagesize(), which is also 4096.

I can read/write from/to those exposed registers just fine. I can also read from the memory area (64KB), doing uint32 word-by-word reads in a for loop, just fine. I.e., the read contents are correct.

But if I use std::memcpy on the same address range, the Tegra CPU always freezes. I do not see any error message; with GDB attached, I also don't see anything in Eclipse when trying to step over the memcpy line; it just stops hard. And I have to reset the CPU using the hardware reset button, as the remote console is frozen.

This is a debug build with no optimization (-O0), using gcc-linaro-6.3.1-2017.05-i686-mingw32_arm-linux-gnueabihf. I was told the 64K region is accessible byte-wise; I did not try that explicitly.

Is there an actual (potential) problem that I need to worry about, or is there a specific reason why memcpy does not work and maybe should not be used in this scenario in the first place, so that I can just carry on using my for loops and think nothing of it?

EDIT: Another effect has been observed. The original code snippet was missing a "vital" printf in the copying for loop, which came before the memory read. With that removed, I don't get valid data back. I have now updated the code snippet to do an extra read from the same address instead of the printf, which also yields correct data. The confusion intensifies.

Here are the (I think) important excerpts of what's going on, with minor modifications so they make sense in this "de-fluffed" form.

// void* physicalAddr: PCIe "BAR0" address as reported by dmesg, added to the physical address offset of FPGA memory region
// long size: size of the physical region to be mapped 

//--------------------------------
// doing the memory mapping
//

const uint32_t pageSize = getpagesize();
assert( IsPowerOfTwo( pageSize ) );

const uint32_t physAddrNum = (uint32_t) physicalAddr;
const uint32_t offsetInPage = physAddrNum & (pageSize - 1);
const uint32_t firstMappedPageIdx = physAddrNum / pageSize;
const uint32_t lastMappedPageIdx = (physAddrNum + size - 1) / pageSize;
const uint32_t mappedPagesCount = 1 + lastMappedPageIdx - firstMappedPageIdx;
const uint32_t mappedSize = mappedPagesCount * pageSize;
const off_t targetOffset = physAddrNum & ~(off_t)(pageSize - 1);

m_fileID = open( "/dev/mem", O_RDWR | O_SYNC );
// Passing NULL as addr means the kernel chooses where to place the mapping.
// A non-NULL addr would only be taken by Linux as a "hint" for the placement.
void* mapAtPageStart = mmap( 0, mappedSize, PROT_READ | PROT_WRITE, MAP_SHARED, m_fileID, targetOffset );
if (MAP_FAILED != mapAtPageStart)
{
    m_userSpaceMappedAddr = (volatile void*) ( uint32_t(mapAtPageStart) + offsetInPage );
}
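
// (Editorial sketch, not part of the original code: the individual registers
//  are then typically accessed through volatile pointers derived from
//  m_userSpaceMappedAddr; with a hypothetical register layout this would look
//  roughly like:
//
//      struct FpgaRegs { uint32_t control; uint32_t status; };
//      volatile FpgaRegs* regs = (volatile FpgaRegs*) m_userSpaceMappedAddr;
//      uint32_t status = regs->status;   // a single 32-bit MMIO read
//  )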

//--------------------------------
// Accessing the mapped memory
//

//void* m_rawData: <== m_userSpaceMappedAddr
//uint32_t* destination: points to a stack object
//int length: size in 32bit words of the stack object (a struct with only U32's in it)

// this crashes:
std::memcpy( destination, m_rawData, length * sizeof(uint32_t) );

// this does not, AND does yield correct memory contents - but only with a preceding extra read
for (int i=0; i<length; ++i)
{
    // This extra read makes the data gotten in the 2nd read below valid.
    // Commented out, the data read into destination will not be valid.
    uint32_t tmp = ((const volatile uint32_t*)m_rawData)[i];
    (void)tmp; //pacify compiler

    destination[i] = ((const volatile uint32_t*)m_rawData)[i];
}

Based on the description, it looks like your FPGA code is not responding correctly to load instructions that are reading from locations on your FPGA, and that is causing the CPU to lock up. It's not crashing; it is permanently stalled, hence the need for the hard reset. I had this problem too when debugging my PCIe logic on an FPGA.

Another indication that your logic is not responding correctly is that you need an extra read in order to get the right responses.

Your loop is doing 32-bit loads, but memcpy is doing at least 64-bit loads, which changes how your logic responds. For example, it will need to use two TLPs, with 32 bits of the response in the first 128 bits of the completion and the next 32 bits in the second 128-bit TLP of the completion.
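
If the FPGA logic is only guaranteed to handle 32-bit reads, one way to keep a memcpy-like interface without risking wider accesses is a small copy helper that issues exactly one volatile 32-bit load per word (similar in spirit to the Linux kernel's memcpy_fromio). A minimal sketch using the names from the question; copyFromMmio32 is a hypothetical helper, and it deliberately does not include the extra-read workaround described above:

#include <cstddef>
#include <cstdint>

// Copy 'words' 32-bit values out of a memory-mapped I/O region.
// Each iteration performs exactly one volatile 32-bit load from the mapped
// region, so the compiler cannot merge reads into wider (64-bit or multi-word)
// accesses the way a library memcpy may.
static void copyFromMmio32( uint32_t* dst, const volatile void* src, size_t words )
{
    const volatile uint32_t* s = (const volatile uint32_t*) src;
    for (size_t i = 0; i < words; ++i)
    {
        dst[i] = s[i];
    }
}

// Usage with the question's variables:
//   copyFromMmio32( destination, m_rawData, length );

This only changes how the host issues the reads; if the FPGA completion logic itself is broken, fixing that is still the real solution.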

What I found super-useful was to add logic to log all the PCIe transactions into an SRAM and to be able to dump the SRAM out to see how the logic was behaving or misbehaving. We have a nifty utility, pcieflat, that prints one PCIe TLP per line. It even has documentation.

When the PCIe interface is not working well enough, I stream the log to a UART in hex, which can be decoded by pcieflat.

This tool is also useful for debugging performance problems -- you can look at how well your DMA reads and writes are pipelined.

Alternatively, if you have an integrated logic analyzer or similar on the FPGA, you can trace the activity that way. But it's nicer to have the TLPs parsed according to the PCIe protocol.
