
How can I read large (10MB) byte arrays with Entity Framework and PostgreSQL without using too much memory?

When my server application starts it reads about 20 records from my database (using Entity Framework with Npgsql to read from a PostgreSQL database). There are a couple of bytea columns, and one of them is pretty big: on average it holds about 2.5 MB, but some records have upwards of 7 MB, and ideally it should be able to hold up to 20 MB. In total, the data for all 20 records combined is 52 MB (and it should be able to handle more in the future).

I read all of these records at once; they are not kept in memory, they are sent to another server, and then the DbContext is disposed.

    using (var db = new PsqlContext())
    {
        WebApi.Entities.BuilderDungeon[] builderDungeons = db.BuilderDungeons
            .Where(d => d.UseInGame)
            .Include(d => d.Creator)
            .ToArray();
    }

I don't understand why, but after I run this query the server application's memory usage goes from 159 MB to 1 GB and stays there. I'm using Visual Studio's Diagnostic Tools to try to figure out why it's using so much memory, and it all comes from Npgsql.NpgsqlReadBuffer.

What am I doing wrong here?

That's how .NET memory allocation works. The runtime sizes its heaps according to the number of CPU cores you have (a real pain for Kubernetes-based deployments) and also reserves more memory than it currently needs, roughly doubling it.
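If you want to confirm that behaviour, a small diagnostic sketch (not part of the original answer) checking whether the process runs under the Server GC, which keeps per-core heaps and releases memory back to the OS much more lazily than the Workstation GC:

    using System;
    using System.Runtime;

    // Diagnostic only: report which GC mode the process is running under.
    Console.WriteLine($"Server GC:    {GCSettings.IsServerGC}");
    Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");

Switching to the Workstation GC via `<ServerGarbageCollection>false</ServerGarbageCollection>` in the project file trades some throughput for a smaller footprint.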

One 20 MB array holds, well, 20 MB of memory, but to send it to another server you also need to serialize it, probably as JSON, which has no byte[] type, so it goes out as Base64. Base64 adds roughly another third (4 output characters for every 3 input bytes), so that's about 27 MB of encoded text on top of what you already have, totaling around 47 MB for that one value, excluding all other allocations.
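To make that overhead concrete, a quick sketch (the 20 MB figure comes from the question; everything else is just illustration):

    using System;

    // Base64 encodes every 3 input bytes as 4 output characters, so the
    // encoded text is ~33% larger than the raw array, and a .NET string
    // stores 2 bytes per character on top of that.
    byte[] payload = new byte[20 * 1024 * 1024];   // ~20 MB of binary data
    string base64 = Convert.ToBase64String(payload);
    Console.WriteLine($"Raw bytes:          {payload.Length:N0}");
    Console.WriteLine($"Base64 characters:  {base64.Length:N0}");                     // ~28 million
    Console.WriteLine($"String size in RAM: {(long)base64.Length * sizeof(char):N0} bytes");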

Try using something more capable, like JetBrains' dotMemory, to take a snapshot, see what's in memory, and more importantly where it was allocated.

At the lower-level ADO.NET layer, Npgsql buffers entire rows by default; this allows you to read columns in any order (random access), but it means memory usage depends on row size, which is bad when columns are huge.

You can efficiently read large (binary) columns by passing CommandBehavior.SequentialAccess. In this mode rows get streamed and memory usage is fixed (and very small), but you have to read columns in the order in which they come back and can't read a column more than once. Unfortunately, this mode isn't very compatible with an ORM such as EF Core, so I'd recommend dropping down to ADO.NET for queries that load large rows.
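A minimal sketch of dropping down to Npgsql for such a query; the table and column names (builder_dungeons, payload) and the destination stream are placeholders based on the question, not the actual schema:

    using System.Data;
    using System.IO;
    using Npgsql;

    // Sketch: stream each large bytea column straight to a destination stream
    // (e.g. an outgoing HTTP request body) without materializing it as a byte[].
    static void StreamDungeonPayloads(string connectionString, Stream destination)
    {
        using (var conn = new NpgsqlConnection(connectionString))
        {
            conn.Open();
            using (var cmd = new NpgsqlCommand(
                "SELECT id, payload FROM builder_dungeons WHERE use_in_game", conn))
            // SequentialAccess tells Npgsql to stream the row instead of buffering it whole.
            using (var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
            {
                while (reader.Read())
                {
                    var id = reader.GetInt32(0);            // columns must be read in order
                    using (var blob = reader.GetStream(1))  // no 20 MB array is allocated
                    {
                        blob.CopyTo(destination);
                    }
                }
            }
        }
    }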

You should consider storing this binary data in blob storage, for example; the DB entry would then only contain information about where the real data is located in the blob store. This also allows future parallelization, because right now your solution has a potentially huge bottleneck.
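As an illustration of that shape (all property names here are hypothetical, not the question's actual entity):

    // Hypothetical entity: the row keeps metadata plus a reference to the blob,
    // while the actual bytes live in external storage (S3, Azure Blob Storage, ...).
    public class BuilderDungeon
    {
        public int Id { get; set; }
        public bool UseInGame { get; set; }

        // Instead of: public byte[] Payload { get; set; }
        public string PayloadBlobUrl { get; set; }   // where the real data lives
        public long PayloadSizeBytes { get; set; }
    }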

The other thing is that maybe you don't need materialization at this stage, so consider removing ToArray() and processing each entry one at a time instead of loading all of them at once.
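A sketch of what that could look like, where SendToOtherServer stands in for whatever transfer the application already does:

    using (var db = new PsqlContext())
    {
        // No ToArray(): the query is enumerated lazily, so only one entity
        // (plus its payload) needs to be referenced at a time before it is sent on.
        var query = db.BuilderDungeons
            .Where(d => d.UseInGame)
            .Include(d => d.Creator)
            .AsNoTracking();   // no change-tracker references keeping the blobs alive

        foreach (var dungeon in query)
        {
            SendToOtherServer(dungeon);   // placeholder for the existing transfer code
        }
    }

Npgsql still buffers one full row at a time in its default mode, but peak memory is then roughly one record rather than all twenty.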

Regarding the memory size: if you want to see how much memory is still in use after your PG context has been disposed, force a collection with GC.Collect (see the GC.Collect documentation).
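For example (diagnostic use only; forcing collections in normal request flow is not recommended):

    using System;

    // After the DbContext has been disposed, force a full collection and
    // report how much managed memory is actually still reachable.
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
    Console.WriteLine($"Managed heap after collect: {GC.GetTotalMemory(forceFullCollection: true):N0} bytes");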
