
Performance issue when deserializing a large JSON file into objects and adding them to the database

I'm working on an ASP.NET Core 7 MVC project and want to seed a list of countries with their cities into the database from a locally stored JSON file (more than 600,000 lines).

Here is a sample of the JSON file:

[{
    "id": 1,
    "name": "Afghanistan",
    "iso3": "AFG",
    "iso2": "AF",
    "cities": [
      {
        "id": 141,
        "name": "‘Alāqahdārī Dīshū"
      },
      {
        "id": 53,
        "name": "Aībak"
      },
      {
        "id": 50,
        "name": "Andkhoy"
      },
      {
        "id": 136,
        "name": "Āqchah"
      },
      {
        "id": 137,
        "name": "Ārt Khwājah"
      },
      {
        "id": 51,
        "name": "Asadabad"
      },
      {
        "id": 52,
        "name": "Ashkāsham"
      },
      {
        "id": 138,
        "name": "Āsmār"
      },
      {
        "id": 54,
        "name": "Baghlān"
      },
      {
        "id": 55,
        "name": "Balkh"
      }
    ]
  } 
]

This is what I tried to do.

Country Model:

public class Country
{
    [Key]
    [DatabaseGenerated(DatabaseGeneratedOption.None)]
    public int Id { get; set; }
    public string Name { get; set; }
    public string ISO2 { get; set; }
    public string ISO3 { get; set; }

    public ICollection<City> Cities { get; set; }
}

City Model:

public class City
{
    [Key]
    [DatabaseGenerated(DatabaseGeneratedOption.None)]
    public int Id { get; set; }
    public string Name { get; set; }
    public int CountryId { get; set; }
    public Country Country { get; set; }
}

Code for deserializing:

public static async Task SeedCountries(AppDbContext context, IWebHostEnvironment web)
{
    if (!(context.Countries.Count() > 0) || !(context.Cities.Count() > 0))
    {
        string json = await System.IO.File.ReadAllTextAsync(Path.Combine(web.ContentRootPath, "countries.json"));
        var jsonObject = JArray.Parse(json);

        IList<Country> countries = new List<Country>();
        foreach (var item in jsonObject)
        {
            Country country = item.ToObject<Country>();
            var citites = item["cities"] as JArray;
            var citis = new City();
            countries.Add(country);
            foreach (var city in citites)
            {
                City cities = city.ToObject<City>();
                cities.CountryId = country.Id;
                context.Cities.Add(cities);
                context.Entry(cities).State = Microsoft.EntityFrameworkCore.EntityState.Detached;
            }
        }
        await context.Countries.AddRangeAsync(countries);
        await context.SaveChangesAsync();
    }
}

The problem with this approach is performance, as shown in this image:

[image]

You have two problems here:

  1. You are loading your entire 600,000-line JSON file into a single json string. This string will be much larger than 85,000 bytes, so it will be allocated on the large object heap, causing the problems described in "Why Large Object Heap and why do we care?".

  2. You are then parsing that huge string into a JArray which will take up even more memory.
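
To see the first problem in isolation, here is a minimal sketch (not part of the original code) that asks the garbage collector which generation a small and a large string live in. It assumes the standard .NET runtime, where objects allocated on the large object heap are reported as generation 2. The variable names are made up for illustration.

```csharp
using System;

// 100 chars: a few hundred bytes, well under the 85,000-byte threshold.
var small = new string('x', 100);
// 50,000 chars at 2 bytes each is about 100,000 bytes, over the threshold.
var large = new string('x', 50_000);

// A freshly allocated small object normally starts in generation 0.
Console.WriteLine($"small: generation {GC.GetGeneration(small)}");
// Large-object-heap allocations are logically part of generation 2.
Console.WriteLine($"large: generation {GC.GetGeneration(large)}");
```

A 600,000-line file read with ReadAllTextAsync() produces a string far larger than the one above, and JArray.Parse() then builds a complete token tree on top of it.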

I reckon you are creating the intermediate JSON string because you need to use asynchronous file reading, but Json.NET's serializer does not support async deserialization. However, JsonTextReader does support asynchronous reading via JsonTextReader.ReadAsync(), and JToken does support asynchronous loading via JToken.LoadAsync(). Putting those together, it is possible to iterate through a huge JSON array asynchronously, load each item into a JToken asynchronously, then deserialize the token to your final array item (here Country) with bounded memory use.
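
As a rough sketch of how those two calls can be combined (this is an illustration written for this explanation, not the code from any linked answer; the class name JsonArrayStreaming and the method name ReadArrayItemsAsync are hypothetical, and it assumes Newtonsoft.Json 12 or later on .NET Core 3.0+):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JsonArrayStreaming
{
    // Hypothetical helper: yields the items of a root-level JSON array one at a time.
    public static async IAsyncEnumerable<T?> ReadArrayItemsAsync<T>(
        Stream stream,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var serializer = JsonSerializer.CreateDefault();
        // Leave the stream open so the caller stays responsible for disposing it.
        using var textReader = new StreamReader(stream, leaveOpen: true);
        using var reader = new JsonTextReader(textReader) { CloseInput = false };

        // Advance to the first token, which must open an array.
        if (!await reader.ReadAsync(cancellationToken).ConfigureAwait(false)
            || reader.TokenType != JsonToken.StartArray)
        {
            throw new JsonSerializationException("Expected the root of the JSON to be an array.");
        }

        // Each ReadAsync() moves to the start of the next item or to the closing bracket.
        while (await reader.ReadAsync(cancellationToken).ConfigureAwait(false)
               && reader.TokenType != JsonToken.EndArray)
        {
            if (reader.TokenType == JsonToken.Comment)
                continue;

            // Load exactly one array item asynchronously, then deserialize it synchronously.
            var token = await JToken.LoadAsync(reader, cancellationToken).ConfigureAwait(false);
            yield return token.ToObject<T>(serializer);
        }
    }
}
```

Only one item's JToken is held in memory at a time, which is where the bounded memory use comes from. It would be consumed with await foreach (var country in JsonArrayStreaming.ReadArrayItemsAsync<Country>(stream)) { ... }.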

And in fact, this answer to Deserializing to AsyncEnumerable using Newtonsoft.Json has an extension method that does exactly that:

public static partial class JsonExtensions
{
    /// <summary>
    /// Asynchronously load and synchronously deserialize values from a stream containing a JSON array.  The root object of the JSON stream must in fact be an array, or an exception is thrown
    /// </summary>
    public static async IAsyncEnumerable<T?> DeserializeAsyncEnumerable<T>(Stream stream, JsonSerializerSettings? settings = default, [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // See https://stackoverflow.com/a/72502371/3744182 for the body of this method
    }
}

So grab all the code from the JsonExtensions extension class from that answer, and you will be able to create your countries list as follows:

var fileName = Path.Combine(web.ContentRootPath, "countries.json");

var countries = new List<Country>();

await using var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 4096, useAsync: true);
await foreach (var country in JsonExtensions.DeserializeAsyncEnumerable<Country>(stream))
{
    if (country == null)
        continue;
    if (country.Cities != null)
        foreach (var city in country.Cities)
        {
            // Do you need another instance of City?  If so you may clone the current one as follows:
            // city = JToken.FromObject(city).ToObject<City>();
            city.CountryId = country.Id;
            context.Cities.Add(city);
            context.Entry(city).State = Microsoft.EntityFrameworkCore.EntityState.Detached;
        }
    countries.Add(country);
}

Notes:

  • In your sample code, you seem to be deserializing Cities twice. item.ToObject<Country>() will deserialize the Country.Cities list, but then you deserialize the cities again manually in your foreach (var city in citites) loop. It's unclear why you are doing this since the list of cities is already present under the country, so I removed the duplication from my sample code.

    If for some reason you do need to create two instances of each City, you may need to clone them (e.g. by round-tripping them with the JSON serializer) inside the foreach (var city in country.Cities) loop in my sample code.

Demo fiddle #1 here.

As an alternative, as suggested by CodeCaster in the comments, you could switch to System.Text.Json, which has built-in support for asynchronous deserialization of huge JSON arrays via JsonSerializer.DeserializeAsyncEnumerable():

using System.Text.Json;
using System.Text.Json.Serialization;

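// Same file as in the Json.NET example above
var fileName = Path.Combine(web.ContentRootPath, "countries.json");

var countries = new List<Country>();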

await using var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 4096, useAsync: true);
var options = new JsonSerializerOptions
{
    PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
    //PropertyNameCaseInsensitive is required to deserialize "iso2" and "iso3" correctly
    PropertyNameCaseInsensitive = true,
};
await foreach (var country in JsonSerializer.DeserializeAsyncEnumerable<Country>(stream, options))
{
    if (country == null)
        continue;
    if (country.Cities != null)
        foreach (var city in country.Cities)
        {
            // Do you need another instance of City?  If so you may clone the current one as follows:
            // city = JsonSerializer.Deserialize<City>(JsonSerializer.SerializeToUtf8Bytes(city));
            city.CountryId = country.Id;
            context.Cities.Add(city);
            context.Entry(city).State = Microsoft.EntityFrameworkCore.EntityState.Detached;
        }
    countries.Add(country);
}

Demo fiddle #2 here.
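
For trying the System.Text.Json approach without a database, the following self-contained sketch may help. It is only an illustration: the Country and City classes are simplified stand-ins for the question's entities (no EF Core attributes or navigation back-references), SeedDemo and LoadCountriesAsync are hypothetical names, and the JSON is a two-city excerpt of the sample from the question read from a MemoryStream instead of a file. It assumes .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Simplified stand-ins for the question's entity classes.
public class City
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

public class Country
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
    public string ISO2 { get; set; } = "";
    public string ISO3 { get; set; } = "";
    public List<City> Cities { get; set; } = new();
}

public static class SeedDemo
{
    // Hypothetical helper: streams the root JSON array one Country at a time.
    public static async Task<List<Country>> LoadCountriesAsync(Stream stream)
    {
        // Case-insensitive matching lets "iso2"/"iso3" bind to ISO2/ISO3.
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        var countries = new List<Country>();
        await foreach (var country in JsonSerializer.DeserializeAsyncEnumerable<Country>(stream, options))
        {
            if (country != null)
                countries.Add(country);
        }
        return countries;
    }

    public static async Task Main()
    {
        // Two cities taken from the sample JSON in the question.
        const string json = @"[{""id"":1,""name"":""Afghanistan"",""iso3"":""AFG"",""iso2"":""AF"",
            ""cities"":[{""id"":50,""name"":""Andkhoy""},{""id"":55,""name"":""Balkh""}]}]";
        using var stream = new MemoryStream(Encoding.UTF8.GetBytes(json));

        var countries = await LoadCountriesAsync(stream);
        foreach (var c in countries)
            Console.WriteLine($"{c.Name} ({c.ISO3}): {c.Cities.Count} cities");
    }
}
```

In the real seeding method the MemoryStream would be replaced by the FileStream shown earlier, and the loop body would do whatever per-country work the database layer needs.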

