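Performance issue when deserializing a large JSON file into objects and adding them to a database
I'm working on an ASP.NET Core 7 MVC project and want to seed a list of countries into the database from a locally stored JSON file (more than 600,000 lines).
Here is a sample of the JSON file: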
[{
"id": 1,
"name": "Afghanistan",
"iso3": "AFG",
"iso2": "AF",
"cities": [
{
"id": 141,
"name": "‘Alāqahdārī Dīshū"
},
{
"id": 53,
"name": "Aībak"
},
{
"id": 50,
"name": "Andkhoy"
},
{
"id": 136,
"name": "Āqchah"
},
{
"id": 137,
"name": "Ārt Khwājah"
},
{
"id": 51,
"name": "Asadabad"
},
{
"id": 52,
"name": "Ashkāsham"
},
{
"id": 138,
"name": "Āsmār"
},
{
"id": 54,
"name": "Baghlān"
},
{
"id": 55,
"name": "Balkh"
}
]
}
]
This is what I tried.
Country model:
public class Country
{
    [Key]
    [DatabaseGenerated(DatabaseGeneratedOption.None)]
    public int Id { get; set; }
    public string Name { get; set; }
    public string ISO2 { get; set; }
    public string ISO3 { get; set; }
    public ICollection<City> Cities { get; set; }
}
City model:
public class City
{
    [Key]
    [DatabaseGenerated(DatabaseGeneratedOption.None)]
    public int Id { get; set; }
    public string Name { get; set; }
    public int CountryId { get; set; }
    public Country Country { get; set; }
}
Deserialization code:
public static async Task SeedCountries(AppDbContext context, IWebHostEnvironment web)
{
    if (!(context.Countries.Count() > 0) || !(context.Cities.Count() > 0))
    {
        string json = await System.IO.File.ReadAllTextAsync(Path.Combine(web.ContentRootPath, "countries.json"));
        var jsonObject = JArray.Parse(json);
        IList<Country> countries = new List<Country>();
        foreach (var item in jsonObject)
        {
            Country country = item.ToObject<Country>();
            var citites = item["cities"] as JArray;
            var citis = new City();
            countries.Add(country);
            foreach (var city in citites)
            {
                City cities = city.ToObject<City>();
                cities.CountryId = country.Id;
                context.Cities.Add(cities);
                context.Entry(cities).State = Microsoft.EntityFrameworkCore.EntityState.Detached;
            }
        }
        await context.Countries.AddRangeAsync(countries);
        await context.SaveChangesAsync();
    }
}
The problem is performance, as shown in the image below:
You have two problems here:
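1. You are loading the entire 600,000-line JSON file into a single json string. This string will be far larger than 85,000 bytes, so it will be allocated on the large object heap, which leads to the problems described in "Why Large Object Heap and why do we care?".
2. You then parse that huge string into a JArray, which takes up even more memory.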
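To make the 85,000-byte threshold concrete, here is a minimal sketch. It assumes the default .NET runtime settings and ignores the string object's header; LohDemo and its members are hypothetical names used only for illustration. Because .NET strings are UTF-16, each char takes 2 bytes, so a string of roughly 42,500 characters already reaches the threshold, and a 600,000-line file is far beyond it.

```csharp
using System;

public static class LohDemo // hypothetical helper, not part of any library
{
    // Approximate payload of a string: 2 bytes per UTF-16 char (object header ignored).
    public static long ApproximateStringBytes(long charCount) => charCount * sizeof(char);

    // Objects allocated on the large object heap are reported by the GC
    // as belonging to the oldest generation (GC.MaxGeneration).
    public static int GenerationOfNewString(int charCount) =>
        GC.GetGeneration(new string('x', charCount));
}
```

Calling LohDemo.ApproximateStringBytes(42_500) gives 85,000, and on a default runtime LohDemo.GenerationOfNewString(50_000) should report GC.MaxGeneration, which indicates that the string went straight to the large object heap.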
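I reckon you are creating the intermediate JSON string because you need asynchronous file reading, and Json.NET's serializer does not support async deserialization. However, JsonTextReader does support asynchronous reading via JsonTextReader.ReadAsync(), and JToken does support asynchronous loading via JToken.LoadAsync(). Putting them together, you can iterate through a huge JSON array asynchronously, load each item into a JToken asynchronously, and then deserialize that token to your final array item (here Country), all with bounded memory use.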
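The combination described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from the linked answer; the class name JsonArrayStreaming and the method name ReadArrayItemsAsync are hypothetical, and it assumes the stream is UTF-8 and that the JSON root is an array.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JsonArrayStreaming // hypothetical name
{
    public static async IAsyncEnumerable<T?> ReadArrayItemsAsync<T>(
        Stream stream,
        JsonSerializer? serializer = null,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        serializer ??= JsonSerializer.CreateDefault();

        // Leave the stream open so that the caller keeps ownership of it.
        using var textReader = new StreamReader(stream, Encoding.UTF8,
            detectEncodingFromByteOrderMarks: true, bufferSize: 4096, leaveOpen: true);
        using var reader = new JsonTextReader(textReader) { CloseInput = false };

        // The first token must open an array.
        if (!await reader.ReadAsync(cancellationToken).ConfigureAwait(false)
            || reader.TokenType != JsonToken.StartArray)
            throw new JsonSerializationException("Expected the JSON root to be an array.");

        while (await reader.ReadAsync(cancellationToken).ConfigureAwait(false))
        {
            if (reader.TokenType == JsonToken.EndArray)
                yield break;
            if (reader.TokenType == JsonToken.Comment)
                continue;

            // Only the current array item is materialized in memory.
            var token = await JToken.LoadAsync(reader, cancellationToken).ConfigureAwait(false);
            yield return token.ToObject<T>(serializer);
        }
    }
}
```

Only one array item is held as a JToken at a time, so peak memory depends on the largest single country rather than on the size of the whole file.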
In fact, this answer to Deserializing to AsyncEnumerable using Newtonsoft.Json has an extension method that does exactly this:
public static partial class JsonExtensions
{
/// <summary>
/// Asynchronously load and synchronously deserialize values from a stream containing a JSON array. The root object of the JSON stream must in fact be an array, or an exception is thrown
/// </summary>
public static async IAsyncEnumerable<T?> DeserializeAsyncEnumerable<T>(Stream stream, JsonSerializerSettings? settings = default, [EnumeratorCancellation] CancellationToken cancellationToken = default)
{
// See https://stackoverflow.com/a/72502371/3744182 for the body of this method
So, grab all the code from the JsonExtensions extension class in that answer, and you will then be able to build your countries list as follows:
var fileName = Path.Combine(web.ContentRootPath, "countries.json");
var countries = new List<Country>();
await using var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 4096, useAsync: true);
await foreach (var country in JsonExtensions.DeserializeAsyncEnumerable<Country>(stream))
{
    if (country == null)
        continue;
    if (country.Cities != null)
        foreach (var city in country.Cities)
        {
            // Do you need another instance of City? If so you may clone the current one as follows:
            // city = JToken.FromObject(city).ToObject<City>();
            city.CountryId = country.Id;
            context.Cities.Add(city);
            context.Entry(city).State = Microsoft.EntityFrameworkCore.EntityState.Detached;
        }
    countries.Add(country);
}
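Notes:
In your sample code you appear to deserialize the Cities twice. item.ToObject<Country>(); will deserialize the Country.Cities list, but you then deserialize them manually a second time in the foreach (var city in citites) loop. It is unclear why you do this, since the list of cities already exists under the country, so I removed the duplication from my sample code.
If for some reason you really do need two instances of each City, you could clone them inside the foreach (var city in country.Cities) loop in my sample code (for example, by round-tripping them through a JSON serializer).
Demo fiddle #1 here.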
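As an alternative, as suggested by CodeCaster in the comments, you could switch to System.Text.Json, which has built-in support for asynchronous deserialization of huge JSON arrays via JsonSerializer.DeserializeAsyncEnumerable():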
using System.Text.Json;
using System.Text.Json.Serialization;
var fileName = Path.Combine(web.ContentRootPath, "countries.json");
var countries = new List<Country>();
await using var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 4096, useAsync: true);
var options = new JsonSerializerOptions
{
    PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
    // PropertyNameCaseInsensitive is required to deserialize "iso2" and "iso3" correctly
    PropertyNameCaseInsensitive = true,
};
await foreach (var country in JsonSerializer.DeserializeAsyncEnumerable<Country>(stream, options))
{
    if (country == null)
        continue;
    if (country.Cities != null)
        foreach (var city in country.Cities)
        {
            // Do you need another instance of City? If so you may clone the current one as follows:
            // city = JsonSerializer.Deserialize<City>(JsonSerializer.SerializeToUtf8Bytes(city));
            city.CountryId = country.Id;
            context.Cities.Add(city);
            context.Entry(city).State = Microsoft.EntityFrameworkCore.EntityState.Detached;
        }
    countries.Add(country);
}
Demo fiddle #2 here.