
Fastest way to read files in a multi-processing environment? C#

I have the following challenge:

I have an Azure Cloud Worker Role with many instances. Every minute, each instance spins up about 20-30 threads. In each thread, it needs to read some metadata about how to process the thread from 3 objects. The objects/data reside in a remote RavenDb, and even though RavenDb is very fast at retrieving the objects via HTTP, it is still under considerable load from 30+ workers hitting it 3 times per thread per minute (about 45 requests/sec). Most of the time (like 99.999%) the data in RavenDb does not change.

I've decided to implement local storage caching. First, I read a tiny record which indicates whether the metadata has changed (it changes VERY rarely), and then, if local storage has the object cached, I read from local file storage instead of RavenDb. I'm using File.ReadAllText().
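
For illustration, here is a minimal sketch of the check-marker-then-read-local-file pattern described above; the class name, method signature, and file-naming scheme are hypothetical, not the actual code:

    using System;
    using System.IO;

    class LocalFileCache
    {
        private readonly string _cacheDir;

        public LocalFileCache(string cacheDir)
        {
            _cacheDir = cacheDir;
        }

        // Returns the cached JSON if the metadata is unchanged and a local copy exists;
        // otherwise fetches from RavenDb and refreshes the local file.
        public string GetOrFetch(string id, bool metadataChanged, Func<string, string> fetchFromRavenDb)
        {
            string path = Path.Combine(_cacheDir, id + ".json");

            if (!metadataChanged && File.Exists(path))
                return File.ReadAllText(path);   // the File.ReadAllText() call mentioned above

            string json = fetchFromRavenDb(id);
            File.WriteAllText(path, json);
            return json;
        }
    }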

This approach appears to be bogging the machine down, and processing slows down considerably. I'm guessing the disks on "Small" Worker Roles are not fast enough.

Is there any way I can have the OS help me out and cache those files? Perhaps there is an alternative to caching this data?

I'm looking at about ~1000 files of varying sizes, ranging from 100 KB to 10 MB, stored on each Cloud Role instance.

Not a straight answer, but three possible options:

Use the built-in RavenDB caching mechanism

My initial guess is that your caching mechanism is actually hurting performance. The RavenDB client has caching built-in (see here for how to fine-tune it: https://ravendb.net/docs/article-page/3.5/csharp/client-api/how-to/setup-aggressive-caching ).
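
As a rough sketch of what that looks like with the client API from the linked docs (exact namespaces and method signatures can differ between client versions, so treat this as an assumption rather than a drop-in snippet):

    using System;
    using Raven.Client;

    class MetadataReader
    {
        private readonly IDocumentStore _store;

        public MetadataReader(IDocumentStore store)
        {
            _store = store;
        }

        public T LoadWithAggressiveCaching<T>(string id)
        {
            // While this scope is open, the client answers matching requests from its
            // local cache and only re-checks the server once the duration expires.
            using (_store.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
            using (var session = _store.OpenSession())
            {
                return session.Load<T>(id);
            }
        }
    }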

The problem you have is that the cache is local to each server. If server A downloaded a file before, server B will still have to fetch it if it happens to process that file the next time.

One possible option you could implement is to divide the workload (a rough sketch follows the list below). For example:

  • Server A => fetch files that start with A-D
  • Server B => fetch files that start with E-H
  • Server C => ...

This would ensure that you optimize the cache on each server.
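
A hedged sketch of such a split, assuming each instance knows its own index and the total instance count (the helper name and bucketing rule are made up for illustration, not from the original answer):

    using System;

    static class WorkPartitioner
    {
        // Maps a file/document key to one instance via its first letter, giving each
        // instance a contiguous alphabetical range (like the A-D / E-H split above).
        public static bool IsMine(string key, int instanceIndex, int instanceCount)
        {
            if (string.IsNullOrEmpty(key))
                return false;

            int letterIndex = char.ToUpperInvariant(key[0]) - 'A';
            if (letterIndex < 0 || letterIndex > 25)
                letterIndex = 25;                              // keys not starting with a letter go to the last range

            int bucket = letterIndex * instanceCount / 26;     // contiguous alphabetical ranges
            return bucket == instanceIndex;
        }
    }

Each instance would call something like IsMine before doing any work for a key, so a given file only ever ends up cached on one machine.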

Get a bigger machine

If you still want to employ your own caching mechanism, there are two things that I imagine could be the bottleneck:

  • Disk access
  • Deserialization of the JSON

For these issues, the only thing I can imagine would be to get bigger resources:

  • If it's the disk, use premium storage with SSDs.
  • If it's deserialization, get VMs with a bigger CPU.

Cache files in RAM

Alternatively, instead of writing the files to disk, store them in memory and get a VM with more RAM. You shouldn't need THAT much RAM: even in the worst case, 1000 files * 10 MB is only about 10 GB. Doing this would eliminate disk access and deserialization.
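
A minimal sketch of that idea, keeping the already-deserialized objects in a process-wide dictionary (the type and method names are illustrative):

    using System;
    using System.Collections.Concurrent;

    class InMemoryMetadataCache<T>
    {
        private readonly ConcurrentDictionary<string, T> _cache =
            new ConcurrentDictionary<string, T>();

        // Returns the cached, already-deserialized object, loading it only on first use.
        public T GetOrLoad(string id, Func<string, T> loadFromRavenDb)
        {
            return _cache.GetOrAdd(id, loadFromRavenDb);
        }

        // Call this when the tiny "has it changed?" record says the metadata changed.
        public void Invalidate(string id)
        {
            T removed;
            _cache.TryRemove(id, out removed);
        }
    }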

But ultimately, it's probably best to first measure where the bottleneck is and see if it can be mitigated by using RavenDB's built-in caching mechanism.
