Listing more than 10 million records from Oracle with C#

I have a database that contains more than 100 million records. I am running a query that returns more than 10 million of them. This process takes too much time, so I need to shorten it. I want to save the resulting record list as a CSV file. How can I do this as quickly and efficiently as possible? Looking forward to your suggestions. Thanks.

I'm assuming that your query is already constrained to the rows/columns you need, and makes good use of indexing.

At that scale, the only critical thing is that you don't try to load it all into memory at once; so forget about things like DataTable, and most full-fat ORMs (which typically try to associate rows with an identity-manager and/or change-manager). You would have to use either the raw IDataReader (from DbCommand.ExecuteReader), or any API that builds a non-buffered iterator on top of that (there are several; I'm biased towards dapper). For the purposes of writing CSV, the raw data-reader is probably fine, as sketched below.
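For illustration, here is a minimal sketch of that streaming approach, assuming the Oracle.ManagedDataAccess.Client (ODP.NET) provider; the connection string, table, and column names are placeholders:

```csharp
using System;
using System.Globalization;
using System.IO;
using Oracle.ManagedDataAccess.Client; // assumption: the managed ODP.NET provider

class CsvExport
{
    static void Main()
    {
        // Placeholder connection string and query.
        using (var conn = new OracleConnection("User Id=scott;Password=***;Data Source=mydb"))
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "SELECT col1, col2, col3 FROM big_table";
            conn.Open();

            using (var reader = cmd.ExecuteReader())
            using (var writer = new StreamWriter("export.csv"))
            {
                // ODP.NET-specific hint: a bigger fetch buffer means fewer round trips.
                reader.FetchSize = 1 << 20;

                while (reader.Read()) // streams one row at a time; nothing is buffered
                {
                    for (int i = 0; i < reader.FieldCount; i++)
                    {
                        if (i > 0) writer.Write(',');
                        writer.Write(Convert.ToString(reader.GetValue(i), CultureInfo.InvariantCulture));
                    }
                    writer.WriteLine();
                }
            }
        }
    }
}
```

The key point is that the reader yields rows as they arrive, so memory use stays flat regardless of row count; field quoting is omitted here and covered in the list further down.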

Beyond that: you can't make it go much faster, since you are bandwidth constrained. The only way you can get it faster is to create the CSV file at the database server, so that there is no network overhead.

Chances are pretty slim that you need to do this in C#. This is the domain of bulk data loading/exporting (commonly used in data warehousing scenarios).

Many (free) tools (I imagine even Toad by Quest Software) will do this more robustly and more efficiently than anything you could write yourself, on any platform.

I have a hunch that you don't actually need this for an end-user (the simple observation is that the department secretary doesn't actually need to mail out copies of that; it is too large to be useful in that way).

I suggest using the right tool for the job. And whatever you do:

  • do not roll your own datatype conversions
  • use CSV with quoted literals, and think about escaping the double quotes inside them (see the sketch after this list)
  • think of regional options (IOW: always use InvariantCulture for export/import!)
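
A minimal sketch of the last two rules, assuming RFC 4180-style quoting (the Csv.Escape helper is hypothetical, not a library API):

```csharp
using System;
using System.Globalization;

static class Csv
{
    // Quote every field and double any embedded double quotes (RFC 4180 style).
    public static string Escape(object value)
    {
        // InvariantCulture keeps decimal separators and date formats stable
        // regardless of the exporting machine's regional settings.
        string s = Convert.ToString(value, CultureInfo.InvariantCulture) ?? "";
        return "\"" + s.Replace("\"", "\"\"") + "\"";
    }
}
```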

"This process takes too much time so i need to shorten this time. " “这个过程花费了太多时间,所以我需要缩短这个时间。”

This process consists of three sub-processes:

  1. Retrieving > 10m records
  2. Writing the records to file
  3. Transferring the records across the network (my presumption is that you are working with a local client against a remote database)

Any or all of those could be the bottleneck. So, if you want to reduce the total elapsed time, you need to figure out where the time is spent. You will probably need to instrument your C# code to get the metrics, for example as sketched below.
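One rough way to get those metrics: time the row fetches and the file writes separately (the TimedExport helper and its writeRow delegate are hypothetical):

```csharp
using System;
using System.Data;
using System.Diagnostics;
using System.IO;

static class Profiler
{
    // Hypothetical helper: splits total elapsed time between fetching rows
    // (query + network) and writing them locally, to reveal the bottleneck.
    public static void TimedExport(IDataReader reader, TextWriter writer,
                                   Action<IDataReader, TextWriter> writeRow)
    {
        var fetch = new Stopwatch();
        var write = new Stopwatch();

        fetch.Start();
        while (reader.Read())
        {
            fetch.Stop();
            write.Start();
            writeRow(reader, writer); // e.g. the CSV row writer from the sketch above
            write.Stop();
            fetch.Start();
        }
        fetch.Stop();

        Console.WriteLine($"fetching: {fetch.Elapsed}  writing: {write.Elapsed}");
    }
}
```

Note that network transfer time shows up inside the fetch timer here; if fetching dominates on a fast query, the third sub-process is your suspect.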

If it turns out the query is the problem, then you will need to tune it. Indexes won't help here, as you're retrieving a large chunk of the table (> 10%), so increasing the performance of a full table scan is what will help; for instance, increasing memory to avoid disk sorts. Parallel query could be useful (if you have Enterprise Edition and sufficient CPUs). Also check that the problem isn't a hardware issue (spindle contention, dodgy interconnects, etc.).

Could writing to a file be the problem? Perhaps your disk is slow for some reason (e.g. fragmentation), or perhaps you're contending with other processes writing to the same directory.

Transferring large amounts of data across a network is obviously a potential bottleneck. Are you certain you're only sending relevant data to the client?

An alternative architecture: use PL/SQL to write the records to a file on the data server, using BULK COLLECT to retrieve manageable batches of records, and then transfer the file to where you need it at the end, via FTP, perhaps compressing it first.

The real question is why you need to read so many rows from the database (and such a large proportion of the underlying dataset). There are lots of approaches which should make this scenario avoidable, obvious ones being synchronous processing, message queueing and pre-consolidation.

Leaving that aside for now... if you're consolidating or sifting the data, then implementing the bulk of the logic in PL/SQL saves having to haul the data across the network (even if it's just to localhost, there's still a big overhead). Again, if you just want to dump it out into a flat file, implementing this in C# isn't doing you any favours.
