
C# - save values from 1000s of XMLs to Excel

My C# application loops over 5000 files and writes the values of XPaths to cells in an Excel sheet. It is quite slow, processing 40 files a second.

After profiling, I discovered that this line accounts for over 50% of the total time:

XmlDocument.Load(filename);

To write to Excel, I loop over each XPath of each file and do:

worksheet.Cells[row, col] = value;

Would it be faster to load all the XML files into memory at once (they are less than 20 KB each), store them in a collection, and then transpose them all to Excel?

I understand that multi-threading might reduce performance rather than improve it, as the process is IO-bound.

It might not be IO-bound. Most of the time is spent constructing the XML DOM. However, multi-threading would introduce a possible issue, depending on where you're writing the results to Excel. I don't know for sure, but I wouldn't be surprised if you could only access the Office objects from a single thread.

You would have to add an additional step of collecting the results before writing to the Excel object. This would have to be some sort of synchronized collection, with either another thread dedicated to writing to Excel, or a single write pass after all of the files are processed.

Now, going back to the first point: most of the time is spent loading the DOM. Based on the results from http://www.nearinfinity.com/blogs/joe_ferner/performance_linq_to_sql_vs.html, if you still need DOM-related methods, I would look at using XDocument instead. The interface isn't that far off XmlDocument, so it should be an easy adaptation.
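As a rough illustration of the XDocument route, here is a minimal sketch that evaluates a list of XPaths against one document. The class name, method name, and sample XML are made up for the example, and it parses a string so that it runs on its own; with real files you would call XDocument.Load(filename) instead.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;
using System.Xml.XPath; // brings in the XPathSelectElement extension method

public static class XDocumentSketch
{
    // Returns the text of the first element matching each XPath,
    // or an empty string when nothing matches.
    public static List<string> ExtractValues(XDocument doc, IEnumerable<string> xpaths)
    {
        var values = new List<string>();
        foreach (string xpath in xpaths)
        {
            XElement match = doc.XPathSelectElement(xpath);
            values.Add(match != null ? match.Value : string.Empty);
        }
        return values;
    }

    public static void Main()
    {
        // Hypothetical sample document standing in for one of the XML files.
        XDocument doc = XDocument.Parse(
            "<order><id>42</id><customer><name>Ann</name></customer></order>");

        List<string> values = ExtractValues(
            doc, new[] { "/order/id", "/order/customer/name", "/order/missing" });

        Console.WriteLine(string.Join("|", values));
    }
}
```

Whether this is actually faster than XmlDocument for your files is something to confirm with the profiler rather than assume.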

For the fastest XML processing, look into XmlReader. However, this does not give you any DOM functions, and it can be harder to deal with than the two DOM-based methods.
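To give an idea of what the XmlReader approach looks like, here is a small sketch that streams through a document and keeps the text of a few elements. It matches on element name only, not on a full XPath, which is part of what makes this approach harder to work with. The class, method, and element names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlReaderSketch
{
    // Streams through the XML once and records the text content of every
    // element whose name is in 'wanted'. No DOM is built.
    public static Dictionary<string, string> ReadValues(TextReader input, ISet<string> wanted)
    {
        var result = new Dictionary<string, string>();
        string current = null; // name of the element we are currently inside

        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        current = reader.Name;
                        break;
                    case XmlNodeType.Text:
                        if (current != null && wanted.Contains(current))
                            result[current] = reader.Value;
                        break;
                    case XmlNodeType.EndElement:
                        current = null;
                        break;
                }
            }
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample; for a file, pass new StreamReader(filename).
        string xml = "<order><id>42</id><customer><name>Ann</name></customer></order>";
        var values = ReadValues(new StringReader(xml), new HashSet<string> { "id", "name" });

        foreach (KeyValuePair<string, string> pair in values)
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

If two elements in different parts of the document share a name, the later one overwrites the earlier one here, so real code would need to track the path as well.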

So, in short, first try converting to the XDocument methods; that might roughly double your speed. I would then look at making the processing multithreaded (perhaps using PLINQ over the list of files). Finally, if performance is still not enough, try using the XmlReader interface.
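The PLINQ step could look something like the sketch below. It works on in-memory XML strings so that it is self-contained; for real files you would start from the list of file names and call XDocument.Load inside the Select. The class, method, and element names are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class PlinqSketch
{
    // Parses each document on a worker thread and pulls out the first
    // element with the given name. AsOrdered keeps results in input order,
    // which matters if row N in the sheet must correspond to file N.
    public static List<string> ExtractAll(IEnumerable<string> xmlDocuments, string elementName)
    {
        return xmlDocuments
            .AsParallel()
            .AsOrdered()
            .Select(xml =>
                (string)XDocument.Parse(xml).Descendants(elementName).FirstOrDefault()
                ?? string.Empty)
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical stand-ins for the contents of the XML files.
        var documents = new[]
        {
            "<order><id>1</id></order>",
            "<order><id>2</id></order>",
            "<order><note>no id here</note></order>"
        };

        foreach (string id in ExtractAll(documents, "id"))
            Console.WriteLine("[" + id + "]");
    }
}
```

Only the parsing runs in parallel here; the results come back as an ordinary list that a single thread can then write to Excel.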

EDIT, in response to the question of which collection types to use:

I see two basic options for this, depending on how long it takes to process the XML files. If it is a small percentage of the overall process (most time is spent dealing with Excel), just have a List<T>, where T is some representation of the data you need to write to Excel (it could even be a string if that's all you need), with the .Add calls wrapped in lock statements. Then, once XML processing is complete, the Excel writer iterates over this collection.
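A minimal sketch of the first option, assuming T is string and using a made-up process delegate in place of the real XML extraction:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class LockedListSketch
{
    // Runs 'process' on each input in parallel and collects the results
    // in a plain List<string>, guarding only the Add call with a lock.
    public static List<string> Collect(IEnumerable<string> inputs, Func<string, string> process)
    {
        var results = new List<string>();
        var gate = new object();

        Parallel.ForEach(inputs, input =>
        {
            string value = process(input); // the expensive work stays outside the lock
            lock (gate)
            {
                results.Add(value);
            }
        });

        return results; // safe to read: all workers have finished
    }

    public static void Main()
    {
        // ToUpperInvariant is a placeholder for parsing one XML file.
        List<string> results = Collect(new[] { "a", "b", "c" }, s => s.ToUpperInvariant());

        // The order of results is not guaranteed; sort or carry an index if it matters.
        results.Sort(StringComparer.Ordinal);
        Console.WriteLine(string.Join(",", results));
    }
}
```

Because the results arrive in whatever order the workers finish, T would normally carry the target row or file name along with the value.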

Another option, if XML processing takes a while and you're on .NET 4: look at the ConcurrentQueue class. This provides thread safety on its own (and really, now that I look, one of the Concurrent collections could be used in the first case too, either ConcurrentQueue or BlockingCollection). You would then have threads processing XML, and a consumer thread that writes out to Excel.
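The producer/consumer arrangement might be sketched as follows, using BlockingCollection (which wraps a ConcurrentQueue by default). The process delegate and the list standing in for the worksheet are placeholders invented for the example; the bounded capacity of 100 is an arbitrary choice.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ProducerConsumerSketch
{
    // Several producers process inputs in parallel; one consumer drains the
    // queue, so only a single thread ever touches the output.
    public static List<string> Run(IEnumerable<string> inputs, Func<string, string> process)
    {
        var written = new List<string>(); // touched only by the consumer task

        using (var queue = new BlockingCollection<string>(boundedCapacity: 100))
        {
            Task consumer = Task.Factory.StartNew(() =>
            {
                // Blocks until items arrive; ends once CompleteAdding has been
                // called and the queue is empty.
                foreach (string value in queue.GetConsumingEnumerable())
                    written.Add(value); // stand-in for worksheet.Cells[row, col] = value
            }, TaskCreationOptions.LongRunning);

            try
            {
                Parallel.ForEach(inputs, input => queue.Add(process(input)));
            }
            finally
            {
                queue.CompleteAdding(); // always release the consumer
            }

            consumer.Wait();
        }

        return written;
    }

    public static void Main()
    {
        // Trim is a placeholder for parsing one XML file.
        List<string> written = Run(new[] { " a ", " b ", " c " }, s => s.Trim());
        Console.WriteLine(written.Count + " values written");
    }
}
```

The bounded capacity keeps the producers from running far ahead of a slow consumer, which is the likely situation if the Excel writes turn out to be the bottleneck.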

A few other things. Expanding on a comment on the question: if you're doing nothing that needs Excel-specific functions, you could just write out to CSV. The library here, http://www.codeproject.com/Articles/86973/C-CSV-Reader-and-Writer, is rather straightforward to use and handles embedded commas. The downside is the Big Scary Dialogs that Excel throws up when you try to save a CSV. These might be overcome with user training, however.
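For a sense of what handling embedded commas involves, here is a hand-rolled sketch of the usual CSV quoting rule (quote a field that contains a comma, quote, or line break, and double any quotes inside it). This is not the API of the linked library, just an illustration with invented names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvSketch
{
    // Quotes a field only when it contains a character that would
    // otherwise break the row apart.
    public static string EscapeField(string field)
    {
        if (field == null) return string.Empty;

        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes
            ? "\"" + field.Replace("\"", "\"\"") + "\""
            : field;
    }

    // Joins one row of values into a single CSV line.
    public static string ToCsvLine(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(f => EscapeField(f)));
    }

    public static void Main()
    {
        // One line per XML file; File.WriteAllLines could then save them.
        Console.WriteLine(ToCsvLine(new[] { "42", "Smith, Ann", "said \"hi\"" }));
    }
}
```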

Another option, if you aren't doing so already, would be to use the OpenXML library to generate Excel files, provided you're targeting at least Excel 2007 (although Excel 2003 can read xlsx files with an add-in). I imagine that, since this library manipulates XML directly, it would be faster than dealing with Excel interop, and also safer (no dialogs from Excel, no zombie processes, etc.).

It looks to me as if most of the time is being spent in XML parsing, and there is no way of doing this without parsing the XML. So I don't think there's an easy answer.
