
How can I efficiently run XSLT transformations for a large number of files in parallel?

I have to regularly transform a large number of XML files (min. 100K), all within one folder each time (basically, from an unzipped input dataset), and I'd like to learn how to do that as efficiently as possible. My technological stack consists of XSLT stylesheets and the Saxon XSLT Java library, called from Bash scripts. It runs on an Ubuntu server with 8 cores, an SSD RAID, and 64 GB of RAM. Keep in mind that I handle XSLT well, but I'm still in the process of learning Bash and how to distribute the load properly for such tasks (and Java is almost just a word to me at this point too).

I previously created a post about this issue, as my approach seemed very inefficient and I actually needed help to get it running properly (see this SOF post). Many comments later, it makes sense to present the issue differently, hence this post. Several solutions were proposed to me, one of which currently works much better than mine, but it could still be more elegant and efficient.

Now, I'm running this:

printf -- '-s:%s\0' input/*.xml | xargs -P 600 -n 1 -0 java -jar saxon9he.jar -xsl:some-xslt-sheet.xsl

I set 600 processes based on some previous tests; going higher would just trigger memory errors from Java. But it now uses only 30 to 40 GB of RAM (though all 8 cores are at 100%).
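For reference, one way to act on suggestion #3 from the list below is to cap both the worker count and each JVM's heap explicitly, rather than relying on defaults. This is only a sketch under assumptions (the heap size is a placeholder to tune against the real documents, and it still pays the JVM start-up cost once per file):

# hedged variant of the command above: one worker per core instead of 600,
# each JVM capped at 512 MB so 8 workers stay well under the 64 GB of RAM
printf -- '-s:%s\0' input/*.xml |
  xargs -P 8 -n 1 -0 java -Xmx512m -jar saxon9he.jar -xsl:some-xslt-sheet.xsl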

To put it in a nutshell, here are all the suggestions/approaches I have so far:

  1. Split the XML files among subfolders (e.g. 5K files each), and use that to run the transformation scripts in parallel, one per subfolder
  2. Use the Saxon-EE library specifically (which allows multithreaded execution) with the collection() function to parse the XML files
  3. Set up the Java environment with a lower number of tasks, or decrease the memory per process
  4. Tell Saxon whether the XSLT sheets are compatible with libxml/libxslt (isn't that only for XSLT 1.0?)
  5. Use a specialized shell such as xmlsh

I can handle solution #2, and it should directly make it possible to control the loop and load the JVM only once; #1 seems clumsier, and I still need to improve my Bash (load distribution & performance, handling relative/absolute paths), though a rough sketch of it is given below; #3, #4 and #5 are totally new to me and I may need more explanation to see how to tackle them.
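Here is that rough sketch of #1, under assumptions: it relies on Saxon accepting a directory for both -s: and -o: (so each JVM processes its whole batch without restarting), uses symlinks rather than copies, and assumes GNU coreutils for readlink -f; the batch count of 8 simply matches the cores:

mkdir -p output
i=0
for f in input/*.xml; do               # distribute files round-robin over 8 batch folders
  d="batch$(( i++ % 8 ))"
  mkdir -p "$d"
  ln -s "$(readlink -f "$f")" "$d/"    # symlink (or copy/move) into the batch folder
done
for d in batch[0-7]; do                # one JVM per batch, all writing into output/
  java -jar saxon9he.jar -s:"$d" -o:output -xsl:some-xslt-sheet.xsl &
done
wait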

Any input would be greatly appreciated.

"the most efficient way possible" is asking a lot, and is not usually a reasonable objective. “最有效的方法”要求很多,通常不是一个合理的目标。 I doubt, for example, that you would be prepared to put in 6 months' effort to improve the efficiency of the process by 3%. 例如,我怀疑您是否准备花6个月的时间将流程效率提高3%。 What you are looking for is a way of doing it that meets performance targets and can be implemented with minimum effort. 您正在寻找的是一种可以达到性能目标并且可以轻松实现的方法。 And "efficiency" itself begs questions about what your metrics are. 而且“效率”本身就对您的指标提出了疑问。

I'm pretty confident that the design I have suggested, with a single transformation processing all the files using collection() and xsl:result-document (both of which are parallelized in Saxon-EE), is capable of giving good results. It is also likely to be a lot less work than the only other approach I would consider, which is to write a Java application to hold the "control logic"; although if you're good at writing multi-threaded Java applications, you can probably get that to run a bit faster by exploiting your knowledge of the workload.
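For concreteness, a minimal sketch of that single-transformation design might look like the following (the collection URI, the output naming, and the "main" entry point are illustrative assumptions, not the exact stylesheet from the earlier discussion; it would be invoked with something like java -jar saxon9ee.jar -it:main -xsl:bulk.xsl):

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- entry point: there is no principal source document, so start from a named template -->
  <xsl:template name="main">
    <!-- collection() pulls in every XML file in the folder; Saxon-EE can process these in parallel -->
    <xsl:for-each select="collection('input/?select=*.xml')">
      <!-- one output file per input, named after the source document -->
      <xsl:result-document href="output/{tokenize(document-uri(.), '/')[last()]}">
        <xsl:apply-templates select="."/>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>
  <!-- the real transformation templates for the documents would follow here -->
</xsl:stylesheet>

Saxon-EE executes xsl:result-document asynchronously by default, which, together with parallel processing of the collection, is where the parallelism comes from.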

Try using the xsltproc command-line tool from libxslt. It can take multiple XML files as arguments. To call it like that, you'll need to create an output directory first. Try calling it like this:

mkdir output
xsltproc -o output/ some-xslt-sheet.xsl input/*.xml
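Since a single xsltproc process is single-threaded (and, as noted above, libxslt only supports XSLT 1.0), one could combine it with xargs to spread batches of files across the 8 cores. A sketch, assuming the sheet is 1.0-compatible and the batch size of 500 is a placeholder to tune:

printf '%s\0' input/*.xml |
  xargs -0 -n 500 -P 8 xsltproc -o output/ some-xslt-sheet.xsl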
