
Run html2text using parallel

I am using html2text from GitHub, and I was able to run it on all the .html files in my folder using for file in *.html; do html2text "$file" > "$file.txt"; done but it is somewhat slow. How can I use html2text with parallel on all my .html files?

The original answer was:

for file in *.html
do
    html2text "$file" > "$file.txt" & 
done

The & at the end of the command tells bash to run the command in the background and return control to the caller immediately.

I am not sure it will work well for thousands of files, as it spawns a new process for each file.


However, since the OP asked for this to work for millions of files, this is obviously not feasible: it would spawn millions of background processes and could hang the machine.

What you have to understand is that processing millions of files WILL take more time; exactly how much depends on your hardware and OS limits. In principle, a million files need about a million times the work of a single file.

The reason the above answer seemed to finish instantly for hundreds of files is that you got the command prompt back immediately. That does not mean the work was finished at that point: the background processes may still be running until they complete, even though you can do something else in the meantime.
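
To see when the background work is really done, the loop can be followed by the shell's wait builtin. This is a minimal sketch, not part of the original answer; the function name convert_all and its optional converter argument are hypothetical and exist only so the loop can be tried with a stand-in command.

```shell
# Sketch: the background loop from the original answer, plus `wait`.
# convert_all and its optional argument are hypothetical names.
convert_all() {
    converter="${1:-html2text}"        # command to run on each file
    for file in *.html; do
        [ -e "$file" ] || continue     # skip the literal pattern if nothing matches
        "$converter" "$file" > "$file.txt" &
    done
    wait                               # block until every background job has exited
}
```

Running time convert_all then reports the time until all conversions have finished, rather than the time needed to launch them.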

You could theoretically divide the file list into chunks and work through it chunk by chunk. However, after testing this approach, I do not think you will get the end result much faster than with parallel.

So, given the number of files you have to process, I would suggest running parallel, as you found out yourself, though probably with the number of parallel jobs raised significantly.

Something like this should work:
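
For comparison, the chunk-by-chunk idea could look roughly like the sketch below. It is an illustration under assumptions, not the code that was tested for this answer; convert_in_batches, its batch size argument, and the optional converter argument are hypothetical names.

```shell
# Sketch: start at most batch_size background jobs, wait for them, repeat.
# convert_in_batches and its arguments are hypothetical names.
convert_in_batches() {
    batch_size="$1"
    converter="${2:-html2text}"
    count=0
    for file in *.html; do
        [ -e "$file" ] || continue     # skip the literal pattern if nothing matches
        "$converter" "$file" > "$file.txt" &
        count=$((count + 1))
        if [ "$count" -ge "$batch_size" ]; then
            wait                       # let the whole batch finish before starting more
            count=0
        fi
    done
    wait                               # wait for the last, possibly partial, batch
}
```

Each batch runs only as fast as its slowest file, which is one reason a job scheduler such as parallel tends to keep the machine busier.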

find . -type f -name '*.html' > FLIST
parallel -a FLIST -j 1000 'html2text {} > {.}.txt'

Note that this is the syntax for the OP's Python version of html2text. For options using, e.g., the html2text binary package available in the Ubuntu distribution, please see the previous edit of the answer.

This keeps up to 1000 conversions running at a time and does not use piping (which can sometimes slow things down considerably).

If this is too slow, try increasing -j to maybe 10000, but then you are venturing into the hardware and operating system limits of keeping 10000 parallel processes running all the time.

For others in a similar situation: using parallel cut the time by more than half.
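
One hedged way to look for a good -j value is to start from the number of CPU cores and raise it from there while timing each run. The sketch below assumes GNU parallel (not the moreutils command of the same name); convert_with_parallel and its three arguments are hypothetical names, and the converter argument exists only so a stand-in command can be used for a trial run.

```shell
# Sketch: run the converter over a file list with a configurable job count.
# convert_with_parallel and its arguments are hypothetical; assumes GNU parallel.
convert_with_parallel() {
    list_file="$1"                                # file with one path per line
    jobs="${2:-$(getconf _NPROCESSORS_ONLN)}"     # default: one job per online CPU
    converter="${3:-html2text}"
    # {} is the input path, {.} is the same path without its extension
    parallel -a "$list_file" -j "$jobs" "$converter {} > {.}.txt"
}
```

For example, after building FLIST with find, time convert_with_parallel FLIST 8 and time convert_with_parallel FLIST 64 can be compared on a small sample of the files before committing to a very large job count.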
