Extracting info from html using PHP(XPath), PHP/Python(Regexp) or Python(XPath)

I have 40k+ HTML documents that I need to extract information from. I have tried doing so with PHP + Tidy (because most files are not well-formed) + DOMDocument + XPath, but it is extremely slow. I have been advised to use regexp, but the HTML files are not marked up semantically (table-based layout, with meaningless tags/classes used everywhere), and I don't know where I should start.

Just out of curiosity, is using regexp (PHP/Python) faster than using Python's XPath library? Is the XPath library for Python generally faster than PHP's counterpart?

If speed is a requirement, have a look at lxml. lxml is a Pythonic binding for the libxml2 and libxslt C libraries. Using the C libraries is much faster than any pure PHP or Python version.
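To give an idea of what that looks like, here is a minimal sketch, assuming lxml is installed. The sample markup, the class names and the `extract_rows` helper are made up for illustration; they are not taken from the asker's files. It relies on `lxml.html.fromstring`, which uses libxml2's recovering HTML parser, so unclosed cells and rows are tolerated.

```python
from lxml import html

# Hypothetical sample: sloppy table markup with unclosed <td>/<tr> tags
# and unquoted, meaningless class attributes.
RAW = """
<table>
  <tr><td class=x1>Name<td class=x2>Widget
  <tr><td class=x1>Price<td class=x2>9.99
</table>
"""

def extract_rows(raw):
    """Return {label: value} for every two-cell row in the markup."""
    tree = html.fromstring(raw)  # tolerant parse, no Tidy pass needed
    result = {}
    for row in tree.xpath("//tr"):
        # text_content() flattens any nested formatting tags in a cell
        cells = [cell.text_content().strip() for cell in row.xpath("./td")]
        if len(cells) == 2:
            result[cells[0]] = cells[1]
    return result

if __name__ == "__main__":
    print(extract_rows(RAW))
```

With a table-based layout, positional XPath like this is often all you can lean on, since the class names carry no meaning.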

There are some impressive benchmarks from Ian Bicking:

In Conclusion

I knew lxml was fast before I started these benchmarks, but I didn't expect it to be quite this fast.

Parsing Results:

[Image: Parsing Results] http://1.2.3.9/bmi/blog.ianbicking.org/wp-content/uploads/images/parsing-results.png

You might give Beautiful Soup in Python a try. It's a pretty great parser for generating a usable DOM out of garbage HTML. That, combined with some regex skills, might get you what you need. Happy hunting!
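A rough sketch of that combination, assuming the `bs4` package is installed. The markup, the "Price" label and the `find_price` helper are invented for illustration. The idea is to locate a cell by its visible text with a regex, since the classes mean nothing, then read the neighbouring cell.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: presentational tags and a class that says nothing.
RAW = """
<table><tr>
  <td class="c7"><font size=2>Price:</font></td>
  <td class="c7"><b>$9.99</b></td>
</tr></table>
"""

def find_price(raw):
    """Return the number in the cell right after the 'Price' label, or None."""
    soup = BeautifulSoup(raw, "html.parser")
    label = soup.find(string=re.compile(r"Price"))  # match on visible text
    if label is None:
        return None
    label_cell = label.find_parent("td")
    value_cell = label_cell.find_next_sibling("td") if label_cell else None
    if value_cell is None:
        return None
    # A plain regex finishes the job on the already-isolated text.
    match = re.search(r"\d+(?:\.\d+)?", value_cell.get_text(strip=True))
    return match.group(0) if match else None

if __name__ == "__main__":
    print(find_price(RAW))
```

The regex only ever runs on short, already-isolated strings here, which is a lot less fragile than running it over the raw HTML.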

In my subjective experience, most comparable operations are faster in Python than in PHP. This is partly because Python compiles modules to bytecode rather than interpreting the source directly at runtime, and partly because its contributors have optimized it for efficiency...

Still, for 40k+ documents, find a nice fast machine ;-)

As the previous post mentions, Python in general is faster than PHP due to bytecode compilation (those .pyc files). A lot of DOM/SAX parsers use a fair bit of regexp internally anyway. Those who told you to use regexp need to be told that it is not a magic bullet. For 40k+ documents I would recommend parallelizing the task using the new multi-threads or the classic Parallel Python.
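As a sketch of the parallel approach, here is one way to fan the work out with a thread pool from the standard library's `concurrent.futures`. Note that this is a swapped-in technique, not Parallel Python itself. The `extract_title` step and the sample documents are placeholders for whatever extraction you actually need.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Placeholder extraction step: pull the <title> text out of one document.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(html_text):
    """Return the stripped <title> text, or None if there is no title."""
    match = TITLE_RE.search(html_text)
    return match.group(1).strip() if match else None

def extract_all(documents, max_workers=4):
    """Run extract_title over all documents; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_title, documents))

if __name__ == "__main__":
    sample = [
        "<html><head><TITLE> First </TITLE></head></html>",
        "<html><body>no title here</body></html>",
    ]
    print(extract_all(sample))
```

Threads mostly help when the work is I/O-bound, such as reading 40k files from disk. If the extraction itself is CPU-bound pure Python, `ProcessPoolExecutor` from the same module offers the same `map` interface and sidesteps the GIL.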


 