Extract all text from an HTML page without losing context
For a translation program, I am trying to extract text from an HTML file with about 95% accuracy, so that the sentences and links can be translated.
For example:
<div><a href="stack">Overflow</a> <span>Texts <b>go</b> here</span></div>
This should give me two results to translate:
Overflow
Texts <b>go</b> here
Are there any suggestions or commercial packages available for this problem?
I'm not exactly sure what you're asking, but take a look at simplehtmldom. Specifically, see the "Extract Contents from HTML" tab under Quick Start on its front page (I can't link to it directly, sigh). With that you can extract the text of a website without all those pesky tags.
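If you want to keep inline markup such as <b> inside each sentence, as in the question's example, here is a rough sketch of one way to do it. It uses Python's standard html.parser module rather than simplehtmldom, so it is a swapped-in technique and not what the library above does. The names SegmentExtractor, extract_segments, and INLINE_TAGS are made up for illustration, and the list of inline tags is an assumption you would likely need to adjust.

```python
from html.parser import HTMLParser

# Assumption: these tags count as inline formatting and stay inside a segment.
# Every other tag (div, span, a, p, ...) is treated as a segment boundary.
INLINE_TAGS = {"b", "i", "em", "strong", "u", "sub", "sup", "code"}


class SegmentExtractor(HTMLParser):
    """Collects translatable text segments, keeping inline tags in place."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._buf = []

    def _flush(self):
        # Close the current segment; whitespace-only segments are dropped.
        text = "".join(self._buf).strip()
        self._buf = []
        if text:
            self.segments.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            self._buf.append(f"<{tag}>")  # attributes are intentionally dropped
        else:
            self._flush()

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS:
            self._buf.append(f"</{tag}>")
        else:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text after the last tag


def extract_segments(html):
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


if __name__ == "__main__":
    sample = '<div><a href="stack">Overflow</a> <span>Texts <b>go</b> here</span></div>'
    print(extract_segments(sample))
    # ['Overflow', 'Texts <b>go</b> here']
```

This is only a starting point: it does not skip script or style contents, it does not collect translatable attributes such as link titles, and character references are decoded without being re-escaped, so real pages would need more care.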