
Extract all text from an HTML page without losing context

For a translation program, I am trying to extract the text of an HTML file with roughly 95% accuracy so that the sentences and links can be translated.

For example:

<div><a href="stack">Overflow</a> <span>Texts <b>go</b> here</span></div>

This should give me two results to translate:

Overflow

Texts <b>go</b> here

Are there any suggestions or commercial packages available for this problem?

I'm not exactly sure what you're asking, but take a look at simplehtmldom, specifically the "Extract Contents from HTML" tab under quick start on its front page (I can't link to it directly, sigh). With that you can extract the text of a page without all those pesky tags.
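If the goal is the segmentation shown in the question (inline markup such as <b> kept inside a sentence, other tags treated as boundaries), here is a rough sketch of one way to do it. It is not simplehtmldom's API; it uses Python's standard html.parser module instead. The names SegmentExtractor and extract_segments, and the choice of which tags count as inline, are my own assumptions and would need tuning for real pages.

```python
from html.parser import HTMLParser

# Assumption: these tags are inline formatting and stay inside a segment.
INLINE_TAGS = {"b", "i", "em", "strong", "u"}
# Assumption: text inside these tags is never worth translating.
SKIPPED_TAGS = {"script", "style"}


class SegmentExtractor(HTMLParser):
    """Hypothetical helper: collects translatable segments from HTML."""

    def __init__(self):
        super().__init__()
        self.segments = []      # finished segments, in document order
        self._buffer = []       # pieces of the segment being built
        self._has_text = False  # True once the buffer holds non-blank text
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def _flush(self):
        # Close the current segment; drop it if it holds no real text.
        if self._has_text:
            self.segments.append("".join(self._buffer).strip())
        self._buffer = []
        self._has_text = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        if tag in INLINE_TAGS:
            # Keep the inline tag exactly as it was written in the source.
            self._buffer.append(self.get_starttag_text())
        else:
            # Any other tag is treated as a segment boundary.
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        if tag in INLINE_TAGS:
            self._buffer.append(f"</{tag}>")
        else:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._buffer.append(data)
        if data.strip():
            self._has_text = True


def extract_segments(html):
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up trailing text that follows the last tag
    return parser.segments
```

Calling extract_segments on the snippet from the question returns the two strings asked for, "Overflow" and "Texts <b>go</b> here". Note that html.parser decodes character references such as &amp; in the text, so they would need re-escaping before the translated text is written back into the page.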
