Extract all text from an HTML page without losing context

Question

For a translation program, I am trying to extract text from an HTML file with roughly 95% accuracy so that I can translate the sentences and links.

For example:

<div><a href="stack">Overflow</a> <span>Texts <b>go</b> here</span></div>

This should give me two results to translate:

Overflow

Texts <b>go</b> here

Are there any suggestions or commercial packages available for this problem?

Answer 1

I'm not exactly sure what you're asking, but take a look at simplehtmldom. Specifically, see the "Extract Contents from HTML" tab under Quick Start on its front page (I can't link to it directly, sigh). With that you can extract the text of a website without all those pesky tags.
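
If the goal is to keep inline markup such as <b> inside each sentence rather than strip every tag, one possible approach is to treat only a small set of inline formatting tags as part of the text and to treat every other tag as a boundary between segments. Here is a minimal sketch of that idea. It uses Python's standard html.parser module instead of simplehtmldom, and the names INLINE_TAGS, SegmentExtractor and extract_segments are made up for illustration. The choice of which tags count as inline is an assumption you would need to tune for your pages.

```python
from html.parser import HTMLParser

# Assumption: these tags are formatting inside a sentence, not boundaries.
INLINE_TAGS = {"b", "i", "em", "strong", "u"}


class SegmentExtractor(HTMLParser):
    """Collects text segments, keeping inline tags and splitting on all others."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._parts = []        # pieces of the segment being built
        self._has_text = False  # True once the segment has non-blank text

    def _flush(self):
        # Close the current segment; keep it only if it holds real text.
        segment = "".join(self._parts).strip()
        if segment and self._has_text:
            self.segments.append(segment)
        self._parts = []
        self._has_text = False

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            # Keep the tag exactly as written in the source.
            self._parts.append(self.get_starttag_text())
        else:
            self._flush()

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS:
            self._parts.append(f"</{tag}>")
        else:
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self._has_text = True
        self._parts.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any text left after the last tag


def extract_segments(html):
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


example = '<div><a href="stack">Overflow</a> <span>Texts <b>go</b> here</span></div>'
print(extract_segments(example))
```

For the example from the question this yields the two segments "Overflow" and "Texts <b>go</b> here". Note that this sketch does not skip the contents of script or style elements and does not collect attributes such as href, so real pages would need extra handling for both.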

solution1
0 2010-05-07 03:07:04