Preserve Line Breaks - Simple HTML DOM Parser

Question

When using PHP Simple HTML DOM Parser, is it normal that line break
tags are stripped out?

Answer 1

I know this is old, but I was looking for this as well, and realized there is actually a built-in option to turn off the removal of line breaks. No need to go editing the source.

The PHP Simple HTML DOM Parser's load function supports several useful parameters:

load($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)

When calling the load function, simply pass false as the third parameter.

$html = new simple_html_dom();
$html->load("<html><head></head><body>stuff</body></html>", true, false);

If using file_get_html, it's the ninth parameter.

file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)

Edit: For str_get_html, it's the fifth parameter (thanks, yitwail).

str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
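As a minimal sketch of the str_get_html route, the wrapper below passes false in the fifth position and repeats the defaults quoted above for the arguments before it. The name parse_keeping_line_breaks is hypothetical, and the code assumes simple_html_dom.php has already been included and matches the signatures quoted in this answer.

```php
<?php
// Hypothetical helper, not part of Simple HTML DOM Parser.
// Assumes simple_html_dom.php has already been loaded with require_once.
function parse_keeping_line_breaks(string $markup)
{
    if (!function_exists('str_get_html')) {
        throw new RuntimeException('simple_html_dom.php has not been included');
    }

    // Arguments 2 to 4 repeat the defaults from the signature quoted above;
    // only the fifth argument ($stripRN) is changed.
    return str_get_html($markup, true, true, DEFAULT_TARGET_CHARSET, false);
}
```

Usage would then be something like $dom = parse_keeping_line_breaks($markup); followed by echo $dom->save(); to check that the line breaks are still present.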

Answer 2

Was struggling with this as well, since I needed the HTML to be easily editable after processing.

Apparently there's a boolean in the SimpleHTMLDOM script, $stripRN, that's set to true by default. It strips the \r, \n or \r\n characters in the HTML.

Set the variable to false (there are several occurrences in the script) and your problem is solved.

Answer 3

You don't have to change every $stripRN to false; the only one that affects this behavior is at line 816 (shown here already set to false):

// load html from string
function load($str, $lowercase=true, $stripRN=false, $defaultBRText=DEFAULT_BR_TEXT) {

Also consider changing line 988, because the multibyte functions are often not installed on machines that do not deal with non-Western-European languages. The original line in v1.5 breaks the script immediately on such machines, so guard the call:

if (function_exists('mb_detect_encoding')) { $charset = mb_detect_encoding($this->root->plaintext . "ascii", $encoding_list = array( "UTF-8", "CP1252" ) ); } else { $charset = false; }

Answer 4

If you were passing by here wondering if you can do the same thing in DOMDocument, then I'm pleased to say you can! But it's a bit dirty :(

I had a snippet of code I wanted to tidy but retain the exact line breaks it contained (\n). This is what I did:

// NOTE: If your HTML isn't a full HTML document then expect DOMDocument to
// start creating its own DOCTYPE, head and body tags.


// Convert \n into a pretend tag
$myContent = preg_replace("/[\n]/","<img src=\"slashN\" />",$myContent);

// Do your DOM stuff...
$dom = new DOMDocument;
$dom->loadHTML($myContent);
$dom->formatOutput = true;

$myContent = $dom->saveHTML();

// Remove the \n's that DOMDocument put in itself
$myContent = preg_replace("/[\n]/","",$myContent);

// Put my own \n's back (the trailing slash is optional because
// saveHTML() may write the tag as <img src="slashN">)
$myContent = preg_replace("/<img src=\"slashN\"\s*\/?>/i","\n",$myContent);

It's important to note that I know, without a shadow of a doubt, that my input contained only \n. You may want your own variations if \r\n or \t needs to be accounted for, e.g. slashT or slashRN.
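A sketch of how those variations might look, assuming one placeholder tag per whitespace sequence. The function names protect_whitespace and restore_whitespace, and the slashRN and slashT placeholders, are made up for illustration and are not part of DOMDocument.

```php
<?php
// Hypothetical helpers extending the placeholder trick to "\r\n" and "\t".

// Swap each whitespace sequence for its own placeholder tag.
function protect_whitespace(string $html): string
{
    // "\r\n" is listed first so it is not split into "\r" plus a slashN tag.
    return str_replace(
        ["\r\n", "\n", "\t"],
        ['<img src="slashRN" />', '<img src="slashN" />', '<img src="slashT" />'],
        $html
    );
}

// Undo protect_whitespace() on serialized output.
function restore_whitespace(string $html): string
{
    $map = ['RN' => "\r\n", 'N' => "\n", 'T' => "\t"];

    // Drop the newlines the serializer added.
    $html = str_replace("\n", '', $html);

    // Accept both <img ... /> and <img ...>, since the trailing slash may be dropped.
    return preg_replace_callback(
        '/<img src="slash(RN|N|T)"\s*\/?>/i',
        function (array $m) use ($map) {
            return $map[strtoupper($m[1])];
        },
        $html
    );
}
```

The DOMDocument step sits between the two calls: pass protect_whitespace($myContent) to loadHTML(), then run restore_whitespace() on the result of saveHTML(). Like the original trick, this only suits input whose line breaks and tabs sit in text content; a newline inside a tag would get a placeholder inserted into the tag itself.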

Answer 5

Another option, should one wish to preserve other formatting such as paragraphs and headings, is to use innertext rather than plaintext and then perform your own string cleaning on the result.

I realise there is a performance hit, but it does allow for more granular control.
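One way the string cleaning step might look, as a rough sketch: keep paragraph and heading tags and drop everything else with PHP's built-in strip_tags. The function name clean_innertext and the list of kept tags are assumptions; the argument stands for a string read from an element's innertext property.

```php
<?php
// Hypothetical cleaner: keep only paragraph and heading tags.
function clean_innertext(string $innertext): string
{
    $keep = '<p><h1><h2><h3><h4><h5><h6>';

    // strip_tags() removes every tag not listed in $keep
    // but leaves the text between the tags in place.
    return trim(strip_tags($innertext, $keep));
}
```

Note that strip_tags leaves attributes on the kept tags untouched, so further cleaning may still be needed depending on the input.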

5 answers

solution1
54 2012-02-22 02:44:57

solution2
21 2011-09-29 13:49:28

solution3
2 2011-11-15 23:27:27

solution4
1 2014-02-28 22:14:21

solution5
-2 2012-03-25 11:33:30
