I have a unique situation... I need to take a 12,000+ character string and split it into segments of roughly 1,000 characters. The trick is that I need to avoid breaking paragraphs. I'm wondering if preg_match_all might be the best solution. Currently I'm using a simple str_split() by character count, but I need the split to respect the paragraph tags (<p></p>).
Has anyone done this before? Can you offer any tips on how I can accomplish this?
Using PHP's DOMDocument, you can parse the HTML and then loop over each paragraph, doing whatever truncation you need.
In my example code below, I assume that you'll want to remove any HTML tags from within the paragraph text before limiting the text to 1,000 characters. Otherwise, the HTML tags would count as characters and you'd end up with fewer than 1,000 readable characters.
// create a new DOMDocument
$doc = new DOMDocument();
// load the string into the DOM (this is your 12,000 character string)
$doc->loadHTML('<p>Paragraph text</p><p>Paragraph text</p><p>Paragraph text</p><p>Paragraph text</p>');
$paragraph_fragments = array();
//Loop through each <p> tag in the dom and do... things to it
foreach ($doc->getElementsByTagName('p') as $paragraph) {
    // get the node's text and remove excess space; nodeValue already
    // returns the text content without any internal HTML tags
    $text = trim($paragraph->nodeValue);
    // get the first 1000 characters from the string
    // (substr counts bytes; mb_substr is the multibyte-safe alternative)
    array_push($paragraph_fragments, substr($text, 0, 1000));
}
print_r($paragraph_fragments);
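Simple way (assuming paragraphs are delimited by new lines).
First break the text up into paragraphs, then concatenate them back together into chunks of about 1,000 characters. The code shows two alternatives; use whichever matches your requirement.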
NOTE - This example was written before HTML paragraphs were specified in the question
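$hugeText = "...";
$paragraphSep = "\n";
$paragraphs = explode($paragraphSep, $hugeText);

// Option 1: if it's ok to go over (a chunk is closed once it reaches 1000 characters)
$chunks = array();
$curChunk = '';
foreach ($paragraphs as $paragraph)
{
    $curChunk .= ($curChunk === '' ? '' : $paragraphSep) . $paragraph;
    if (strlen($curChunk) >= 1000)
    {
        $chunks[] = $curChunk;
        $curChunk = '';
    }
}
// keep the last, partially filled chunk
if ($curChunk !== '')
{
    $chunks[] = $curChunk;
}

// Option 2: if it's not ok to go over
// (a single paragraph longer than the limit still becomes its own chunk)
$chunks = array();
$curChunk = '';
foreach ($paragraphs as $paragraph)
{
    if ($curChunk !== '' && strlen($curChunk) + strlen($paragraphSep) + strlen($paragraph) >= 1000)
    {
        $chunks[] = $curChunk;
        $curChunk = $paragraph;
    }
    else
    {
        $curChunk .= ($curChunk === '' ? '' : $paragraphSep) . $paragraph;
    }
}
// keep the last, partially filled chunk
if ($curChunk !== '')
{
    $chunks[] = $curChunk;
}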
Edit: The paragraphs are now HTML rather than plain text.
The basic premise still works: break the paragraphs apart, then merge them back together into chunks.
It is best to break apart the HTML paragraphs using a DOM parser rather than string functions.
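To make the combined approach concrete, here is a minimal sketch that uses DOMDocument to pull out each <p> element and then merges whole paragraphs into chunks. It is an illustration under assumptions rather than a definitive implementation: chunkParagraphs() and its $limit parameter are hypothetical names, lengths are measured with strlen() on the serialized markup (so tags count toward the limit), and the input is assumed to be ASCII or otherwise safe to pass to loadHTML() without an encoding hint.

```php
<?php
// Hypothetical helper: group <p> elements into chunks of at most $limit
// characters without ever splitting a paragraph.
function chunkParagraphs(string $html, int $limit = 1000): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $chunks = array();
    $curChunk = '';
    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        // saveHTML($node) serializes the paragraph with its tags intact
        $p = $doc->saveHTML($paragraph);
        // Close the current chunk if adding this paragraph would exceed the limit
        if ($curChunk !== '' && strlen($curChunk) + strlen($p) > $limit) {
            $chunks[] = $curChunk;
            $curChunk = '';
        }
        $curChunk .= $p;
    }
    // Keep the last, partially filled chunk
    if ($curChunk !== '') {
        $chunks[] = $curChunk;
    }
    return $chunks;
}
```

Calling chunkParagraphs($yourHtml) returns an array of HTML strings, each made of whole paragraphs. A single paragraph longer than the limit still ends up alone in its own oversized chunk, since the requirement is to never break a paragraph.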