
Split a series of paragraphs into chunks by character count without breaking paragraphs

I have a unique situation: I need to take a 12,000+ character string and split it into segments of roughly 1,000 characters. The trick is that I need to avoid breaking paragraphs. I'm wondering if preg_match_all might be the best solution. Currently I'm using a simple str_split() by character count, but I need the split to fall on paragraph tags (<p></p>).

Has anyone done this before? Can you offer me any tips on how I can accomplish this?

Using PHP's DOMDocument, you can parse the HTML and then loop over each paragraph, doing whatever truncation you need.

In the example code below, I assume that you'll want to drop any HTML tags inside the paragraph text before limiting it to 1,000 characters. Otherwise the tags would count as characters and you'd end up with fewer than 1,000 readable characters.

    // create a new DOMDocument
    $doc = new DOMDocument();

    // load the string into the DOM (this is your 12,000 character string)
    $doc->loadHTML('<p>Paragraph text</p><p>Paragraph text</p><p>Paragraph text</p><p>Paragraph text</p>');

    $paragraph_fragments = array();
    // loop through each <p> tag in the DOM
    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        // nodeValue is the node's text with inner HTML tags already left out; trim excess space
        $text = trim($paragraph->nodeValue);
        // keep the first 1000 characters (mb_substr counts characters, substr counts bytes)
        $paragraph_fragments[] = mb_substr($text, 0, 1000, 'UTF-8');
    }
    print_r($paragraph_fragments);

A simple way, assuming paragraphs are delimited by newlines:

First break the text up into paragraphs, then concatenate them back together into chunks.

NOTE - This example was written before HTML paragraphs were specified in the question.

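    $hugeText = "...";

    $paragraphSep = "\n";

    $paragraphs = explode($paragraphSep, $hugeText);

    // Variant 1: if it's ok to go over 1000 characters
    $chunks = array();
    $curChunk = '';
    foreach ($paragraphs as $paragraph)
    {
      // only put the separator between paragraphs, not at the start of a chunk
      $curChunk .= ($curChunk === '' ? '' : $paragraphSep) . $paragraph;
      if (strlen($curChunk) >= 1000)
      {
         $chunks[] = $curChunk;
         $curChunk = '';
      }
    }
    // keep whatever is left after the last paragraph
    if ($curChunk !== '')
    {
      $chunks[] = $curChunk;
    }

    // Variant 2: if it's not ok to go over (use this loop instead of variant 1)
    $chunks = array();
    $curChunk = '';
    foreach ($paragraphs as $paragraph)
    {
      if ($curChunk === '')
      {
         $curChunk = $paragraph;
      }
      elseif (strlen($curChunk) + strlen($paragraphSep) + strlen($paragraph) > 1000)
      {
         // adding this paragraph would go over, so close the current chunk first
         $chunks[] = $curChunk;
         $curChunk = $paragraph;
      }
      else
      {
         $curChunk .= $paragraphSep . $paragraph;
      }
    }
    // keep whatever is left after the last paragraph
    if ($curChunk !== '')
    {
      $chunks[] = $curChunk;
    }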

Edit: The paragraphs are now HTML rather than plain text.

The basic premise still works: break the paragraphs apart, then merge them back together.
It's best to break HTML paragraphs apart using a DOM parser.
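
The two steps described in this edit could be combined roughly as follows. This is only a minimal sketch, assuming the input is a flat run of <p> elements and that length is measured on the readable text rather than on the markup. The function name chunkParagraphs and the $limit parameter are hypothetical, introduced for illustration; the `<?xml encoding="UTF-8">` prefix is a common workaround to make loadHTML read the input as UTF-8.

```php
<?php
// Hypothetical helper: group <p> elements into chunks whose readable text
// stays within $limit characters, without ever splitting a paragraph.
function chunkParagraphs(string $html, int $limit = 1000): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Workaround (assumption: input is UTF-8): hint the encoding to libxml.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $chunks = array();
    $curChunk = '';
    $curLength = 0;
    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        // Count readable characters only; textContent leaves the tags out.
        $length = mb_strlen(trim($paragraph->textContent), 'UTF-8');
        // Close the current chunk when this paragraph would push it over the limit.
        if ($curChunk !== '' && $curLength + $length > $limit) {
            $chunks[] = $curChunk;
            $curChunk = '';
            $curLength = 0;
        }
        // saveHTML($node) serializes the paragraph together with its inner markup.
        $curChunk .= trim($doc->saveHTML($paragraph));
        $curLength += $length;
    }
    // Keep whatever is left after the last paragraph.
    if ($curChunk !== '') {
        $chunks[] = $curChunk;
    }
    return $chunks;
}
```

Calling chunkParagraphs($hugeHtml) returns an array of HTML strings, each made of whole paragraphs. A single paragraph longer than the limit still ends up alone in an oversized chunk, since the goal is never to split inside a <p>.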
