Split long text into paragraphs on EOL and <p> tags

I need to split a long text into paragraphs in order to do some manipulation.

The goals:

  1. Split the long text into paragraphs on any combination of newline characters, <p>, and <p id="" class="" style=""> (with any combination of id, class, or style attributes).
  2. Retain the <p> tags for when I put the text back together.

Here's what I have so far:

$paragraphs = preg_split('/\r\n|\n|\r|<p?>/', $content, -1, PREG_SPLIT_NO_EMPTY);

Here are the issues with it:

  1. It doesn't split on <p class="">.
  2. It doesn't retain the <p> tag.
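
Both issues can be reproduced with a short test string. The snippet below is only a sketch: the sample input is hypothetical and not taken from the question, while the pattern is the one quoted above.

```php
<?php
// Hypothetical sample input, not taken from the question.
$content = 'Intro.<p class="a">One.</p><p>Two.</p>';

// The pattern from the question: "<p?>" means "<", an optional "p", then ">".
// It can therefore only match "<p>" or "<>", never "<p class=...>".
$pieces = preg_split('/\r\n|\n|\r|<p?>/', $content, -1, PREG_SPLIT_NO_EMPTY);

print_r($pieces);
// The <p class="a"> tag is not a split point, and the matched <p> is dropped:
// [0] => Intro.<p class="a">One.</p>
// [1] => Two.</p>
```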

Is there a way to accomplish this using preg_split?

Updated example:

Incoming content may be:

<p class="example">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed in odio ac diam interdum vulputate eget vel nisl. Aliquam felis nulla, porttitor ac elit eu, auctor blandit metus. Sed ut turpis quam. Fusce fermentum felis nec nulla hendrerit, sit amet euismod lectus hendrerit. Nullam malesuada est urna, non iaculis enim rhoncus sit amet. Vivamus metus arcu, consectetur at nisi vitae, suscipit finibus purus. Pellentesque pellentesque sapien mauris, ac dignissim ipsum rhoncus vitae. Proin nulla leo, ultrices ut diam in, condimentum efficitur urna.</p><p>Mauris felis felis, condimentum sed nisl commodo, suscipit commodo magna. Donec quis diam vel nibh commodo facilisis. Sed pretium purus non mi dapibus sagittis. Sed sed rutrum odio.</p>

Integer quis condimentum lectus. Pellentesque tristique ultrices nisi a auctor. Donec porta molestie dignissim. <p>Integer ut enim eget felis molestie ultrices. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Phasellus a venenatis turpis, sit amet commodo nulla. Aliquam nunc ligula, imperdiet sed eleifend a, convallis ut leo.</p> Praesent pharetra finibus quam, quis viverra augue blandit non. Ut commodo finibus dolor at volutpat. Etiam id elit cursus, luctus augue ac, iaculis purus. Vivamus posuere ex vitae orci dictum, consequat tincidunt lorem molestie. Fusce nec erat quis nibh pretium convallis. In pretium euismod augue at interdum. Sed magna elit, pellentesque sed elit eget, venenatis imperdiet dolor.

Needed array in $paragraphs:

$paragraphs = array(
    0 => '<p class="example">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed in odio ac diam interdum vulputate eget vel nisl. Aliquam felis nulla, porttitor ac elit eu, auctor blandit metus. Sed ut turpis quam. Fusce fermentum felis nec nulla hendrerit, sit amet euismod lectus hendrerit. Nullam malesuada est urna, non iaculis enim rhoncus sit amet. Vivamus metus arcu, consectetur at nisi vitae, suscipit finibus purus. Pellentesque pellentesque sapien mauris, ac dignissim ipsum rhoncus vitae. Proin nulla leo, ultrices ut diam in, condimentum efficitur urna.</p>',
    1 => '<p>Mauris felis felis, condimentum sed nisl commodo, suscipit commodo magna. Donec quis diam vel nibh commodo facilisis. Sed pretium purus non mi dapibus sagittis. Sed sed rutrum odio.</p>',
    2 => 'Integer quis condimentum lectus. Pellentesque tristique ultrices nisi a auctor. Donec porta molestie dignissim.',
    3 => '<p>Integer ut enim eget felis molestie ultrices. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Phasellus a venenatis turpis, sit amet commodo nulla. Aliquam nunc ligula, imperdiet sed eleifend a, convallis ut leo.</p> Praesent pharetra finibus quam, quis viverra augue blandit non. Ut commodo finibus dolor at volutpat. Etiam id elit cursus, luctus augue ac, iaculis purus. Vivamus posuere ex vitae orci dictum, consequat tincidunt lorem molestie. Fusce nec erat quis nibh pretium convallis. In pretium euismod augue at interdum. Sed magna elit, pellentesque sed elit eget, venenatis imperdiet dolor.'
);

As a very simple workaround, you could add a line break before every <p> and <p [...]> tag:

$content = str_replace("<p>", "\n<p>", $content);
$content = str_replace("<p ", "\n<p ", $content);

Then use your preg_split with only the newline alternatives:

$paragraphs = preg_split('/\r\n|\n|\r/', $content, -1, PREG_SPLIT_NO_EMPTY);

This way every <p [...]> becomes a split point, and the <p> tags are retained inside the paragraphs.
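
Put together, the workaround might look like the sketch below. The sample string and the final trim step are assumptions added for illustration; they are not part of the answer itself.

```php
<?php
// Hypothetical sample input; the strings are illustrative only.
$content = '<p class="a">One.</p><p>Two.</p>' . "\r\n" . 'Three. <p>Four.</p> Five.';

// Put a line break in front of every opening <p> or <p ...> tag.
$content = str_replace('<p>', "\n<p>", $content);
$content = str_replace('<p ', "\n<p ", $content);

// Split on line breaks only, so the tags stay inside the pieces.
$paragraphs = preg_split('/\r\n|\n|\r/', $content, -1, PREG_SPLIT_NO_EMPTY);

// Assumption: trim stray spaces left at the edges of each piece.
$paragraphs = array_map('trim', $paragraphs);

print_r($paragraphs);
// [0] => <p class="a">One.</p>
// [1] => <p>Two.</p>
// [2] => Three.
// [3] => <p>Four.</p> Five.
```

Note that text following a closing </p> on the same line stays attached to that paragraph, which matches the shape of the array requested in the question.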

This one should work:

$para = preg_split('~(?<=</p>)\s*|(?!\G)\s*(?=<p)~', trim($text));

The separator is either the position right after a closing </p> tag, together with any whitespace that follows it, or any whitespace right before an opening <p tag.

(?<=...) is a lookbehind and means "preceded by". Note that a lookbehind is only a test: the content matched inside it is not part of the overall match, which is why the </p> tag is not consumed by the split. Likewise, (?=...) is a lookahead and means "followed by", so the opening <p is kept as well.

\s* means zero or more whitespace characters (spaces, tabs, and newlines).

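\G is an anchor for the end of the previous match (or the start of the string on the first attempt), so (?!\G) prevents a second, empty split at the position where the previous separator ended, and a split at the very start of the string.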
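
To see the lookarounds at work, here is a minimal sketch on a hypothetical sample string. The PREG_SPLIT_NO_EMPTY flag is an addition not present in the answer; it is used here as a precaution because the empty match after a final </p> can otherwise leave an empty trailing element.

```php
<?php
// Hypothetical sample input; the strings are illustrative only.
$text = '<p class="a">One.</p><p>Two.</p> Three. <p>Four.</p>';

// Same pattern as in the answer; the flag is an added precaution.
$para = preg_split(
    '~(?<=</p>)\s*|(?!\G)\s*(?=<p)~',
    trim($text),
    -1,
    PREG_SPLIT_NO_EMPTY
);

print_r($para);
// [0] => <p class="a">One.</p>
// [1] => <p>Two.</p>
// [2] => Three.
// [3] => <p>Four.</p>
```

Because the tags are only tested by the lookarounds and never consumed, every piece keeps its opening and closing tag, including the one with a class attribute.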

Note: if you want to take newlines into account as paragraph separators, you can change the pattern to:

$para = preg_split('~(?<=</p>)\s*|(?!\G)\s*(?=<p)|\h*+\s+~', trim($text));

But note that in this case the text enclosed between p tags must not contain newline characters; otherwise the result will not be coherent.
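
The caveat can be illustrated with the sketch below, again on a hypothetical sample string and with PREG_SPLIT_NO_EMPTY added as a precaution. The third alternative, \h*+\s+, only matches whitespace runs that contain a line break, so single spaces between words are left alone.

```php
<?php
// Pattern from the answer, including the newline alternative.
$pattern = '~(?<=</p>)\s*|(?!\G)\s*(?=<p)|\h*+\s+~';

// Hypothetical input: the last <p> element contains a newline.
$text = "<p>One.</p>\nTwo lines\nhere. <p>Three\ninside.</p>";

$para = preg_split($pattern, trim($text), -1, PREG_SPLIT_NO_EMPTY);

print_r($para);
// [0] => <p>One.</p>
// [1] => Two lines
// [2] => here.
// [3] => <p>Three
// [4] => inside.</p>
```

The last paragraph is torn into two pieces, each holding only one of its tags, which is the incoherent result the note warns about.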
