How do I get part of a large text without losing the HTML tags?

Question

I get a large block of content from an API, something like this:

Lorem <div class="highlighted">ipsum dolor</div> 
sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit 
esse cillum dolore eu fugiat nulla pariatur

I want to show about 10 words from this content, but I don't want to lose the <div class="highlighted">ipsum dolor</div> part. In other words, the div and its class="highlighted" attribute should not be removed.

I tried this function:

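    function getPartialContent($content, $words_number)
    {
        // Strip tags, decode entities, then drop line breaks.
        // Note: FILTER_SANITIZE_STRING is deprecated as of PHP 8.1.
        $no_tags_content = preg_replace("/\r|\n/", "", html_entity_decode(filter_var($content, FILTER_SANITIZE_STRING)));

        // Keep the first $words_number space-separated words.
        $words = explode(" ", $no_tags_content);
        $result = implode(" ", array_splice($words, 0, $words_number));
        return $result;
    }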

The only problem is that this function strips all HTML tags first. If I skip the tag-stripping step, the result can look something like this (the div is never closed):

Lorem sed do eiusmod tempor incididunt is that this <div class="highlighted">ipsum

which is not what I want.

I expect the result to have properly closed tags or no tags at all. The div usually contains one or two words. The exact number of words in the result is not that important; I just want it to be short, around 10 to 15 words.

Answer 1

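You could try something like this:

// $text holds the HTML content returned by the API.
$rgxp = '/^(\W*(<[^>]+>\W*)?\w+(\W*<[^>]+>)?\W*){10,15}/';
if (preg_match($rgxp, $text, $mtch)) {
    echo "\n", $mtch[0], "\n"; // the first 10 to 15 words, with the tags around them
}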

Expanded:

$rgxp = '/
^             # start of the string
(             # group to quantify
\W*           # skip spaces & punctuation
(<[^>]+>\W*)? # optional opening tag group
\w+           # the word to count
(\W*<[^>]+>)? # optional closing tag group
\W*           # skip spaces & punctuation
){10,15}      # repeat for 10 to 15 words
/x';
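
One caveat with the pattern above, as far as I can tell: the trailing tag group accepts any tag, so the match can still end right after an opening tag, and a tag that wraps more than a couple of words can be cut in the middle. If that matters, here is a minimal sketch of a different technique (not the regex above): split the content into tag and text tokens, count words, and close whatever is still open at the cut. The function name `truncateHtmlWords`, its parameters, and the list of void elements are my own assumptions for illustration, and it assumes reasonably well-formed markup with no stray `<` or `>` in the text.

```php
<?php
// Hypothetical helper (not from the original answer): keep the first $limit
// words of $html and close any tags that are still open at the cut.
// Assumes reasonably well-formed markup with no "<" or ">" inside text.
function truncateHtmlWords(string $html, int $limit): string
{
    if ($limit < 1) {
        return '';
    }
    // Elements that never take a closing tag.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    // Split into tag tokens and text tokens, keeping the tags.
    $tokens = preg_split('/(<[^>]+>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $out = '';
    $open = [];   // stack of open element names (lowercase)
    $count = 0;   // words emitted so far

    foreach ($tokens as $token) {
        if ($token[0] === '<' && substr($token, -1) === '>') {
            if (preg_match('#^</([a-zA-Z][\w-]*)\s*>$#', $token, $m)) {
                // Closing tag: pop it if it matches the innermost open element.
                if (end($open) === strtolower($m[1])) {
                    array_pop($open);
                }
            } elseif (preg_match('#^<([a-zA-Z][\w-]*)#', $token, $m)) {
                // Opening tag: remember it unless it is void or self-closing.
                $name = strtolower($m[1]);
                $selfClosing = substr($token, -2) === '/>';
                if (!$selfClosing && !in_array($name, $void, true)) {
                    $open[] = $name;
                }
            }
            $out .= $token;
            continue;
        }

        // Text token: copy words and whitespace until the limit is reached.
        $pieces = preg_split('/(\s+)/', $token, -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        foreach ($pieces as $piece) {
            $out .= $piece;
            if (preg_match('/\S/', $piece) && ++$count >= $limit) {
                break 2; // stop reading tokens altogether
            }
        }
    }

    // Close whatever is still open, innermost first.
    foreach (array_reverse($open) as $name) {
        $out .= "</$name>";
    }
    return $out;
}

$text = 'Lorem <div class="highlighted">ipsum dolor</div> sed do eiusmod tempor';
echo truncateHtmlWords($text, 2), "\n"; // Lorem <div class="highlighted">ipsum</div>
echo truncateHtmlWords($text, 4), "\n"; // Lorem <div class="highlighted">ipsum dolor</div> sed
```

Words here are simply runs of non-whitespace characters in the text between tags, so attribute values never count toward the limit. For messy or untrusted markup a real HTML parser would probably be the safer choice.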

Answer 1 posted: 2018-12-31 10:50:59
