Merging two Regular Expressions to Truncate Words in Strings

I'm trying to write the following function, which truncates a string to whole words where possible and otherwise falls back to truncating at a fixed number of characters:

function Text_Truncate($string, $limit, $more = '...')
{
    $string = trim(html_entity_decode($string, ENT_QUOTES, 'UTF-8'));

    if (strlen(utf8_decode($string)) > $limit)
    {
        $string = preg_replace('~^(.{1,' . intval($limit) . '})(?:\s.*|$)~su', '$1', $string);

        if (strlen(utf8_decode($string)) > $limit)
        {
            $string = preg_replace('~^(.{' . intval($limit) . '}).*~su', '$1', $string);
        }

        $string .= $more;
    }

    return trim(htmlentities($string, ENT_QUOTES, 'UTF-8', true));
}

Here are some tests:

// Iñtërnâtiônàlizætiøn and then the quick brown fox... (49 + 3 chars)
echo Text_Truncate('Iñtërnâtiônàlizætiøn and then the quick brown fox jumped overly the lazy dog and one day the lazy dog humped the poor fox down until she died.', 50, '...');

// Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_...  (50 + 3 chars)
echo Text_Truncate('Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_jumped_overly_the_lazy_dog and one day the lazy dog humped the poor fox down until she died.', 50, '...');

Both work as written. However, if I drop the second preg_replace(), the second test comes back untruncated:

Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_jumped_overly_the_lazy_dog and one day the lazy dog humped the poor fox down until she died....

I can't use substr() because it only works at the byte level, and I don't have access to mb_substr() at the moment. I've made several attempts to merge the second regex into the first one, but without success.

Please help SMS, I've been struggling with this for almost an hour.


EDIT: I'm sorry, I've been awake for 40 hours and I shamelessly missed this:

$string = preg_replace('~^(.{1,' . intval($limit) . '})(?:\s.*|$)?~su', '$1', $string);

Still, if someone has a more optimized regex (or one that ignores the trailing space), please share:

"Iñtërnâtiônàlizætiøn and then "
"Iñtërnâtiônàlizætiøn_and_then_"

EDIT 2: I still can't get rid of the trailing whitespace. Can someone help me out?

EDIT 3: Okay, none of my edits really worked; I was being fooled by RegexBuddy. I should probably leave this for another day and get some sleep now. Off for today.

Perhaps I can give you a happy morning after a long night of RegExp nightmares:

'~^(.{1,' . intval($limit) . '}(?<=\S)(?=\s)|.{'.intval($limit).'}).*~su'

Boiling it down:

^      # Start of String
(       # begin capture group 1
 .{1,x} # match 1 - x characters
 (?<=\S)# lookbehind, match must end with non-whitespace 
 (?=\s) # lookahead, if the next char is whitespace, match
 |      # otherwise test this:
 .{x}   # got to x chars anyway.
)       # end cap group
.*     # match the rest of the string (since you were using replace)

You could always add |$ inside the lookahead, turning (?=\s) into (?=\s|$), but since your code already checks that the string is longer than $limit, I didn't feel that case was necessary.
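For reference, here is a minimal sketch of how the combined pattern might slot into the question's function. The name Text_Truncate_Merged is hypothetical, the html_entity_decode()/htmlentities() calls are left out to keep the focus on the regex, and the length check counts characters with preg_match_all() instead of strlen(utf8_decode()); all three are assumptions of this sketch, not part of the original code.

```php
<?php
// Hypothetical variant of the question's Text_Truncate(): both preg_replace()
// calls are merged into the single alternation pattern from the answer.
function Text_Truncate_Merged($string, $limit, $more = '...')
{
    $limit = intval($limit);

    // With the "u" modifier "." matches one UTF-8 character, so this counts
    // characters rather than bytes.
    if (preg_match_all('~.~su', $string) > $limit) {
        // First alternative: up to $limit characters ending on a non-space
        // that is followed by whitespace (a whole-word cut).
        // Second alternative: exactly $limit characters (a hard cut).
        $pattern = '~^(.{1,' . $limit . '}(?<=\S)(?=\s)|.{' . $limit . '}).*~su';
        $string = preg_replace($pattern, '$1', $string) . $more;
    }

    return $string;
}

echo Text_Truncate_Merged('Iñtërnâtiônàlizætiøn and then the quick brown fox jumped overly the lazy dog.', 50), "\n";
```

Because the first alternative must end on a non-whitespace character, the word-boundary cut should not leave the trailing space that the earlier edits struggled with.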

Have you considered using wordwrap? ( http://us3.php.net/wordwrap )
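A rough sketch of what the wordwrap() route could look like follows. The function name Text_Truncate_Wordwrap is hypothetical, and the input is assumed to be single-byte text: wordwrap() measures width in bytes, so for UTF-8 input like the test strings in the question it would likely cut earlier than expected and could split a multibyte character.

```php
<?php
// Hypothetical helper: wrap the text at $limit bytes and keep the first line.
function Text_Truncate_Wordwrap($string, $limit, $more = '...')
{
    if (strlen($string) <= $limit) {
        return $string;
    }

    // The fourth argument (true) forces a cut inside words longer than $limit,
    // which mirrors the question's fallback to truncating by characters.
    $lines = explode("\n", wordwrap($string, $limit, "\n", true));

    return $lines[0] . $more;
}

echo Text_Truncate_Wordwrap('The quick brown fox jumped over the lazy dog', 20), "\n";
```

This avoids regular expressions altogether, at the cost of being byte-based, which is the same limitation the question raises about substr().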
