How to reduce a string containing the same substring repeated n times to a single instance of the substring

I have strings like 'ageage', 'feetfeetfeet', or 'cmcmcmcmcm' and would like to reduce these to 'age', 'feet', and 'cm' respectively.

This is an intermediate normalization step for matching certain classes of data fields across different data sources. The fields originally also contained numbers; the numeric parts have been moved into a separate string. All Unicode letters have been transliterated to lowercase ASCII letters with:

public static function transliterate(string $value)
{
    $transliterator = Transliterator::createFromRules(
        ':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;',
        Transliterator::FORWARD
    );
    return $transliterator->transliterate($value);
}

Also note that pluralization doesn't matter: while the examples I've provided are in English, the project is normalizing mainly Turkish strings, where such words would always be singular.

I expect this can be done with a regex, though I'm not entirely sure how.

I assume a non-regex solution is OK.

This method loops over the first half of the string and looks for a prefix that, when removed everywhere with str_replace, leaves an empty string.
If we find one, then we know it's a repeating word.

$str = 'feetfeetfeet';
$return = $str; // return full str if it fails

$len = strlen($str);

for($i = 1; $i <= $len/2; $i++){ // <= so that exactly two repeats (e.g. 'ageage') are found
    $sub = substr($str, 0, $i);
    if(str_replace($sub, "", $str) === ""){
        $return = $sub;
        break;
    }
}

echo $return; //feet
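The loop above can be wrapped in a reusable function. This is only a sketch: the name reduceRepeated is made up for illustration, and the divisibility test is an extra shortcut that is not part of the answer.

```php
<?php
// Hypothetical helper name; wraps the str_replace loop from the answer above.
function reduceRepeated(string $str): string
{
    $len = strlen($str);
    // A repeating unit can be at most half the string long.
    for ($i = 1; $i <= intdiv($len, 2); $i++) {
        // Added shortcut: the unit length must divide the total length.
        if ($len % $i !== 0) {
            continue;
        }
        $sub = substr($str, 0, $i);
        // If removing every occurrence leaves nothing, $str is $sub repeated.
        if (str_replace($sub, '', $str) === '') {
            return $sub;
        }
    }
    return $str; // no repetition found, return the input unchanged
}

echo reduceRepeated('ageage'), "\n";       // age
echo reduceRepeated('feetfeetfeet'), "\n"; // feet
echo reduceRepeated('cmcmcmcmcm'), "\n";   // cm
echo reduceRepeated('xyz'), "\n";          // xyz
```

Because the shortest prefix is tried first, this returns the smallest unit, so 'mmmm' becomes 'm' rather than 'mm'.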
  • This looks similar to finding the longest proper prefix that is also a suffix. The length of the repeating unit is then: length - (length of the longest prefix which is also a suffix). You can find the algorithm for building the prefix-suffix table in the KMP pattern matching algorithm.

  • Time complexity is O(n) and space complexity is O(n).

Snippet:

<?php

$str = "feetfeetfeet";
$length = strlen($str);

$prefix_suffix_table = array_fill(0, $length, 0);

$j = 0;
for($i = 1; $i < $length; ++$i){
    while($j > 0 && $str[$i] != $str[$j]){
        $j = $prefix_suffix_table[$j - 1];
    }

    if($str[$i] == $str[$j]){
        $prefix_suffix_table[$i] = ++$j;
    }
}

echo substr($str, 0, $length - end($prefix_suffix_table));

Demo: http://sandbox.onlinephpfunctions.com/code/b401c75cde38a51a561b53bb0a6294eb615b208c

Note: If your string is malformed, like xyz, which has no repeating substring (or abcab, which is only partly repeated), the snippet still prints a prefix. You can add an additional check using str_repeat() and throw an exception if required.
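One way the str_repeat() check from the note might look, as a sketch. The function name and the choice of exception classes are assumptions for illustration; the prefix table is built exactly as in the snippet above.

```php
<?php
// Hypothetical helper: returns the smallest repeating unit of $str, or
// throws if $str is not made of at least two whole copies of a prefix.
function smallestRepeatingUnit(string $str): string
{
    $length = strlen($str);
    if ($length === 0) {
        throw new InvalidArgumentException('Empty string');
    }

    // KMP prefix table: $table[$i] is the length of the longest proper
    // prefix of the first $i + 1 characters that is also their suffix.
    $table = array_fill(0, $length, 0);
    $j = 0;
    for ($i = 1; $i < $length; ++$i) {
        while ($j > 0 && $str[$i] !== $str[$j]) {
            $j = $table[$j - 1];
        }
        if ($str[$i] === $str[$j]) {
            $table[$i] = ++$j;
        }
    }

    $unit  = substr($str, 0, $length - $table[$length - 1]);
    $times = intdiv($length, strlen($unit));

    // The extra check: repeating the candidate unit must rebuild $str exactly.
    if ($times < 2 || str_repeat($unit, $times) !== $str) {
        throw new UnexpectedValueException("'$str' has no repeating substring");
    }
    return $unit;
}

echo smallestRepeatingUnit('feetfeetfeet'), "\n"; // feet
```

With this check, 'xyz' and 'abcab' both throw instead of returning a misleading prefix.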

You can also use str_split() to convert the string into an array, find its unique elements, and then implode the unique elements back together. This works only when no letter occurs twice within the repeating unit.

<?php
$str = array_unique(str_split('ageage'));
$result = implode($str);
?>

Output

age
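A quick check of where the array_unique() approach holds and where it breaks down. array_unique() removes duplicate characters, not duplicate substrings, so a unit such as 'feet' loses its second 'e'. The helper name below is made up for this illustration.

```php
<?php
// Hypothetical helper wrapping the array_unique() approach from above.
function uniqueChars(string $str): string
{
    return implode(array_unique(str_split($str)));
}

// Works: 'age' contains no repeated letters.
echo uniqueChars('ageage'), "\n";       // age
// Breaks: the second 'e' of 'feet' is dropped as a duplicate.
echo uniqueChars('feetfeetfeet'), "\n"; // fet
```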

I have figured out how to do this with a regex, although I have realized that it might not be useful for my purposes, because mmmm can be either 2x mm (millimeters) or 4x m (meters). If I only care about supporting up to 3 repetitions, I can use:

if(preg_match('/^([a-z]*)\1{2}$/', $input, $matches)) {
    $repeating = $matches[1];
    $reps = 3;
} elseif(preg_match('/^([a-z]*)\1$/', $input, $matches)) {
    $repeating = $matches[1];
    $reps = 2;
} else {
    $repeating = $input;
    $reps = 1;
}

Note that the following will divide the string into the smallest prime number of repeats:

preg_match('/^([a-z]*)\1+$/', $input, $matches);
$repeating = $matches[1];

Here is a table of the outputs of this:

┌────────────┬────────────┐
│   $input   │ $repeating │
├────────────┼────────────┤
│ mm         │ m          │
│ mmm        │ m          │
│ mmmm       │ mm         │
│ mmmmm      │ m          │
│ mmmmmm     │ mmm        │
│ mmmmmmm    │ m          │
│ mmmmmmmm   │ mmmm       │
│ mmmmmmmmm  │ mmm        │
│ mmmmmmmmmm │ mmmmm      │
└────────────┴────────────┘

Because only the smallest prime subdivisions are considered,

preg_match('/^([a-z]*)\1{1,2}$/', $input, $matches)

is unsuitable, as it will, like in the above table, find the repeating part of 'mmmmmm' to be 'mmm' instead of the desired 'mm'.
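The behaviour in the table comes from the group being greedy: the regex engine tries the longest candidate for the captured group first and backtracks to shorter ones, so the match with the fewest repeats wins. A small sketch to check this; the helper name is made up, and it uses + instead of * in the group so that an empty string never matches.

```php
<?php
// Hypothetical helper: the unit found by the greedy pattern, or null if
// the input is not a repetition.
function greedyUnit(string $input): ?string
{
    return preg_match('/^([a-z]+)\1+$/', $input, $m) === 1 ? $m[1] : null;
}

foreach ([4, 6, 9] as $n) {
    $input = str_repeat('m', $n);
    printf("%-9s => %s\n", $input, greedyUnit($input));
}
// mmmm      => mm
// mmmmmm    => mmm
// mmmmmmmmm => mmm
```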

The three-case implementation I provided at the beginning is what I am currently using, because my input is generally either age groups or product dimensions. I have yet to see a product described with more than three dimensions, or with an age group like '11yr,12yr,13yr,14yr', though I can imagine something like the latter eventually occurring, however rare. Thus I will probably move away from this method and instead extract the units from the original string containing the numbers with preg_match_all:

preg_match_all('/([0-9]+)\s*([a-z]*)\s*/', $input, $matches)
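A sketch of how the results of that call might be used. The sample input is an assumption, not real project data.

```php
<?php
// Sample input made up for illustration.
$input = '10cm x 20cm x 30cm';
preg_match_all('/([0-9]+)\s*([a-z]*)\s*/', $input, $matches);

$numbers = $matches[1];               // ['10', '20', '30']
$units   = array_unique($matches[2]); // ['cm'], one entry per distinct unit

echo implode(',', $numbers), ' ', implode(',', $units), "\n"; // 10,20,30 cm
```

One thing to watch for: with input like '10 x 20 cm', the letters right after the first number are the separator, so 'x' would be captured as a unit.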

However, in case someone else is actually interested in finding the smallest repeating substring (so 'm' for 'mmmm'), this can be done with a regex in a loop:

$repeating = $input;
while(preg_match('/^([a-z]+)\1+$/', $repeating, $matches)) { // + rather than *, so an empty string cannot loop forever
    $repeating = $matches[1];
}

This will produce:

┌────────────┬────────────┐
│   $input   │ $repeating │
├────────────┼────────────┤
│ mm         │ m          │
│ mmm        │ m          │
│ mmmm       │ m          │
│ mmmmm      │ m          │
│ mmmmmm     │ m          │
│ mmmmmmm    │ m          │
│ mmmmmmmm   │ m          │
│ mmmmmmmmm  │ m          │
│ mmmmmmmmmm │ m          │
│ cmcm       │ cm         │
│ cmcmcm     │ cm         │
│ cmcmcmcm   │ cm         │
│ cmcmcmcmcm │ cm         │
└────────────┴────────────┘
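A possible alternative to the loop, not taken from the answer above: making the group lazy with +? means the shortest candidate unit is tried first, so a single match should already give the smallest repeating substring. The helper name is made up for illustration.

```php
<?php
// Hypothetical helper: smallest repeating unit via one lazy regex match.
function lazyUnit(string $input): string
{
    // +? is lazy: the engine tries a 1-letter unit, then 2 letters, and so on.
    if (preg_match('/^([a-z]+?)\1+$/', $input, $m) === 1) {
        return $m[1];
    }
    return $input; // not a repetition, return the input unchanged
}

echo lazyUnit('mmmm'), "\n";         // m
echo lazyUnit('cmcmcmcmcm'), "\n";   // cm
echo lazyUnit('feetfeetfeet'), "\n"; // feet
echo lazyUnit('xyz'), "\n";          // xyz
```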
