Identifying occurrences of adjacent repeated characters in a string (Python)

Question

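I'm learning Python these days and trying out various challenges to strengthen my understanding of the concepts.

One small challenge I just tried is to determine whether a given string contains n adjacent repeated characters.

I tried it as follows: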

def uniform_string(text):
    text = text.lower()
    i = 0
    while i < (len(text)-3):
        if text[i] == text[i+1] and text[i+1] == text[i+2] and text[i+2] == text[i+3]:
            return True
        i += 1
    return False

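This assumes n=4, and I am now trying to generalize it.

As an aside, it also got me thinking: is there a more efficient way to do this? My current solution does about 4 lookups for each position in the string (which means that, as n increases, this grows towards O(n^2)).

What would be a better way to tackle such a challenge?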

Answer 1

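In the following function, n is the number of characters to check for equality. To keep calls to the original function unchanged, n is given a default value of 4.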

def uniform_string(text, n=4):
    text = text.lower()
    i = 0
    # <= so that the last window, starting at len(text) - n, is checked too
    while i <= len(text) - n:
        if text[i:i + n] == text[i] * n:
            return True
        i += 1
    return False

Alternatively, you can use a for loop:

def uniform_string(text, n):
    text = text.lower()

    for i in range(len(text) - n + 1):
        if text[i:i + n] == text[i] * n:
            return True

    return False
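
As a quick check of the behaviour described above, the sketch below repeats the for-loop version (with the default n=4 added, as in the first snippet) and calls it on a few made-up strings, including one where the run sits at the very end of the text. The sample strings are illustrative only.

```python
def uniform_string(text, n=4):
    """Return True if text contains n equal adjacent characters (case-insensitive)."""
    text = text.lower()
    # The last window that still fits starts at index len(text) - n.
    for i in range(len(text) - n + 1):
        if text[i:i + n] == text[i] * n:
            return True
    return False


print(uniform_string("abcdddd"))      # run of four at the very end
print(uniform_string("aAaAbc"))       # case is ignored
print(uniform_string("aabbcc"))       # no run of four
print(uniform_string("aabbcc", n=2))  # runs of two do exist
```

Existing calls such as uniform_string(text) keep working because n falls back to 4.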

Answer 2

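You can use groupby to group consecutive identical characters and count the characters in each group. If any count is greater than or equal to the threshold, the result is True: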

from itertools import groupby

def uniform_string(text, n):
    return any(sum(1 for _ in g) >= n for _, g in groupby(text.lower()))

print(uniform_string('fofo', 1))
print(uniform_string('fofo', 2))
print(uniform_string('fofoo', 2))

Output:

True
False
True

The time complexity of the above is O(n), where n here is the length of the string rather than the run length.
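
To illustrate why one pass over the string is enough, here is a minimal sketch. The helper names run_lengths and has_run are made up for this example: the first shows the groups that groupby produces, and the second does the same check with an explicit counter that stops as soon as a run reaches n.

```python
from itertools import groupby


def run_lengths(text):
    """List (character, run length) pairs for consecutive equal characters."""
    return [(char, sum(1 for _ in group)) for char, group in groupby(text)]


def has_run(text, n):
    """Single pass with an explicit counter; returns early once a run reaches n."""
    text = text.lower()
    count = 0
    previous = None
    for char in text:
        # Extend the current run, or start a new one when the character changes.
        count = count + 1 if char == previous else 1
        if count >= n:
            return True
        previous = char
    return False


print(run_lengths("fofoo"))
print(has_run("fofo", 2), has_run("fofoo", 2))
```

Each character is looked at once, so the work does not depend on the run length being searched for.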

Answer 3

You can use a slice (some_list[start:stop]) and sets to solve your problem.

def uniform_string(text, n):
    text = text.lower()

    for i in range(len(text) - n + 1):
        if len(set(text[i:i+n])) == 1:  # all equal
            return True
    return False

Your code will also be more concise if you use a for loop instead of a while loop. :)
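
To make the window test concrete, the lines below show what the set of a slice looks like for a matching and a non-matching window. The helper name window_is_uniform is hypothetical and only serves this illustration. Note that each window builds a new set from up to n characters, so the total work is roughly len(text) * n.

```python
def window_is_uniform(text, start, n):
    """True if the n characters starting at start are all the same."""
    window = text[start:start + n]
    # A full-length window with exactly one distinct character is uniform.
    return len(window) == n and len(set(window)) == 1


text = "abbbbc"
print(set(text[0:4]))  # two distinct characters, 'a' and 'b'
print(set(text[1:5]))  # a single distinct character, 'b'
print(window_is_uniform(text, 0, 4))
print(window_is_uniform(text, 1, 4))
```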

Answer 4

The answers posted so far miss one of Python's nicer iteration features, enumerate:

def uniform_string(text, n):
    for i, c in enumerate(text):
        if text[i:i+n] == c * n:
            print(c, 'at', i, 'in', text)

I'm not sure this is exactly what you're asking for, but it may help you.
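
If the positions are needed by the caller rather than printed, a variant along the same lines could collect them in a list. This is only a sketch, and the name find_runs is made up for the example.

```python
def find_runs(text, n):
    """Return (character, start index) for every window of n equal characters."""
    matches = []
    for i, c in enumerate(text):
        # Near the end of the string the slice is shorter than n and cannot match.
        if text[i:i + n] == c * n:
            matches.append((c, i))
    return matches


print(find_runs("aaabcccc", 3))
```

A run longer than n is reported once per starting position, so the four c's above yield two overlapping matches.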

Answer 5

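A bit of a wildcard entry:

Advantage: very fast. An example with 4,000,000 characters is analysed instantly.

Disadvantage: depends on numpy, and is convoluted.

Here goes: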

import numpy as np
a = "etr" + 1_000_000 * "zr" + "hhh" + 1_000_000 * "Ar"
np.max(np.diff(np.r_[-1, np.where(np.diff(np.frombuffer(a.encode('utf16'), dtype=np.uint16)[1:]))[0], len(a) - 1]))
3

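How it works:

Encode the string as a byte string with a fixed width per character
Interpret the buffer as a numpy array
Compute the "derivative" (the differences between neighbouring elements)
Find the non-zero positions, i.e. the places where the character changes
The distances between these positions are the run lengths
Take the maximum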
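
The steps above can be spelled out one at a time on a short string. This is a sketch of the same one-liner split into named intermediate values. The function name longest_run and the empty-string guard are additions for this illustration. It assumes every character fits in one 16-bit unit, since characters outside the Basic Multilingual Plane take two units in UTF-16.

```python
import numpy as np


def longest_run(text):
    """Length of the longest run of equal adjacent characters in text."""
    if not text:
        return 0  # assumption: an empty string has no runs
    # Steps 1-2: fixed-width encoding viewed as 16-bit integers.
    # The [1:] drops the byte order mark that the 'utf16' codec prepends.
    codes = np.frombuffer(text.encode('utf16'), dtype=np.uint16)[1:]
    # Steps 3-4: a non-zero difference marks a position where the character changes.
    changes = np.where(np.diff(codes))[0]
    # Step 5: distances between change positions (plus both ends) are the run lengths.
    runs = np.diff(np.r_[-1, changes, len(codes) - 1])
    # Step 6: the longest run is the maximum.
    return int(runs.max())


print(longest_run("abbbcc"))
```

For "abbbcc" the change positions are 0 and 3, so the boundaries are -1, 0, 3, 5 and the run lengths are 1, 3, 2.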

Update:

Here is a hybrid version that does some crude short-circuiting, plus some basic benchmarking to find the best parameters:

import numpy as np
from timeit import timeit

occ = 4
loc = (10, 20, 40, 80, 160, 320, 1000, 2000, 4000, 8000, 16000, 32000, 64000,
       125000, 250000, 500000, 1_000_000, 2_000_000)
a = ['pafoe<03' + o * 'gr' + occ * 'x' + (2_000_000 - o) * 'u1'
      + 'leto50d-fjeoa'[occ:] for o in loc]

def brute_force(a):
    np.max(np.diff(np.r_[-1, np.where(np.diff(np.frombuffer(
        a.encode('utf16'), dtype=np.uint16)[1:]))[0], len(a) - 1]))

def reverse_bisect(a, chunk, encode_all=True):
    j = 0
    i = chunk
    n = len(a)
    if encode_all:
        av = np.frombuffer(a.encode('utf16'), dtype=np.uint16)[1:]
    while j<n:
        if encode_all:
            s = av[j : j + chunk]
        else:
            s = np.frombuffer(a[j:j+chunk].encode('utf16'), dtype=np.uint16)[1:]
        if np.max(np.diff(np.r_[-1, np.where(np.diff(s))[0], len(s)-1])) >= occ:
            return True
        j += chunk - occ + 1
        chunk *= 2
    return False

leave_out = 2
out = []
print('first repeat at', loc[:-leave_out])
print('brute force {}'.format(
    (timeit('[f(a) for a in A]', number=100, globals={
        'f': brute_force, 'A': a[:-leave_out]}))))
print('hybrid (reverse bisect)')
for chunk in 2**np.arange(2, 18):
    out.append(timeit('[f(a,c,e) for a in A]', number=100, globals={
        'f': reverse_bisect, 'A': a[:-leave_out], 'c': chunk, 'e': True}))
    out.append(timeit('[f(a,c,e) for a in A]', number=100, globals={
        'f': reverse_bisect, 'A': a[:-leave_out], 'c': chunk, 'e': False}))
    print('chunk: {}, timings: encode all {} -- encode chunks {}'.format(
        chunk, out[-2], out[-1]))

Sample run:

first repeat at (10, 20, 40, 80, 160, 320, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 125000, 250000, 500000)
brute force 90.26514193788171
hybrid (reverse bisect)
chunk: 4, timings: encode all 5.257935176836327 -- encode chunks 2.3392367498017848
chunk: 8, timings: encode all 5.210895746946335 -- encode chunks 2.288218504982069
chunk: 16, timings: encode all 5.268893962958828 -- encode chunks 2.2223802611697465
chunk: 32, timings: encode all 5.109196993988007 -- encode chunks 2.1715646600350738
chunk: 64, timings: encode all 5.05742059298791 -- encode chunks 2.1255820950027555
chunk: 128, timings: encode all 5.110778157133609 -- encode chunks 2.100305920932442
chunk: 256, timings: encode all 5.058305847924203 -- encode chunks 2.153960411902517
chunk: 512, timings: encode all 5.108077083015814 -- encode chunks 2.056686638854444
chunk: 1024, timings: encode all 4.969490061048418 -- encode chunks 2.0368234540801495
chunk: 2048, timings: encode all 5.153041162993759 -- encode chunks 2.465495347045362
chunk: 4096, timings: encode all 5.28073402796872 -- encode chunks 2.173405918991193
chunk: 8192, timings: encode all 5.044360157102346 -- encode chunks 2.1234876308590174
chunk: 16384, timings: encode all 5.294338152976707 -- encode chunks 2.334656815044582
chunk: 32768, timings: encode all 5.7856643970590085 -- encode chunks 2.877617093967274
chunk: 65536, timings: encode all 7.04935942706652 -- encode chunks 4.1559580829925835
chunk: 131072, timings: encode all 7.516369879012927 -- encode chunks 4.553452031919733

first repeat at (10, 20, 40)
brute force 16.363576064119115
hybrid (reverse bisect)
chunk: 4, timings: encode all 0.6122389689553529 -- encode chunks 0.045893668895587325
chunk: 8, timings: encode all 0.5982049370650202 -- encode chunks 0.03538667503744364
chunk: 16, timings: encode all 0.5907809699419886 -- encode chunks 0.025738760828971863
chunk: 32, timings: encode all 0.5741697370540351 -- encode chunks 0.01634934707544744
chunk: 64, timings: encode all 0.5719085780438036 -- encode chunks 0.013115004170686007
chunk: 128, timings: encode all 0.5666680270805955 -- encode chunks 0.011037093820050359
chunk: 256, timings: encode all 0.5664500128477812 -- encode chunks 0.010536623885855079
chunk: 512, timings: encode all 0.5695593091659248 -- encode chunks 0.01133729494176805
chunk: 1024, timings: encode all 0.5688401609659195 -- encode chunks 0.012476094998419285
chunk: 2048, timings: encode all 0.5702746720053256 -- encode chunks 0.014690137933939695
chunk: 4096, timings: encode all 0.5782928131520748 -- encode chunks 0.01891179382801056
chunk: 8192, timings: encode all 0.5943365979474038 -- encode chunks 0.0272749038413167
chunk: 16384, timings: encode all 0.609349318081513 -- encode chunks 0.04354232898913324
chunk: 32768, timings: encode all 0.6489383969455957 -- encode chunks 0.07695812894962728
chunk: 65536, timings: encode all 0.7388215309474617 -- encode chunks 0.14061277196742594
chunk: 131072, timings: encode all 0.8899400909431279 -- encode chunks 0.2977339250501245

Answer 6

If you want to tell whether every character is the same, you can do the following:

def uniform_string(text):
    text = text.lower()
    # Assumes text is non-empty; text[0] raises IndexError on an empty string.
    return text.count(text[0]) == len(text)

6 answers

Answer 1
2 2017-02-17 06:02:43

Answer 2
1 2017-02-17 06:00:54

Answer 3
1 2017-02-17 06:03:02

Answer 4
1 2017-02-17 06:21:34

Answer 5
1 accepted 2017-02-17 06:32:09

Answer 6
0 2017-02-17 05:56:03
