Python regex catastrophic backtracking

I am searching an XML file generated from MS Word for some phrases. The problem is that any phrase can be interrupted by XML tags, which can come between words or even inside words, as you can see in this example:

</w:rPr><w:t> To i</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:lang w:fareast="JA"/></w:rPr><w:t>ncrease knowledge of and acquired skills for implementing social policies with a view to strengthening the capacity of developing countries at the national and community level.</w:t></w:r></w:p>

My approach to this problem was simply to replace every XML tag with a cluster of # characters of the same length, so that when I search for a phrase, the regex can skip over whatever tags sit between any two characters.

What I basically need is the span of the phrase within the actual XML document, because I use that span in later processing of the document. I cannot work on a clone of the text.
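The masking step described above might look roughly like the sketch below. The helper name `mask_tags` and the tag pattern `<[^>]*>` are assumptions made for illustration (the question does not show its own code), and the sketch assumes the text itself contains no literal # and no > inside attribute values.

```python
import re

def mask_tags(xml, marker="#"):
    """Replace each tag with a run of `marker` of the same length.

    Assumption: a tag is anything between '<' and the next '>'.
    """
    return re.sub(r"<[^>]*>", lambda m: marker * len(m.group()), xml)

xml = '<w:t> To i</w:t></w:r><w:r><w:t>ncrease knowledge</w:t>'
masked = mask_tags(xml)

# The masked text has the same length as the original, so any span
# found in it is also a valid span in the original document.
print(len(masked) == len(xml))
print(masked.replace("#", ""))
```

Because every tag is replaced by a run of equal length, offsets in the masked text and in the original XML coincide, which is what makes the span usable later.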

This approach works remarkably well, but some phrases cause catastrophic backtracking, such as the following example. I need someone to point out where the backtracking comes from, or to suggest a better solution to the problem.

================================

Here is an example:

I have this text, which contains some clusters of # characters (which I want to keep) and in which the spacing is also unpredictable:

Relationship to the #################strategic framework ################## for the period 2014-2015####################: Programme 7, Economic and Social Affairs, subprogramme 3, expected

accomplishment (c)#######

In order to match the following phrase:

Relationship to the strategic framework for the period 2014-2015: programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)

I came up with this regex to accommodate the unpredictable # and space characters:

u'R#*e#*l#*a#*t#*i#*o#*n#*s#*h#*i#*p#*\\s*#*t#*o#*\\s*#*t#*h#*e#*\\s*#*s#*t#*r#*a#*t#*e#*g#*i#*c#*\\s*#*f#*r#*a#*m#*e#*w#*o#*r#*k#*\\s*#*f#*o#*r#*\\s*#*t#*h#*e#*\\s*#*p#*e#*r#*i#*o#*d#*\\s*#*2#*0#*1#*4#*\\-#*2#*0#*1#*5#*:#*\\s*#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*7#*\\,#*\\s*#*E#*c#*o#*n#*o#*m#*i#*c#*\\s*#*a#*n#*d#*\\s*#*S#*o#*c#*i#*a#*l#*\\s*#*A#*f#*f#*a#*i#*r#*s#*\\,#*\\s*#*s#*u#*b#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*3#*\\,#*\\s*#*e#*x#*p#*e#*c#*t#*e#*d#*\\s*#*a#*c#*c#*o#*m#*p#*l#*i#*s#*h#*m#*e#*n#*t#*\\s*#*\\(#*c#*\\)'

It works fine for all the other phrases that I want to match, but this one leads to catastrophic backtracking. Can anyone spot the problem?

The original text is separated by XML tags; to make things simpler for the regex, I replaced the tags with these # clusters. Here is the original text:

</w:rPr><w:t>Relationship to the </w:t></w:r><w:r><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>strategic framework </w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t> for the period 2014-2015</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)</w:t>

Since the situation is that complex, don't use a regex; just go through your line symbol by symbol:

etalone = "String to find"
etalone_length = len(etalone)
your_line = "Str##ing to f#ind"  # sample input; substitute your own line
counter = 0
for symbol in your_line:
    if symbol == etalone[counter]:
        # Matched the next expected symbol of the reference string.
        counter += 1
        if counter == etalone_length:
            print("String matches")
            break
    elif symbol != " " and symbol != "#":
        # Bad char found: neither the expected symbol nor a skippable one.
        print("Does not match!")
        break
else:  # exhausted the line before the full etalone matched
    print("Does not match!")

I just realized that the above will not actually work if the very first symbol we encounter is not the one we're looking for. How about this instead:

  1. Clone your string
  2. Remove "#" from the clone
  3. Match against the pattern
  4. If the pattern matches, get the location of the matched result
  5. From that location, find which exact occurrence of the first symbol was matched. For example, if the full line is a#b##ca#d#f and the line we're looking for is adf, then we would start matching from the second a symbol.
  6. Find the nth occurrence of symbol a in the original line. Set counter =
  7. Use above algorithm (storing as span start and counter before break as span end)
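The steps above could be sketched as follows. This is only one reading of the list: the function name `find_span` is hypothetical, the phrase is assumed to be non-empty and free of the marker, and spaces are matched exactly rather than skipped.

```python
def find_span(phrase, line, marker="#"):
    """Return (start, end) of `phrase` in `line`, ignoring `marker` runs.

    Assumptions: `phrase` is non-empty and contains no `marker`.
    """
    clean = line.replace(marker, "")          # steps 1-2: clone, strip markers
    pos = clean.find(phrase)                  # steps 3-4: locate in the clone
    if pos < 0:
        return None

    first = phrase[0]
    # Step 5: which occurrence of the first symbol was matched (1-based).
    nth = clean.count(first, 0, pos) + 1

    # Step 6: find that same occurrence in the original line.
    start = -1
    for _ in range(nth):
        start = line.index(first, start + 1)

    # Step 7: walk forward, skipping markers, until the phrase is consumed.
    matched = 0
    for i in range(start, len(line)):
        if line[i] == marker:
            continue
        matched += 1
        if matched == len(phrase):
            return (start, i + 1)             # end is exclusive
    return None

print(find_span("adf", "a#b##ca#d#f"))
```

With the example from step 5, the search starts at the second a, so the returned slice of the original line is a#d#f.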

If I understand the problem correctly, here's a way to tackle it without resorting to pathological regular expressions or character-by-character parsing:

import re

def do_it(search, text_orig, verbose = False):
    # A copy of the text without any "#" markers.
    text_clean = text_orig.replace('#', '')

    # Start position of search text in the cleaned text.
    try:
        i = text_clean.index(search)
    except ValueError:
        return [None, None]

    # Collect the widths of the runs of markers and non-markers.
    rgx    = re.compile(r'#+|[^#]+')
    widths = [len(m.group()) for m in rgx.finditer(text_orig)]

    # From that data, we can compute the span.
    return compute_span(i, len(search), widths, text_orig[0] == '#')
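To make the width data concrete, here is a small standalone sketch of what the run-splitting regex produces for a sample string. The helper name `run_widths` and the sample text are made up for illustration.

```python
import re

RUNS = re.compile(r'#+|[^#]+')

def run_widths(text):
    """Widths of the alternating runs of '#' markers and other text."""
    return [len(m.group()) for m in RUNS.finditer(text)]

sample = '##foo#bar###'
widths = run_widths(sample)
starts_with_marker = sample.startswith('#')

# Runs are '##', 'foo', '#', 'bar', '###'; their widths add up to the
# length of the original text.
print(widths, starts_with_marker)
```

Because the runs strictly alternate, the list of widths plus a single flag saying whether the first run is a marker run is enough to reconstruct every position.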

And here's a fairly simple way to compute the span from the width data. My first attempt was incorrect, as noted by eyquem. The second attempt was correct but complex. This third approach seems both simple and correct.

def compute_span(span_start, search_width, widths, is_marker):
    span_end       = span_start + search_width - 1
    to_consume     = span_start + search_width
    start_is_fixed = False

    for w in widths:
        if is_marker:
            # Shift start and end rightward.
            span_start += (0 if start_is_fixed else w)
            span_end   += w
        else:
            # Reduce amount of non-marker text we need to consume.
            # As that amount gets smaller, we'll first fix the
            # location of the span_start, and then stop.
            to_consume -= w
            if to_consume < search_width:
                start_is_fixed = True
                if to_consume <= 0: break
        # Toggle the flag.
        is_marker = not is_marker

    return [span_start, span_end]

And a bunch of tests to keep the critics at bay:

def main():
    tests = [
        #                0123456789012345678901234567890123456789
        ( [None, None], '' ),
        ( [ 0,  5],     'foobar' ),
        ( [ 0,  5],     'foobar###' ),
        ( [ 3,  8],     '###foobar' ),
        ( [ 2,  7],     '##foobar###' ),
        ( [25, 34],     'BLAH ##BLAH fo####o##ba##foo###b#ar' ),
        ( [12, 26],     'BLAH ##BLAH fo####o##ba###r## BL##AH' ),
        ( [None, None], 'jkh##jh#f' ),
        ( [ 1, 12],     '#f#oo##ba###r##' ),
        ( [ 4, 15],     'a##xf#oo##ba###r##' ),
        ( [ 4, 15],     'ax##f#oo##ba###r##' ),
        ( [ 7, 18],     'ab###xyf#oo##ba###r##' ),
        ( [ 7, 18],     'abx###yf#oo##ba###r##' ),
        ( [ 7, 18],     'abxy###f#oo##ba###r##' ),
        ( [ 8, 19],     'iji#hkh#f#oo##ba###r##' ),
        ( [ 8, 19],     'mn##pps#f#oo##ba###r##' ),
        ( [12, 23],     'mn##pab###xyf#oo##ba###r##' ),
        ( [12, 23],     'lmn#pab###xyf#oo##ba###r##' ),
        ( [ 0, 12],     'fo##o##ba###r## aaaaaBLfoob##arAH' ),
        ( [ 0, 12],     'fo#o##ba####r## aaaaaBLfoob##ar#AH' ),
        ( [ 0, 12],     'f##oo##ba###r## aaaaaBLfoob##ar' ),
        ( [ 0, 12],     'f#oo##ba####r## aaaaBL#foob##arAH' ),
        ( [ 0, 12],     'f#oo##ba####r## aaaaBL#foob##ar#AH' ),
        ( [ 0, 12],     'foo##ba#####r## aaaaBL#foob##ar' ),
        ( [ 1, 12],     '#f#oo##ba###r## aaaBL##foob##arAH' ),
        ( [ 1, 12],     '#foo##ba####r## aaaBL##foob##ar#AH' ),
        ( [ 2, 12],     '#af#oo##ba##r## aaaBL##foob##ar' ),
        ( [ 3, 13],     '##afoo##ba###r## aaaaaBLfoob##arAH' ),
        ( [ 5, 17],     'BLAHHfo##o##ba###r aaBLfoob##ar#AH' ),
        ( [ 5, 17],     'BLAH#fo##o##ba###r aaBLfoob##ar' ),
        ( [ 5, 17],     'BLA#Hfo##o##ba###r###BLfoob##ar' ),
        ( [ 5, 17],     'BLA#Hfo##o##ba###r#BL##foob##ar' ),
    ]
    for exp, t in tests:
        span = do_it('foobar', t, verbose = True)
        if exp != span:
            # Report any mismatch under a column ruler.
            print('\n0123456789012345678901234567890123456789')
            print(t)
            print(dict(got = span, exp = exp))

main()

Another, simpler solution is to remove the pound signs with

your_string.replace('#', '')

and test your regex (without all the #*) against the string returned by replace.
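Stripping the markers changes the offsets, while the question needs the span in the original text. One way to keep both, shown here only as a sketch, is to remember the original index of every character that survives the stripping. The function name `search_with_offsets` is hypothetical.

```python
import re

def search_with_offsets(pattern, text, marker="#"):
    """Search `pattern` in `text` with markers removed.

    Returns the (start, end) span in the ORIGINAL text, or None.
    """
    # kept[i] is the original index of the i-th character of the clean text.
    kept = [i for i, ch in enumerate(text) if ch != marker]
    clean = "".join(text[i] for i in kept)

    m = re.search(pattern, clean)
    if m is None or m.start() == m.end():
        return None
    # Map the clean span back; the end index is exclusive.
    return (kept[m.start()], kept[m.end() - 1] + 1)

text = 'BLAH ##BLAH fo####o##ba###r## BL##AH'
print(search_with_offsets(r'foobar', text))
```

The regex itself stays plain (no #* between characters), so the backtracking problem does not arise, and the index list costs one integer per kept character.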

The catastrophic backtracking may arise because your regex contains multiple instances of the pattern #*\\s*#*. Each of these will match any block of repeated # characters, but it can match the same text in multiple ways. When you have several of these patterns in your regex, the number of possibilities multiplies.
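To make the ambiguity concrete, the sketch below counts by brute force how many ways a run of # characters can be divided between the two #* parts of that pattern. It does not measure the regex engine's real backtracking steps; the function name `count_splits` is made up for illustration.

```python
import re

def count_splits(run_length):
    """Count the ways a run of '#' can be shared by the two #* parts."""
    run = "#" * run_length
    ways = 0
    for left in range(run_length + 1):
        right = run_length - left
        # Pin the first #* to `left` characters and the second to `right`.
        pinned = r"#{%d}\s*#{%d}" % (left, right)
        if re.fullmatch(pinned, run):
            ways += 1
    return ways

for n in (1, 5, 20):
    print(n, count_splits(n))
```

A run of n markers can be split in n + 1 ways, so a backtracking engine may have to try each of them before giving up on a later mismatch.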

Are you searching in a larger body of text? If so, does the text contain phrases that coincide with the beginning of your search text? If so, the regex engine matches the beginning of the pattern and, on finding a mismatch, starts backtracking.

Note that the text framework ################## for is not matched by the regex f#*r#*a#*m#*e#*w#*o#*r#*k#*\\s*#*f#*o#*r because of an unmatched space character: the pattern allows only a single run of whitespace, but the text has a space on each side of the # cluster.
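The mismatch is easy to confirm in isolation. The sketch below tests the fragment quoted above against the text as given, and against a variant (made up for the comparison) in which the space after the # cluster is removed.

```python
import re

pattern = r"f#*r#*a#*m#*e#*w#*o#*r#*k#*\s*#*f#*o#*r"

with_both_spaces = "framework ################## for"
with_one_space = "framework ##################for"   # hypothetical variant

# The space between the # cluster and "for" has nothing to match it.
print(re.search(pattern, with_both_spaces))

# Without that second space the same fragment matches.
print(re.search(pattern, with_one_space) is not None)
```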

Possible solutions with regexes:

1 Use possessive quantifiers instead of the standard greedy quantifiers. Unfortunately, according to this page, Python does not support possessive quantifiers (the standard re module only gained them in Python 3.11).

2 Replace the pattern #*\\s*#* with (#|\\s)*, which will reduce the number of ways your regex can match a text. Note that the changed regex can match more than your original one (specifically, the suggested pattern will match the text ## ## ##, which the original pattern does not match).
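A sketch of the second suggestion follows. It uses the character class [#\s]*, which accepts the same text as (#|\s)*, and it generates the pattern from the phrase instead of writing it by hand. The helper name `build_pattern` is hypothetical, and dropping whitespace from the phrase entirely is a design choice of this sketch: it means the pattern would also match text where words are run together.

```python
import re

def build_pattern(phrase):
    """Join the non-space characters of `phrase` with an optional
    run of '#' and whitespace between each pair."""
    chars = [re.escape(ch) for ch in phrase if not ch.isspace()]
    return r"[#\s]*".join(chars)

text = ("Relationship to the #################strategic framework "
        "################## for the period 2014-2015####################: "
        "Programme 7")
phrase = "strategic framework for the period 2014-2015: Programme 7"

m = re.search(build_pattern(phrase), text)
print(m.span() if m else None)
```

Since no literal character of the phrase can also be consumed by [#\s]*, each gap can be matched in only one way, which is what keeps the backtracking bounded.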

In a previous answer, I used the re and difflib modules, together with your principle of replacing every tag with a character.
But I realized that your problem can be solved using only re, without having to substitute an arbitrary character.

Import

import re

Data

I used tuples to be able to display the data in a more readable form during execution.

Note that I slightly modified the data to avoid some problems:
only one blank between framework and for the period,
a capital P in Programme 7 in both strings, etc.

Note also that I added a sequence of characters ### in phrase and xmltext (in front of the date 2014-2015) to show that my code still works in this case. Other answers don't handle this eventuality.

Phrase

tu_phrase = ('Relationship to the ',
             'strategic framework ',
             'for the period ###2014-2015',
             ': Programme 7, Economic and Social Affairs, ',
             'subprogramme 3, expected accomplishment (c)')
phrase = ''.join(tu_phrase)

XML text

tu_xmltext = ('EEEEEEE',
              '<w:rPr>',
              'AAAAAAA',
              '</w:rPr><w:t>',
              'Relationship to the ',
              '</w:t></w:r><w:r>',
              '<w:rPr><w:i/>',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>',
              'strategic framework ',
              '</w:t></w:r><w:r wsp:rsidRPr="00EC3076">',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>',
              '</w:rPr><w:t>',
              'for the period ###2014-2015',
              '</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr>',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>',
              '</w:rPr><w:t>',
              ': Programme 7, Economic and Social Affairs, ',
              'subprogramme 3, expected accomplishment (c)',
              '</w:t>',
              '321354641331')
xmltext = ''.join(tu_xmltext)

The working functions

The function olding_the_new(stuvw, pat_for_sub) returns a list of triples (pmod, w, pori)
expressing the correspondence between the positions of the common sequences
in stuvw and in re.sub(pat_for_sub, '', stuvw).
These sequences are the ones in stuvw that aren't matched by pat_for_sub; in the code below they are what group(1) captures:
- (pmod, w) describes a sequence in re.sub(pat_for_sub, '', stuvw)
- pmod is its position in re.sub(pat_for_sub, '', stuvw)
- w is its width [it's the same in re.sub(pat_for_sub, '', stuvw) and stuvw]
- pori is the position of this sequence in the original stuvw

def olding_the_new(stuvw,pat_for_sub):
    triples = []
    pmod = 0 # pmod = position in modified stuvw,
             # that is to say in re.sub(pat_for_sub,'',stuvw)
    for mat in re.finditer(r'{0}|([\s\S]+?)(?={0}|\Z)'.format(pat_for_sub),
                           stuvw):
        if mat.group(1):
            triples.append((pmod,mat.end()-mat.start(),mat.start()))
            pmod += mat.end()-mat.start()
    return triples


def finding(LITTLE,BIG,pat_for_sub,
            olding_the_new=olding_the_new):
    triples = olding_the_new(BIG,'(?:%s)+' % pat_for_sub)
    modBIG = re.sub(pat_for_sub,'',BIG)
    modLITTLE = re.escape(LITTLE)
    for mat in re.finditer(modLITTLE,modBIG):
        st,nd = mat.span() # in modBIG
        sori = -1 # start original, id est in BIG
        for tr in triples:
            if st < tr[0]+tr[1] and sori<0:
                sori = tr[2] + st - tr[0] 
            if nd<=tr[0]+tr[1]:
                yield(sori, tr[2] + nd - tr[0])
                break

Execution

if __name__ == '__main__':
    print ('---------- phrase ----------\n%s\n'
           '\n------- phrase written in a readable form --------\n'
           '%s\n\n\n'
           '---------- xmltext ----------\n%s\n'
           '\n------- xmltext written in a readable form --------\n'
           '%s\n\n\n'
           %
           (phrase  , '\n'.join(tu_phrase),
            xmltext , '\n'.join(tu_xmltext))    )

    print ('*********************************************************\n'
           '********** Searching for phrase in xmltext **************\n'
           '*********************************************************')

    spans = finding(phrase,xmltext,'</?w:[^>]*>')
    if spans:
        for s,e in spans:
            print ("\nspan in string 'xmltext' :  (%d , %d)\n\n"
                   'xmltext[%d:%d] :\n%s'
                   % (s,e,s,e,xmltext[s:e]))
    else:
        print ("-::: The first string isn't in second string :::-")

RESULT

*********************************************************
********** Searching for phrase in xmltext **************
*********************************************************

span in string 'xmltext' :  (34 , 448)

xmltext[34:448] :
Relationship to the </w:t></w:r><w:r><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/>strategic framework </w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>for the period ###2014-2015</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)

Nota Bene

My code can't detect a phrase in an XML document when the sequences of whitespace between two words are not exactly the same in the phrase and in the XML text.
I tried to support this, but it is too complicated.
In your example, in the XML sequence you show, there's a blank between strategic framework and the following tags, and another blank between these tags and the following for the period. Under this condition my code couldn't work (I doubt that the other answers do better on this point), so I used an xmltext without a blank in front of for the period.
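One possible way around this whitespace limitation, offered as a sketch and not as part of the code above, is to strip the tags while recording original indices, and then to match the words of the phrase separated by \s+. The function name `find_phrase` and the sample XML are made up for illustration; the tag pattern is the one passed to finding in this answer.

```python
import re

TAG = re.compile(r"</?w:[^>]*>")

def find_phrase(phrase, xml):
    """Yield (start, end) spans of `phrase` in `xml`, skipping tags and
    tolerating any amount of whitespace between words."""
    # kept[i] is the index in `xml` of the i-th character outside tags.
    kept, pos = [], 0
    for m in TAG.finditer(xml):
        kept.extend(range(pos, m.start()))
        pos = m.end()
    kept.extend(range(pos, len(xml)))
    clean = "".join(xml[i] for i in kept)

    # Words must appear in order, separated by one or more whitespace chars.
    pattern = r"\s+".join(re.escape(w) for w in phrase.split())
    for m in re.finditer(pattern, clean):
        yield (kept[m.start()], kept[m.end() - 1] + 1)

xml = 'x<w:t>strategic framework </w:t><w:r><w:t> for the</w:t>'
spans = list(find_phrase('strategic framework for the', xml))
print(spans)
```

The spans refer to the original XML, so the two blanks on either side of the tags are simply absorbed by the \s+ between framework and for.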

On the other hand, my code doesn't use a replacement character, so any character may appear in the XML document and in the phrase without clashing with a character chosen as the replacement.

My code directly gives the span in the original XML document, not in an intermediary text modified with a replacement character.

It gives all the occurrences of phrase in the XML document, not only the first.

................................

With the following data:

print ('\n*********************************************************\n'
       "********* Searching for 'foobar' in samples *************\n"
       '*********************************************************')

for xample in ('fo##o##ba###r## aaaaaBLfoob##arAH',
               '#fo##o##ba###r## aaaaaBLfoob##arAH',
               'BLAHHfo##o##ba###r   BLfoob##arAH',
               'BLAH#fo##o##ba###rBLUHYfoob##arAH',
               'BLA# fo##o##ba###rBLyyyfoob##ar',
               'BLA# fo##o##ba###rBLy##foob##ar',
               'kjhfqshqsk'):
    spans = list(finding('foobar',xample,'#'))
    if spans:
        print ('\n%s\n%s'
               %
               (xample,
                '\n'.join('%s  %s'
                          % (sp,xample[sp[0]:sp[1]])
                          for sp in spans))
               )
    else:
        print ("\n%s\n-::: Not found :::-" % xample)

The results are:

*********************************************************
********* Searching for 'foobar' in samples *************
*********************************************************

fo##o##ba###r## aaaaaBLfoob##arAH
(0, 13)  fo##o##ba###r
(23, 31)  foob##ar

#fo##o##ba###r## aaaaaBLfoob##arAH
(1, 14)  fo##o##ba###r
(24, 32)  foob##ar

BLAHHfo##o##ba###r   BLfoob##arAH
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

BLAH#fo##o##ba###rBLUHYfoob##arAH
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

BLA# fo##o##ba###rBLyyyfoob##ar
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

BLA# fo##o##ba###rBLy##foob##ar
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

kjhfqshqsk
-::: Not found :::-

..........................................

With the following code, I examined your question:

import urllib

sock = urllib.urlopen('http://stackoverflow.com/'
                      'questions/17381982/'
                      'python-regex-catastrophic-backtracking-where')
r =sock.read()
sock.close()

i = r.find('unpredictable, such as the following')
j = r.find('in order to match the following phrase')
k = r.find('I came up with this regex ')

print 'i == %d   j== %d' % (i,j)
print repr(r[i:j])


print
print 'j == %d   k== %d' % (j,k)
print repr(r[j:k])

The result is:

i == 10408   j== 10714
'unpredictable, such as the following:</p>\n\n<blockquote>\n  Relationship to the #################strategic framework ################## for the period 2014-2015####################: Programme 7, Economic and Social Affairs, subprogramme 3, expected\n  \n  <p>accomplishment (c)#######</p>\n</blockquote>\n\n<p>so '

j == 10714   k== 10955
'in order to match the following phrase:</p>\n\n<blockquote>\n  <p>Relationship to the strategic framework for the period 2014-2015:\n  programme 7, Economic and Social Affairs, subprogramme 3, expected\n  accomplishment (c)</p>\n</blockquote>\n\n<p>'

Note the additional \n in front of programme 7, the additional \n <p> in front of accomplishment, the difference between Programme 7 and programme 7, and the presence of two blanks between framework and for the period in the string framework ################## for the period.
This may explain your difficulties with your example.

The following code shows that FMc's code doesn't work.

The line
from name_of_file import olding_the_new,finding refers to the code in my own answer to this question in this thread.
* Give the name name_of_file to the file containing the script of my code (found in my other answer in this thread) and it will run.
* Or, if you don't want to copy-paste my code, just comment out this import line, and the following code will run because I put in a try-except instruction that will react correctly to the absence of olding_the_new and finding.

I used two ways to verify the results of FMc's code:
-1/ comparing the span returned by his code with the index of 'f' and the index of 'r', since we search for the phrase 'foobar' and I made sure that there are no f and r other than the ones in foobar
-2/ comparing with the first span returned by my code, hence the need for the above import from name_of_file

Nota Bene

If disp = None is changed to disp = True, the execution displays intermediary results that help to understand the algorithm.

.

import re
from name_of_file import olding_the_new,finding

def main():
    # Two versions of the text: the original,
    # and one without any of the "#" markers.
    for text_orig  in ('BLAH ##BLAH fo####o##ba###r## BL##AH',
                       'jkh##jh#f',
                       '#f#oo##ba###r##',
                       'a##xf#oo##ba###r##',
                       'ax##f#oo##ba###r##',
                       'ab###xyf#oo##ba###r##',
                       'abx###yf#oo##ba###r##',
                       'abxy###f#oo##ba###r##',
                       'iji#hkh#f#oo##ba###r##',
                       'mn##pps#f#oo##ba###r##',
                       'mn##pab###xyf#oo##ba###r##',
                       'lmn#pab###xyf#oo##ba###r##',
                       'fo##o##ba###r## aaaaaBLfoob##arAH',
                       'fo#o##ba####r## aaaaaBLfoob##ar#AH',
                       'f##oo##ba###r## aaaaaBLfoob##ar',
                       'f#oo##ba####r## aaaaBL#foob##arAH',
                       'f#oo##ba####r## aaaaBL#foob##ar#AH',
                       'foo##ba#####r## aaaaBL#foob##ar',
                       '#f#oo##ba###r## aaaBL##foob##arAH',
                       '#foo##ba####r## aaaBL##foob##ar#AH',
                       '#af#oo##ba##r## aaaBL##foob##ar',
                       '##afoo##ba###r## aaaaaBLfoob##arAH',
                       'BLAHHfo##o##ba###r aaBLfoob##ar#AH',
                       'BLAH#fo##o##ba###r aaBLfoob##ar',
                       'BLA#Hfo##o##ba###r###BLfoob##ar',
                       'BLA#Hfo##o##ba###r#BL##foob##ar',
                       ):

        text_clean = text_orig.replace('#', '')
        # Collect data on the positions and widths
        # of the markers in the original text.
        rgx     = re.compile(r'#+')
        markers = [(m.start(), len(m.group()))
                   for m in rgx.finditer(text_orig)]

        # Find the location of the search phrase in the cleaned text.
        # At that point you'll have all the data you need to compute
        # the span of the phrase in the original text.
        search = 'foobar'
        try:
            i = text_clean.index(search)
            print ('text_clean == %s\n'
                   "text_clean.index('%s')==%d   len('%s') == %d\n"
                   'text_orig  == %s\n'
                   'markers  == %s'
                   % (text_clean,
                      search,i,search,len(search),
                      text_orig,
                      markers))
            S,E = compute_span(i, len(search), markers)
            print "span = (%d,%d)  %s %s     %s"\
                  % (S,E,
                     text_orig.index('f')==S,
                     text_orig.index('r')+1==E,
                     list(finding(search,text_orig,'#+')))
        except ValueError:
            print ('text_clean == %s\n'
                   "text_clean.index('%s')   ***Not found***\n"
                   'text_orig  == %s\n'
                   'markers  == %s'
                   % (text_clean,
                      search,
                      text_orig,
                      markers))
        print '--------------------------------'

.

def compute_span(start, width, markers):
    # start and width are in expurgated text
    # markers are in original text
    disp = None # if disp==True => displaying of intermediary results
    span_start = start
    if disp:
        print ('\nAt beginning in compute_span():\n'
               '  span_start==start==%d   width==%d'
               % (start,width))
    for s, w in markers: # s and w are in original text
        if disp:
            print ('\ns,w==%d,%d'
                   '   s+w-1(%d)<start(%d) %s'
                   '   s(%d)==start(%d) %s'
                   % (s,w,s+w-1,start,s+w-1<start,s,start,s==start))
        if s + w - 1 < start:
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmmwmwmwmwmwm
            # the following if-else section is justified to be used
            # only after correction of the above line to this one:
            # if s+w-1 <= start or s==start:
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwm
            if s + w - 1 <= start and disp:
                print '  1a) s + w - 1 (%d) <= start (%d)   marker at left'\
                      % (s+w-1, start)
            elif disp:
                print '  1b) s(%d) == start(%d)' % (s,start)
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmmwmwmwmwmwm
            # Situation: marker fully to left of our text.
            # Adjust our start points rightward.
            start      += w
            span_start += w
            if disp:
                print '  span_start == %d   start, width == %d, %d' % (span_start, start, width)
        elif start + width - 1 < s:
            if disp:
                print ('  2) start + width - 1 (%d) < s (%d)   marker at right\n'
                       '  break' % (start+width-1, s))
            # Situation: marker fully to the right of our text.
            break
        else:
            # Situation: marker interrupts our text.
            # Advance the start point for the remaining text
            # rightward, and reduce the remaining width.
            if disp:
                print "  3) In 'else': s - start == %d   marker interrupts" % (s - start)
            start += w
            width = width - (s - start)
            if disp:
                print '  span_start == %d   start, width == %d, %d' % (span_start, start, width)
    return (span_start, start + width)

.

main()

Result

>>> 
text_clean == BLAH BLAH foobar BLAH
text_clean.index('foobar')==10   len('foobar') == 6
text_orig  == BLAH ##BLAH fo####o##ba###r## BL##AH
markers  == [(5, 2), (14, 4), (19, 2), (23, 3), (27, 2), (32, 2)]
span = (12,26)  True False     [(12, 27)]
--------------------------------
text_clean == jkhjhf
text_clean.index('foobar')   ***Not found***
text_orig  == jkh##jh#f
markers  == [(3, 2), (7, 1)]
--------------------------------
text_clean == foobar
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == #f#oo##ba###r##
markers  == [(0, 1), (2, 1), (5, 2), (9, 3), (13, 2)]
span = (0,11)  False False     [(1, 13)]
--------------------------------
text_clean == axfoobar
text_clean.index('foobar')==2   len('foobar') == 6
text_orig  == a##xf#oo##ba###r##
markers  == [(1, 2), (5, 1), (8, 2), (12, 3), (16, 2)]
span = (2,16)  False True     [(4, 16)]
--------------------------------
text_clean == axfoobar
text_clean.index('foobar')==2   len('foobar') == 6
text_orig  == ax##f#oo##ba###r##
markers  == [(2, 2), (5, 1), (8, 2), (12, 3), (16, 2)]
span = (2,15)  False False     [(4, 16)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == ab###xyf#oo##ba###r##
markers  == [(2, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,19)  False True     [(7, 19)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == abx###yf#oo##ba###r##
markers  == [(3, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,18)  False False     [(7, 19)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == abxy###f#oo##ba###r##
markers  == [(4, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,19)  False True     [(7, 19)]
--------------------------------
text_clean == ijihkhfoobar
text_clean.index('foobar')==6   len('foobar') == 6
text_orig  == iji#hkh#f#oo##ba###r##
markers  == [(3, 1), (7, 1), (9, 1), (12, 2), (16, 3), (20, 2)]
span = (7,18)  False False     [(8, 20)]
--------------------------------
text_clean == mnppsfoobar
text_clean.index('foobar')==5   len('foobar') == 6
text_orig  == mn##pps#f#oo##ba###r##
markers  == [(2, 2), (7, 1), (9, 1), (12, 2), (16, 3), (20, 2)]
span = (7,18)  False False     [(8, 20)]
--------------------------------
text_clean == mnpabxyfoobar
text_clean.index('foobar')==7   len('foobar') == 6
text_orig  == mn##pab###xyf#oo##ba###r##
markers  == [(2, 2), (7, 3), (13, 1), (16, 2), (20, 3), (24, 2)]
span = (9,24)  False True     [(12, 24)]
--------------------------------
text_clean == lmnpabxyfoobar
text_clean.index('foobar')==8   len('foobar') == 6
text_orig  == lmn#pab###xyf#oo##ba###r##
markers  == [(3, 1), (7, 3), (13, 1), (16, 2), (20, 3), (24, 2)]
span = (9,24)  False True     [(12, 24)]
--------------------------------
text_clean == foobar aaaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == fo##o##ba###r## aaaaaBLfoob##arAH
markers  == [(2, 2), (5, 2), (9, 3), (13, 2), (27, 2)]
span = (0,9)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == fo#o##ba####r## aaaaaBLfoob##ar#AH
markers  == [(2, 1), (4, 2), (8, 4), (13, 2), (27, 2), (31, 1)]
span = (0,7)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaaBLfoobar
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == f##oo##ba###r## aaaaaBLfoob##ar
markers  == [(1, 2), (5, 2), (9, 3), (13, 2), (27, 2)]
span = (0,11)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == f#oo##ba####r## aaaaBL#foob##arAH
markers  == [(1, 1), (4, 2), (8, 4), (13, 2), (22, 1), (27, 2)]
span = (0,8)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == f#oo##ba####r## aaaaBL#foob##ar#AH
markers  == [(1, 1), (4, 2), (8, 4), (13, 2), (22, 1), (27, 2), (31, 1)]
span = (0,8)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobar
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == foo##ba#####r## aaaaBL#foob##ar
markers  == [(3, 2), (7, 5), (13, 2), (22, 1), (27, 2)]
span = (0,7)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == #f#oo##ba###r## aaaBL##foob##arAH
markers  == [(0, 1), (2, 1), (5, 2), (9, 3), (13, 2), (21, 2), (27, 2)]
span = (0,11)  False False     [(1, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == #foo##ba####r## aaaBL##foob##ar#AH
markers  == [(0, 1), (4, 2), (8, 4), (13, 2), (21, 2), (27, 2), (31, 1)]
span = (0,12)  False False     [(1, 13), (23, 31)]
--------------------------------
text_clean == afoobar aaaBLfoobar
text_clean.index('foobar')==1   len('foobar') == 6
text_orig  == #af#oo##ba##r## aaaBL##foob##ar
markers  == [(0, 1), (3, 1), (6, 2), (10, 2), (13, 2), (21, 2), (27, 2)]
span = (2,10)  True False     [(2, 13), (23, 31)]
--------------------------------
text_clean == afoobar aaaaaBLfoobarAH
text_clean.index('foobar')==1   len('foobar') == 6
text_orig  == ##afoo##ba###r## aaaaaBLfoob##arAH
markers  == [(0, 2), (6, 2), (10, 3), (14, 2), (28, 2)]
span = (1,14)  False True     [(3, 14), (24, 32)]
--------------------------------
text_clean == BLAHHfoobar aaBLfoobarAH
text_clean.index('foobar')==5   len('foobar') == 6
text_orig  == BLAHHfo##o##ba###r aaBLfoob##ar#AH
markers  == [(7, 2), (10, 2), (14, 3), (27, 2), (31, 1)]
span = (5,14)  True False     [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobar aaBLfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == BLAH#fo##o##ba###r aaBLfoob##ar
markers  == [(4, 1), (7, 2), (10, 2), (14, 3), (27, 2)]
span = (4,16)  False False     [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobarBLfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == BLA#Hfo##o##ba###r###BLfoob##ar
markers  == [(3, 1), (7, 2), (10, 2), (14, 3), (18, 3), (27, 2)]
span = (5,14)  True False     [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobarBLfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == BLA#Hfo##o##ba###r#BL##foob##ar
markers  == [(3, 1), (7, 2), (10, 2), (14, 3), (18, 1), (21, 2), (27, 2)]
span = (5,14)  True False     [(5, 18), (23, 31)]
--------------------------------
>>> 

.

---------------------------------------------

FMc's code is very subtle; it took me a long time to understand its principle and then to be able to correct it.
I leave the task of understanding the algorithm to the reader. I only state the corrections required to make FMc's code work correctly:

.

First correction:

if s + w - 1 < start:
# must be changed to  
if s + w - 1 <= start or (s==start):

EDIT

In the initial version of this answer,
I had written ... or (s<=start) .
That was my mistake; I had in fact intended to write
.. or (s==start)

NOTA BENE about this EDIT:

This error had no consequence for the code obtained by applying the two corrections I describe here to FMc's initial code (the very first version; he has since changed it twice).
Indeed, if you apply the two corrections, you obtain correct results for all 25 examples of text_orig , with ... or (s <= start) as well as with ... or (s==start) .
So I thought that the case in which s < start is True could never happen when the first condition s+w-1 <= start is False, presumably because w is always greater than 0, together with some other reason due to the configuration of the markers and non-marker sequences.
So I tried to prove this impression... and I failed.
Moreover, I reached a state of mind in which I no longer even understand FMc's algorithm (the very first one, before any of his edits)!
Despite this, I leave this answer as it is, and at its end I post the explanations of why these corrections are needed.
But I warn: FMc's very first algorithm is very outlandish and difficult to comprehend, because it compares indices belonging to two different strings: text_orig, with its #### markers, and another string cleaned of all these markers. I am no longer convinced that this makes sense.

.

Second correction:

start += w
width = width - (s - start)
# must be changed to   
width -= (s-start) # this line MUST BE before the following one
start = s + w # because start += (s-start) + w

-------------------

I am stunned that 2 people upvoted FMc's answer although it gives wrong code. It means that they upvoted an answer without having tested the given code.

----------------------------------------

.

EDIT

Why must the condition if s + w - 1 < start: be changed to this one:
if s + w - 1 <= start or (s==start): ?

Because it may happen that s + w - 1 < start is False while s equals start.
In this case, the execution goes to the else section and executes (in the corrected code):

width -= (s - start)
start = s + w

Consequently, width doesn't change, although it evidently should, as the sequences concerned show.

This case may occur at the moment the first marker is examined, as with the following sequences:

'#f#oo##ba###r##' : s,w==0,1 , 0==s==start==0  
'ax##f#oo##ba###r##' : s,w==2,2 , 2==s==start==2    
'abxy###f#oo##ba###r##' : s,w==4,3 , 4==s==start==4  
'#f#oo##ba###r## aaaBL##foob##arAH' : s,w==0,1 , 0==s==start==0  
'BLAH#fo##o##ba###r aaBLfoob##ar' : s,w==4,1 , 4==s==start==4

With the following ones, it occurs when the second marker is examined:

'iji#hkh#f#oo##ba###r##' : s,w==7,1 , 7==s==start==7  
'mn##pps#f#oo##ba###r##' : s,w==7,1 , 7==s==start==7  

It can be understood better by executing my code with disp = True set.

When s + w - 1 <= start is True, the fact that s may equal start isn't troublesome, because the execution doesn't go to the else section; it goes to the first section, in which there is only the addition of w to s and to start .
But when s + w - 1 <= start is False while s equals start , the execution goes to the else section, where the instruction width -= (s-start) leaves the value of width unchanged, and that's troublesome.
So the condition or (s==start) must be added to prevent this else destination, and it must be placed after an or so that this destination is prevented even when s+w-1 <= start is False, which can happen, as some of the examples show.
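
To see the effect of the first correction in isolation, the short sketch below evaluates the original and the corrected condition on some of the (s, w, start) triples quoted above. The function names and the dictionary are illustrative only; they are not part of FMc's code, and the sketch checks just the branch decision, not the whole algorithm.

```python
def first_branch_original(s, w, start):
    # Condition as written in FMc's initial code.
    return s + w - 1 < start

def first_branch_corrected(s, w, start):
    # Condition after the first correction described in this answer.
    return s + w - 1 <= start or s == start

# (s, w, start) at the moment the problematic marker is examined,
# copied from the sequences listed above.
cases = {
    '#f#oo##ba###r##': (0, 1, 0),
    'ax##f#oo##ba###r##': (2, 2, 2),
    'abxy###f#oo##ba###r##': (4, 3, 4),
    'iji#hkh#f#oo##ba###r##': (7, 1, 7),
    'mn##pps#f#oo##ba###r##': (7, 1, 7),
}

for text, (s, w, start) in cases.items():
    # False means the execution falls through to the else section.
    print(text,
          first_branch_original(s, w, start),
          first_branch_corrected(s, w, start))
```

With s equal to start and w at least 1, the original test s + w - 1 < start can never be True, so every one of these cases reaches the else section unless the condition is corrected.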

.

As for the fact that the condition s+w-1 < start must be changed to s+w-1 <= start (with =),
it is because of the case w==1 , corresponding to a single # character,
as in the cases
mn##pps#f#oo##ba###r## (second marker)
and BLAH#fo##o##ba###r (first marker).

You can get what you want without using a regex:

text.replace('#','').replace('  ',' ')
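
The one-liner gives the cleaned text but not where a phrase sits in the original string, which is what the question asks for. A minimal sketch of the same regex-free idea that also keeps the original position of every surviving character is shown below. It assumes that # occurs only as tag filler, and orig_span is a hypothetical helper name, not something taken from the answers on this page.

```python
def orig_span(text_orig, phrase, marker='#'):
    """Return the (start, end) span of phrase in text_orig, ignoring markers."""
    # positions[i] is the index in text_orig of the i-th non-marker character.
    positions = [i for i, ch in enumerate(text_orig) if ch != marker]
    text_clean = ''.join(text_orig[i] for i in positions)

    k = text_clean.find(phrase)
    if not phrase or k == -1:
        return None

    # Map the first and last matched characters back to text_orig;
    # the end index is exclusive, as in re.Match.span().
    return positions[k], positions[k + len(phrase) - 1] + 1


print(orig_span('abx###yf#oo##ba###r##', 'foobar'))
print(orig_span('BLAH#fo##o##ba###r aaBLfoob##ar', 'foobar'))
```

Because the search runs with str.find on the cleaned text, there is no backtracking at all; the cost is one extra list holding an index per character.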

Depth-first search with an XML parser?

Maybe remember the position in the XML document where the text node was found, for later reverse lookup. Your actual goal is still unclear.
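
The "remember the position" idea can be sketched as follows. Note that this sketch swaps in a simple regex tokenizer for a real XML parser, so it is only an approximation: it assumes that no tag contains a > inside an attribute value and it ignores comments, CDATA sections and entities. find_in_xml_text and TOKEN are hypothetical names.

```python
import re

# A tag, or a run of text between tags. The two alternatives cannot start
# with the same character, so the pattern does not backtrack heavily.
TOKEN = re.compile(r'<[^>]*>|[^<]+')

def find_in_xml_text(xml, phrase):
    """Return the (start, end) span in xml covering phrase, skipping tags."""
    positions = []  # positions[i] = offset in xml of the i-th text character
    chars = []
    for m in TOKEN.finditer(xml):
        if m.group().startswith('<'):
            continue  # tags are skipped; only text between tags is searched
        for offset, ch in enumerate(m.group(), start=m.start()):
            positions.append(offset)
            chars.append(ch)

    text = ''.join(chars)
    k = text.find(phrase)
    if not phrase or k == -1:
        return None
    return positions[k], positions[k + len(phrase) - 1] + 1


xml = '<w:t> To i</w:t></w:r><w:r><w:t>ncrease knowledge</w:t>'
span = find_in_xml_text(xml, 'To increase')
print(span, xml[span[0]:span[1]])
```

The returned span refers to the unmodified document, so the tags that interrupt the phrase are included in the slice and no cleaned copy has to be kept for later processing.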
