
Find a substring from a structured string using startswith and index

I am trying to write code that finds a substring (a number or anything else) in a structured string. The string has one of two possible structures:

  1. string = '1x substring 3x 4x'

  2. string = '4x 3x substring 1x'

    • x can be any character
    • the substring is formatted like 'pos. 2'

The normal case works with the code below, but now I would also like to handle the special cases. I have tried i.startswith(('3','4')), but that didn't work.

Strings 1-8 should explain the logic using a simple example.

Strings 9-10 show a complex example. The code should extract the substrings at pos. 2, 5 and 7.

I hope you can help me find a solution for all strings / special cases that gives clean: 80 in every case :-)

string 9:

clean: pos2 ='$80' pos5 = '75.000 kg' pos7 = '22 sec'

string 10:

clean: pos2 = '$67' pos5 = '69.000kg' pos7 = '12sec'


#str1-8 easy example strings

#normal string
str1 = '1x 80 3x 4x'
str2 = '4x 3x 80 1x'

# missing number/pos. 3
str3 = '1x 80 4x '
str4 = '4x 80 1x'
str3a = '1. A  $67  4. A  69.000kg  6. A  12sec  8. B  9. B'
# result: clean: pos2 = '$67'   pos5 = '69.000kg'    pos7 = '12sec'
str4a = '9B 8B 22 sec 6A 75.000kg 4b  $80 1b'
# result: clean: pos2 = '$80'   pos5 = '75.000 kg'   pos7 = '22 sec'

# missing number/pos. 1, => number is at the start or end of the string
str5 = '80 3x 4x'
str6 = '4x 3x 80'
str5a = '10 Mrd 3: A 4: A  50 .379 6: A   7:19   8: B 9: D '
# result: clean: pos2 = 10 Mrd, pos5 = 50,379, pos7 = 7:19 (or just 19 in the raw string without 7: if that is easier)
str6a = '  9a 8b 10 6b 60000 4a 3 b 50 '
# result: clean: pos2 = 50, pos5 = 60000, pos7 = 10

# Optional (rare case)
# missing number/pos. 1 and 3
str7 = '80 4x'
str8 = '4x 80'
str7a = '10 Mrd 4: A  50 .379 6: A   7:19   8: B 9: D '
# result: clean: pos2 = 10 Mrd, pos5 = 50,379, pos7 = 7:19 (or just 19 in the raw string without 7: if that is easier)
str8a = ' 9a 8b 10 6b 60000 4a  50 '
# result: clean: pos2 = 50, pos5 = 60000, pos7 = 10
# complex realistic strings
str9 = '9B 8B 22 sec 6A 75.000kg 4b 3b  $80 1b'
str10 = '1. A  $67  3. A  4. A  69.000kg  6. A  12sec  8. B  9. B'

# missing number/pos. 4 or 6 (Pos6 Optional, cause thats difficult i guess)
str11 = '1. A  $67  3. A    69.000kg  6. A  12sec  8. B  9. B'
# result: clean: pos2 = '$67'   pos5 = '69.000kg'    pos7 = '12sec'
str12 = '1. A  $67  3. A  4a  69.000kg   12sec  8. B  9. B'
# result: clean: pos2 = '$67'   pos5 = '69.000kg'    pos7 = '12sec'

x_list = [str1,str2,str3,str4,str5,str6,str7,str8, str9, str10, str11,str12]

for x in x_list:
    print ("raw             "+x)
    
    values = ['1x', '3x', '4x']
    try:
        for i in values:
            if i.startswith('3') :
                foo=i

            if  i.startswith("1") :
                baa=i 

            start=x.index(foo) + len( foo )
            end=x.index(baa)   

            if start < end:
                pass
                number = x[start:end].strip(' ')


            else:
                start=x.index(baa) + len( baa )
                end=x.index(foo) 
        
                number = x[start:end].strip(' ')
            
    except: 
        number ='0' 
    
    print ("clean           "+number)

Output:

raw             1x 80 3x 4x
clean           80
raw             4x 3x 80 1x
clean           80
raw             1x 80 4x 
clean           0
raw             4x 80 1x
clean           0
raw             80 3x 4x
clean           0
raw             4x 3x 80
clean           0
raw             80 4x
clean           0
raw             4x 80
clean           0

This looks like a job for regular expressions.

>>> import re
>>> text="""1x 80 3x 4x
4x 3x 80 1x
1x 80 4x
4x 80 1x
80 3x 4x
4x 3x 80
80 4x
4x 80""".splitlines()
>>> text
['1x 80 3x 4x', '4x 3x 80 1x', '1x 80 4x', '4x 80 1x', '80 3x 4x', '4x 3x 80', '80 4x', '4x 80']
>>> for t in text:
        res = re.search("((1|4). )?(3. )?(?P<result>[^ ]+)(3. )?((1|4). )?",t)
        print(f"raw: {t!r}\nclean: {res['result']!r}")

    
raw: '1x 80 3x 4x'
clean: '80'
raw: '4x 3x 80 1x'
clean: '80'
raw: '1x 80 4x'
clean: '80'
raw: '4x 80 1x'
clean: '80'
raw: '80 3x 4x'
clean: '80'
raw: '4x 3x 80'
clean: '80'
raw: '80 4x'
clean: '80'
raw: '4x 80'
clean: '80'
>>> 

Here . represents any character, (a|b) matches either a or b for expressions a and b, and (...)? means that the inside is optional. So ((1|4). )? means 1 or 4 followed by one character and one space, all of it optional; the other groups work similarly. In (?P<result>[^ ]+) , (?P<name>...) defines a group named name , [^ ] is any character but a space, and the plus sign means we want one or more of them.
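The same pattern can be spelled out piece by piece with re.VERBOSE, which may make the explanation above easier to follow. This is only a sketch: the names MARKER_RE and extract are made up for illustration and are not part of the answer.

```python
import re

# Same pattern as above, written with re.VERBOSE. In verbose mode a literal
# space outside a character class must be escaped, hence the "\ " below.
MARKER_RE = re.compile(r"""
    ((1|4).\ )?          # optional marker 1 or 4: digit, any character, space
    (3.\ )?              # optional marker 3
    (?P<result>[^ ]+)    # the value: one or more non-space characters
    (3.\ )?              # optional marker 3 (reversed order)
    ((1|4).\ )?          # optional marker 1 or 4 (reversed order)
""", re.VERBOSE)

def extract(line):
    """Return the value captured by the named group, or None if nothing matches."""
    match = MARKER_RE.search(line)
    return match["result"] if match else None

for line in ["1x 80 3x 4x", "4x 3x 80 1x", "80 3x 4x", "4x 80"]:
    print(repr(line), "->", extract(line))
```

Note that a value which itself starts with 1, 3 or 4 and is two characters long (for example 15) would be taken for a marker, so this only suits data like the simple examples.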

UPDATE:

import re

# raw string, so that \. and \: are not treated as (invalid) string escapes
POSRE = r"(?P<pos>(?:[1-9](?:[\.\: ] ?)?[a-zA-Z](?: |$)))"

def extrator(rawtext):
    result = filter(None,map(str.strip,re.split(POSRE,rawtext)))
    result = [(x,int(x[0]) if re.match(POSRE,x) else None) for x in result]
    pos=[n for x,n in result if n]
    if sorted(pos)!=pos:
        result = list(reversed(result))
    final = [x for x,p in result if p is None]
    if len(final)==2 and 6 not in pos:
        a,b = final
        final = [a,*b.split()]
    elif len(final)<3:
        final.extend([None]*(3-len(final)))
    return final

So the main thing here is to identify the position markers, whose structure is known. For that I devised this regular expression, where we check for a single digit ( [1-9] ), optionally followed by a . or : or space plus an optional extra space ( (?:[\.\: ] ?)? ), then a letter ( [a-zA-Z] ), and then another space or the end of the string ( (?: |$) ). The (?:...) means that it is a non-capturing group; for more detail on those, check the documentation linked above.
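As a quick illustration of what that expression accepts, the sketch below classifies a few tokens taken from the example strings. The helper name is_marker is hypothetical; it simply wraps re.match, the same call the answer's code uses.

```python
import re

# POSRE copied from the answer, written as a raw string.
POSRE = r"(?P<pos>(?:[1-9](?:[\.\: ] ?)?[a-zA-Z](?: |$)))"

def is_marker(token):
    """True if the (already stripped) token looks like a position marker."""
    return bool(re.match(POSRE, token))

samples = ["1. A", "9B", "3: A", "3 b", "4a", "$67", "69.000kg", "22 sec", "7:19"]
for s in samples:
    print(f"{s!r:12} {'marker' if is_marker(s) else 'data'}")
```

The first five tokens come out as markers and the rest as data, because a data token either does not start with a digit followed by a letter, or the letter is not followed by a space or the end of the string.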

We use that in re.split to split the text into its matching and non-matching parts, which are then stripped of their surrounding whitespace; the parts that turn out to be empty are filtered out.
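The splitting step relies on the fact that re.split keeps the separators in its result when the pattern contains a capturing group (here the named group pos). A minimal sketch of that step alone, with split_parts as an illustrative name:

```python
import re

POSRE = r"(?P<pos>(?:[1-9](?:[\.\: ] ?)?[a-zA-Z](?: |$)))"

def split_parts(rawtext):
    """Split on markers, keep the markers themselves, strip and drop empties."""
    return [part for part in map(str.strip, re.split(POSRE, rawtext)) if part]

print(split_parts("9B 8B 22 sec 6A 75.000kg 4b  $80 1b"))
# ['9B', '8B', '22 sec', '6A', '75.000kg', '4b', '$80', '1b']
```

Markers and data stay interleaved in their original order, which is what the later steps depend on.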

We follow that by recording the position number of each part if it is a matching string, or None if it is not.

Then there are just a couple of simple checks: we look at the order the markers came in and reverse the list if needed, so we always return in the same order, extract what we need into final , check for the final case, adjust accordingly, and we are done.
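The order check can be looked at in isolation. The sketch below assumes the parts have already been paired with their marker number (or None for data), as described above; normalise is a made-up name for illustration.

```python
def normalise(parts):
    """parts: list of (text, marker_number_or_None) pairs in input order."""
    markers = [n for _, n in parts if n is not None]
    # Markers that are not in ascending order mean the line was written
    # back to front, so flip it.
    if markers != sorted(markers):
        parts = parts[::-1]
    # Whatever is not a marker is data, now always in position order.
    return [text for text, n in parts if n is None]

parts = [("9B", 9), ("8B", 8), ("22 sec", None), ("6A", 6),
         ("75.000kg", None), ("4b", 4), ("$80", None), ("1b", 1)]
print(normalise(parts))  # ['$80', '75.000kg', '22 sec']
```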

And a little test:

text="""1. A  $67  4. A  69.000kg  6. A  12sec  8. B  9. B
9B 8B 22 sec 6A 75.000kg 4b  $80 1b
10 Mrd 3: A 4: A  50 .379 6: A   7:19   8: B 9: D
9a 8b 10 6b 60000 4a 3 b 50
10 Mrd 4: A  50 .379 6: A   7:19   8: B 9: D
9a 8b 10 6b 60000 4a  50
9B 8B 22 sec 6A 75.000kg 4b 3b  $80 1b
1. A  $67  3. A    69.000kg  6. A  12sec  8. B  9. B
1. A  $67  3. A  4a  69.000kg   12sec  8. B  9. B
9a 8b 6b 4a 3 b 50 1b""".splitlines()

for t in text:
    print(f"raw: {t!r}\nresult: ",extrator(t) )
    print()

which gives us:

raw: '1. A  $67  4. A  69.000kg  6. A  12sec  8. B  9. B'
result:  ['$67', '69.000kg', '12sec']

raw: '9B 8B 22 sec 6A 75.000kg 4b  $80 1b'
result:  ['$80', '75.000kg', '22 sec']

raw: '10 Mrd 3: A 4: A  50 .379 6: A   7:19   8: B 9: D'
result:  ['10 Mrd', '50 .379', '7:19']

raw: '9a 8b 10 6b 60000 4a 3 b 50'
result:  ['50', '60000', '10']

raw: '10 Mrd 4: A  50 .379 6: A   7:19   8: B 9: D'
result:  ['10 Mrd', '50 .379', '7:19']

raw: '9a 8b 10 6b 60000 4a  50'
result:  ['50', '60000', '10']

raw: '9B 8B 22 sec 6A 75.000kg 4b 3b  $80 1b'
result:  ['$80', '75.000kg', '22 sec']

raw: '1. A  $67  3. A    69.000kg  6. A  12sec  8. B  9. B'
result:  ['$67', '69.000kg', '12sec']

raw: '1. A  $67  3. A  4a  69.000kg   12sec  8. B  9. B'
result:  ['$67', '69.000kg', '12sec']

raw: '9a 8b 6b 4a 3 b 50 1b'
result:  ['50', None, None]

UPDATE 2

Here is a version that identifies which data points we got, given a couple of assumptions:

  • there are only position markers and data, and the data is only at positions 2, 5 and 7
  • the previous regular expression can identify those position markers
  • any of those can go missing
  • there are no space characters in the data, so if any of the relevant position markers is missing and less data than expected is found, one of the extracted data points may contain several values grouped together and can safely be split with str.split ; if that is not the case, adjust those parts accordingly.

This results in a rather lengthy case-by-case check, which I hope is self-explanatory, and returns a dictionary that says which value belongs to which position.

Surely this can be refined, but no refinement has come to mind.

def extrator(rawtext):
    fil  = filter(None,map(str.strip,re.split(POSRE,rawtext)))
    proc = [(x,int(x[0]) if re.match(POSRE,x) else None) for x in fil] #process raw data
    pos  = [p for x,p in proc if p is not None ] #position markers presents
    if sorted(pos)!=pos:
        proc = list(reversed(proc))        
    data = [x for x,p in proc if p is None]
    pos = {p:i for i,(x,p) in enumerate(proc) if p is not None } #pos marker:index of it
    #print(f"{proc=}")
    if len(data)==3:
        return dict(zip((2,5,7),data))
    #from here, a,b,c will represent data in position 2,5 and 7 respectively
    elif len(data)==2:
        a,b = data
        #c = None
        if 3 in pos or 4 in pos:
            if 6 in pos:
                #one of 2, 5 or 7 is missing
                i = proc.index( (a,None) )
                i34 = pos[3] if 3 in pos else pos[4]
                if i < i34:
                    #a is 2, b is 5 or 7
                    j = proc.index( (b,None) )
                    if j < pos[6]:
                        #7 is missing
                        c = None
                    else:
                        #5 is missing
                        b,c = None,b
                else:
                    #2 is missing, a is 5 thus b is 7
                    a,b,c = None,a,b
            else:
                #a is 2, b may be 5 or 7 or both
                t = b.split()
                if len(t) == 2:
                    #b was both
                    b,c = t
                elif len(t) == 1:
                    #b is 5 or 7
                    print("either 5 or 7 is missing, picked 7 as missing")
                    c = None
                else:
                    #b was split into more than 2 parts
                    raise RuntimeError("unknown case 1")
        else:
            #3 and 4 are missing
            if 6 in pos:
                #a may be 2 or 5 or both, b is 7
                c = b
                t = a.split()
                if len(t) == 2:
                    #a was both
                    a,b = t
                elif len(t) == 1:
                    print("either 2 or 5 is missing, picked 5 as missing")
                    b = None
                else:
                    #a was split into more than 2 parts
                    raise RuntimeError("unknown case 2")
            else:
                raise RuntimeError("Fatal error: 2 data points with no marker in between")
        return dict(zip((2,5,7),(a,b,c)))
    elif len(data)==1:
        a = data[0]
        i = proc.index( (a,None) )
        #b,c = None, None
        if 3 in pos or 4 in pos:
            i34 = pos[3] if 3 in pos else pos[4]
            if 6 in pos:
                #only one of 2,5 or 7 are present
                if i < i34:
                    #a is 2 the rest is missing
                    b,c = None, None
                elif i < pos[6]:
                    #a is 5
                    a,b,c = None, a, None
                else:
                    #a is 7
                    a,b,c = None, None, a 
            else:
                #a is 2 or a is 5 or 7 or both
                if i < i34:
                    #a is 2, the rest is missing
                    b,c = None, None
                else:
                    #2 is missing, a is 5 or 7 or both 
                    a,b = None, a
                    t = b.split()
                    if len(t) == 2:
                        b,c = t
                    elif len(t) == 1:
                        print("either 5 or 7 is missing, picked 7 as missing")
                        c = None
                    else:
                        raise RuntimeError("unknown case 3")
        else:
            #3 and 4 are missing
            if 6 in pos:
                if pos[6] < i:
                    #a is 7, the rest is missing
                    a,b,c = None, None, a
                else:
                    #7 is missing, a is 2 or 5 or both
                    c = None
                    t = a.split()
                    if len(t) == 2:
                        a,b = t
                    elif len(t) == 1:
                        print("either 2 or 5 is missing, picked 5 as missing")
                        b = None
                    else:
                        raise RuntimeError("unknown case 4")
            else:
                #a is 2, 5 or 7 or any combination of them
                t = a.split()
                if len(t) == 3:
                    a,b,c = t
                elif len(t) == 2:
                    print("one of 2, 5 or 7 is missing, picked 7 as missing")
                    a,b = t
                    c = None
                elif len(t) == 1:
                    print("only one of 2, 5 or 7 is present, picked 2 as present")
                    b,c = None, None
                else:
                    raise RuntimeError("unknown case 5")
        return dict(zip((2,5,7),(a,b,c)))
    elif len(data) == 0:
        return dict.fromkeys( (2,5,7) )
    else:
        raise RuntimeError("unknown case 6: more than 3 data points")


def test():
    text="""1. A  $67  4. A  69.000kg  6. A  12sec  8. B  9. B
9B 8B 22 sec 6A 75.000kg 4b  $80 1b
10 Mrd 3: A 4: A  50 .379 6: A   7:19   8: B 9: D
9a 8b 10 6b 60000 4a 3 b 50
10 Mrd 4: A  50 .379 6: A   7:19   8: B 9: D
9a 8b 10 6b 60000 4a  50
9B 8B 22 sec 6A 75.000kg 4b 3b  $80 1b
1. A  $67  3. A    69.000kg  6. A  12sec  8. B  9. B
1. A  $67  3. A  4a  69.000kg   12sec  8. B  9. B
9a 8b 6b 4a 3 b 50 1b
9 a 8b 6 b 55 4a 3 b 1b
9a 8:b 777 6 b 4.a 3 b 1b
9a 8:b 777 6 b 4.a 3 b 55 1b
""".splitlines()

    for t in text:
        print(f"raw: {t!r}\nresult: ",extrator(t) )
        print()

Output:

>>> test()
raw: '1. A  $67  4. A  69.000kg  6. A  12sec  8. B  9. B'
result:  {2: '$67', 5: '69.000kg', 7: '12sec'}

raw: '9B 8B 22 sec 6A 75.000kg 4b  $80 1b'
result:  {2: '$80', 5: '75.000kg', 7: '22 sec'}

raw: '10 Mrd 3: A 4: A  50 .379 6: A   7:19   8: B 9: D'
result:  {2: '10 Mrd', 5: '50 .379', 7: '7:19'}

raw: '9a 8b 10 6b 60000 4a 3 b 50'
result:  {2: '50', 5: '60000', 7: '10'}

raw: '10 Mrd 4: A  50 .379 6: A   7:19   8: B 9: D'
result:  {2: '10 Mrd', 5: '50 .379', 7: '7:19'}

raw: '9a 8b 10 6b 60000 4a  50'
result:  {2: '50', 5: '60000', 7: '10'}

raw: '9B 8B 22 sec 6A 75.000kg 4b 3b  $80 1b'
result:  {2: '$80', 5: '75.000kg', 7: '22 sec'}

raw: '1. A  $67  3. A    69.000kg  6. A  12sec  8. B  9. B'
result:  {2: '$67', 5: '69.000kg', 7: '12sec'}

raw: '1. A  $67  3. A  4a  69.000kg   12sec  8. B  9. B'
result:  {2: '$67', 5: '69.000kg', 7: '12sec'}

raw: '9a 8b 6b 4a 3 b 50 1b'
result:  {2: '50', 5: None, 7: None}

raw: '9 a 8b 6 b 55 4a 3 b 1b'
result:  {2: None, 5: '55', 7: None}

raw: '9a 8:b 777 6 b 4.a 3 b 1b'
result:  {2: None, 5: None, 7: '777'}

raw: '9a 8:b 777 6 b 4.a 3 b 55 1b'
result:  {2: '55', 5: None, 7: '777'}

>>> 

If I understand your goals correctly, it seems to me that you are far over-complicating the process. I wrote a function which splits the input string into a list and checks whether each segment matches the format of 1x , 3x , or 4x . Check it out and let me know if it's not what you need.

# we use regex to check for matches with the format
import re

def find_substr(x):
    # break on whitespace into a list
    seg = x.split()
    # check each word for the desired format
    for word in seg:
        # a marker is 1, 3 or 4 followed by exactly one character
        if re.fullmatch("[134].", word) is None:
            # this word does not fit the format, so it's the substring
            return word
    # every word looked like a marker
    return None

#list of strings
x_list = ["1x 80 3x 4x","4x 3x 80 1x"]
for x in x_list:
    print(find_substr(x))

Changed the code up a little to be more readable. I didn't know if this is what you wanted, but it works. Let me know if you have any questions. I would be happy to help.

for x in x_list:
    print("raw             " + x)

    # splits the string into a list, separating on whitespace (e.g. ['1x', '80', '3x', '4x'])
    y = x.split()

    # a is the substring that you are checking for in the list
    a = '80'

    # fall back to '0' when the substring is not in the list
    number = a if a in y else '0'

    print("clean           " + number)
