
Is there an efficient way to split a string between two strings?

I'm trying to split a string to extract some required pieces from inside it. The string I have is shown below:

s='conf/icdcs/BarbaraGS86|conf/icdcs/ShethL86|conf/icde/BhargavaMRS89|conf/icde/BhargavaNS88|conf/icde/BhargavaR88|conf/icde/ElmagarmidH88|conf/infocom/BadalM84|conf/sigmod/Skeen81|conf/sosp/PresottoM83|conf/vldb/Gray81|journals/cacm/EswarranGLT76|journals/cacm/Lamport78|journals/computer/Alexandridis86|journals/computer/Goguen86|journals/computer/KartashevK86|journals/csur/BernsteinG81|journals/csur/DavidsonG85|journals/csur/Kohler81|journals/jacm/Papadimitriou79b|journals/tc/Avizinis76|journals/tc/Garcia-Molina82|journals/tocs/BirrelN84|journals/tocs/CheritonZ85|journals/tocs/Reed83|journals/tods/Herlihy87|journals/tods/KungR81|journals/tse/BhargavaR89|journals/tse/BlackHJLC87|journals/tse/Randell75|journals/tse/SkeenS83'

I want to extract the substring that follows 'conf', between the two forward slashes. For example, given entries such as:

conf/icdcs/ShethL86
conf/icde/BhargavaMRS89
conf/icde/BhargavaNS88

Thus for the above strings I want to extract:

icdcs
icde
icde

I have managed to write the following code to extract the required value:

def find_between(s, start, end):
    return (s.split(start))[1].split(end)[0]

start = 'conf/'
end = '/'
res=find_between(s,start,end)

But it only extracts a single value. I want to be able to extract all the matching substrings in the string, preferably into a list.

split() is your friend. If you know you always want what comes after conf/, then split the string on that first.

print(s.split('conf/'))
# ['', 'icdcs/BarbaraGS86|',
#  'icdcs/ShethL86|',
#  'icde/BhargavaMRS89|',
#  'icde/BhargavaNS88|',
#  'icde/BhargavaR88|',
#  'icde/ElmagarmidH88|',
#  'infocom/BadalM84|',
#  'sigmod/Skeen81|',
#  'sosp/PresottoM83|',
#  'vldb/Gray81|journals/cacm/EswarranGLT76|journals/cacm/Lamport78|journals/computer/Alexandridis86|journals/computer/Goguen86|journals/computer/KartashevK86|journals/csur/BernsteinG81|journals/csur/DavidsonG85|journals/csur/Kohler81|journals/jacm/Papadimitriou79b|journals/tc/Avizinis76|journals/tc/Garcia-Molina82|journals/tocs/BirrelN84|journals/tocs/CheritonZ85|journals/tocs/Reed83|journals/tods/Herlihy87|journals/tods/KungR81|journals/tse/BhargavaR89|journals/tse/BlackHJLC87|journals/tse/Randell75|journals/tse/SkeenS83']

Then you can split the resulting strings on the next / and take the first part of each item.

confs = [i.split('/')[0] for i in s.split('conf/') if i.strip()]

print(confs)
# ['icdcs', 'icdcs', 'icde', 'icde', 'icde', 'icde', 'infocom', 'sigmod', 'sosp', 'vldb']

If you just want the unique values, you can use set() to remove duplicates.

print(set(confs))
# {'vldb', 'sigmod', 'icdcs', 'sosp', 'icde', 'infocom'}
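Note that set() does not keep the original order. If the order of first appearance matters, here is a minimal sketch, assuming Python 3.7+ (where dict preserves insertion order). The name unique_in_order is only illustrative.

```python
# Conference names as produced by the list comprehension above.
confs = ['icdcs', 'icdcs', 'icde', 'icde', 'icde', 'icde',
         'infocom', 'sigmod', 'sosp', 'vldb']

# dict.fromkeys keeps one key per distinct value, in first-seen order.
unique_in_order = list(dict.fromkeys(confs))

print(unique_in_order)
# ['icdcs', 'icde', 'infocom', 'sigmod', 'sosp', 'vldb']
```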

I see a bunch of the other answers are splitting on | , which is fine, but this does create more items in the list to iterate through than seems necessary given your input. Splitting on conf/ guarantees that each item has something of value. You just take the first part of each and you're on your way.

Using regex: re.findall with a lookbehind and a lookahead.

Example:

import re

s='conf/icdcs/BarbaraGS86|conf/icdcs/ShethL86|conf/icde/BhargavaMRS89|conf/icde/BhargavaNS88|conf/icde/BhargavaR88|conf/icde/ElmagarmidH88|conf/infocom/BadalM84|conf/sigmod/Skeen81|conf/sosp/PresottoM83|conf/vldb/Gray81|journals/cacm/EswarranGLT76|journals/cacm/Lamport78|journals/computer/Alexandridis86|journals/computer/Goguen86|journals/computer/KartashevK86|journals/csur/BernsteinG81|journals/csur/DavidsonG85|journals/csur/Kohler81|journals/jacm/Papadimitriou79b|journals/tc/Avizinis76|journals/tc/Garcia-Molina82|journals/tocs/BirrelN84|journals/tocs/CheritonZ85|journals/tocs/Reed83|journals/tods/Herlihy87|journals/tods/KungR81|journals/tse/BhargavaR89|journals/tse/BlackHJLC87|journals/tse/Randell75|journals/tse/SkeenS83'
start = 'conf/'
end = '/'

print(re.findall(r"(?<={}).*?(?={})".format(re.escape(start),re.escape(end)), s)) 

Output:

['icdcs', 'icdcs', 'icde', 'icde', 'icde', 'icde', 'infocom', 'sigmod', 'sosp', 'vldb']
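As a variation, the pattern can be tied to the start of each record so that a conf/ appearing in the middle of an entry is not picked up. This is only a sketch, assuming records are separated by | and that the wanted name never contains / or |. CONF_RE and the shortened sample string are illustrative.

```python
import re

# Shortened sample in the same format as the question's string.
sample = "conf/icdcs/BarbaraGS86|conf/icde/BhargavaNS88|journals/cacm/Lamport78"

# (?:^|\|)  -> start of the string or a record separator
# conf/     -> the literal prefix
# ([^/|]+)  -> capture everything up to the next '/' (never crossing a '|')
CONF_RE = re.compile(r"(?:^|\|)conf/([^/|]+)/")

names = CONF_RE.findall(sample)
print(names)
# ['icdcs', 'icde']
```

With a single capture group, re.findall returns just the captured text, so no lookarounds are needed.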

Just use split:

prefix = 'conf/'
substrings = [p.split('/')[1] for p in s.split('|') if p.startswith(prefix)]
print(substrings)
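The same idea can be wrapped in a small reusable function. This is a sketch rather than the one right way to do it: conf_names and its parameter names are made up for illustration, and it uses str.partition instead of a second split.

```python
def conf_names(s, prefix="conf/", record_sep="|", field_sep="/"):
    """Return the field right after `prefix` for every record that starts with it."""
    names = []
    for record in s.split(record_sep):
        if record.startswith(prefix):
            rest = record[len(prefix):]           # drop the leading 'conf/'
            name, _, _ = rest.partition(field_sep)  # text before the next '/'
            names.append(name)
    return names


sample = "conf/icdcs/BarbaraGS86|conf/icde/BhargavaNS88|journals/cacm/Lamport78"
print(conf_names(sample))
# ['icdcs', 'icde']
```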

Your code only extracts one value because you are indexing a single element, [1], of the result of split(start):

s.split(start)
['', 'icdcs/BarbaraGS86|', 'icdcs/ShethL86|', 'icde/BhargavaMRS89|', 'icde/BhargavaNS88|', 'icde/BhargavaR88|', 'icde/ElmagarmidH88|', 'infocom/BadalM84|', 'sigmod/Skeen81|', 'sosp/PresottoM83|', 'vldb/Gray81|journals/cacm/EswarranGLT76|journals/cacm/Lamport78|journals/computer/Alexandridis86|journals/computer/Goguen86|journals/computer/KartashevK86|journals/csur/BernsteinG81|journals/csur/DavidsonG85|journals/csur/Kohler81|journals/jacm/Papadimitriou79b|journals/tc/Avizinis76|journals/tc/Garcia-Molina82|journals/tocs/BirrelN84|journals/tocs/CheritonZ85|journals/tocs/Reed83|journals/tods/Herlihy87|journals/tods/KungR81|journals/tse/BhargavaR89|journals/tse/BlackHJLC87|journals/tse/Randell75|journals/tse/SkeenS83']

By choosing just split(start)[1], you are only getting 'icdcs/BarbaraGS86|'. So your final logic is sound; you just need to apply it to all the rest of the results. A list comprehension should work perfectly:

[x.split(end)[0] for x in s.split(start) if x]

This will iterate over all of your results. However, the journals/... entries at the end, which don't have conf in them, stay attached to the last piece. You could drop that piece with a list slice like

# [:-1] grabs everything except the last piece; note that in this
# input the last piece starts with 'vldb/', so 'vldb' is dropped too
s.split('conf/')[:-1]

Or you could physically ignore them like:

[x for x in s.split('|') if x.startswith('conf/')]

I think the latter is a bit more robust and more readable, since for a general application of this logic, I wouldn't guarantee the location of bad results, and the slice could remove things you actually want.
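To make the difference concrete, here is a small comparison on a shortened sample in the same format. The variable names are illustrative only.

```python
sample = ("conf/sosp/PresottoM83|conf/vldb/Gray81|"
          "journals/cacm/Lamport78|journals/tse/SkeenS83")

pieces = sample.split("conf/")
# The journals entries stay glued to the last conf piece:
# ['', 'sosp/PresottoM83|',
#  'vldb/Gray81|journals/cacm/Lamport78|journals/tse/SkeenS83']

# Slicing off the last piece also throws away 'vldb'.
by_slice = [p.split("/")[0] for p in pieces[:-1] if p]

# Filtering whole records keeps every conf entry, wherever it sits.
by_filter = [r.split("/")[1] for r in sample.split("|")
             if r.startswith("conf/")]

print(by_slice)   # ['sosp']
print(by_filter)  # ['sosp', 'vldb']
```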

So in total, your function could look like:

def find_between(s, start, end):
    for x in s.split('|'):
        if x.startswith(start):
            # yield here will allow you to iterate
            # over the function
            yield x.split(start)[1].split(end)[0]

s='conf/icdcs/BarbaraGS86|conf/icdcs/ShethL86|conf/icde/BhargavaMRS89|conf/icde/BhargavaNS88|conf/icde/BhargavaR88|conf/icde/ElmagarmidH88|conf/infocom/BadalM84|conf/sigmod/Skeen81|conf/sosp/PresottoM83|conf/vldb/Gray81|journals/cacm/EswarranGLT76|journals/cacm/Lamport78|journals/computer/Alexandridis86|journals/computer/Goguen86|journals/computer/KartashevK86|journals/csur/BernsteinG81|journals/csur/DavidsonG85|journals/csur/Kohler81|journals/jacm/Papadimitriou79b|journals/tc/Avizinis76|journals/tc/Garcia-Molina82|journals/tocs/BirrelN84|journals/tocs/CheritonZ85|journals/tocs/Reed83|journals/tods/Herlihy87|journals/tods/KungR81|journals/tse/BhargavaR89|journals/tse/BlackHJLC87|journals/tse/Randell75|journals/tse/SkeenS83'

start = 'conf/'
end = '/'

a = [x for x in find_between(s, start, end)]

# ['icdcs', 'icdcs', 'icde', 'icde', 'icde', 'icde', 'infocom', 'sigmod', 'sosp', 'vldb']
