I'm trying to split a string to extract certain pieces from it. The string I have is:
s='conf/icdcs/BarbaraGS86|conf/icdcs/ShethL86|conf/icde/BhargavaMRS89|conf/icde/BhargavaNS88|conf/icde/BhargavaR88|conf/icde/ElmagarmidH88|conf/infocom/BadalM84|conf/sigmod/Skeen81|conf/sosp/PresottoM83|conf/vldb/Gray81|journals/cacm/EswarranGLT76|journals/cacm/Lamport78|journals/computer/Alexandridis86|journals/computer/Goguen86|journals/computer/KartashevK86|journals/csur/BernsteinG81|journals/csur/DavidsonG85|journals/csur/Kohler81|journals/jacm/Papadimitriou79b|journals/tc/Avizinis76|journals/tc/Garcia-Molina82|journals/tocs/BirrelN84|journals/tocs/CheritonZ85|journals/tocs/Reed83|journals/tods/Herlihy87|journals/tods/KungR81|journals/tse/BhargavaR89|journals/tse/BlackHJLC87|journals/tse/Randell75|journals/tse/SkeenS83'
I want to extract the substring that follows 'conf', between the two forward slashes. For example, given these entries:
conf/icdcs/ShethL86
conf/icde/BhargavaMRS89
conf/icde/BhargavaNS88
Thus for the above strings I want to extract:
icdcs
icde
icde
I have managed to write the following code to extract the required value:
def find_between(s, start, end):
    return (s.split(start))[1].split(end)[0]
start = 'conf/'
end = '/'
res=find_between(s,start,end)
But it only extracts the first match. I want to extract all such substrings from the string, preferably into a list.
split() is your friend. If you know you always want what comes after conf/, then split the string on that first.
print(s.split('conf/'))
# ['', 'icdcs/BarbaraGS86|',
# 'icdcs/ShethL86|',
# 'icde/BhargavaMRS89|',
# 'icde/BhargavaNS88|',
# 'icde/BhargavaR88|',
# 'icde/ElmagarmidH88|',
# 'infocom/BadalM84|',
# 'sigmod/Skeen81|',
# 'sosp/PresottoM83|',
# 'vldb/Gray81|journals/cacm/EswarranGLT76|journals/cacm/Lamport78|journals/computer/Alexandridis86|journals/computer/Goguen86|journals/computer/KartashevK86|journals/csur/BernsteinG81|journals/csur/DavidsonG85|journals/csur/Kohler81|journals/jacm/Papadimitriou79b|journals/tc/Avizinis76|journals/tc/Garcia-Molina82|journals/tocs/BirrelN84|journals/tocs/CheritonZ85|journals/tocs/Reed83|journals/tods/Herlihy87|journals/tods/KungR81|journals/tse/BhargavaR89|journals/tse/BlackHJLC87|journals/tse/Randell75|journals/tse/SkeenS83']
Then you can split the resulting strings on the next / and take the first part of each item.
confs = [i.split('/')[0] for i in s.split('conf/') if i.strip()]
print(confs)
# ['icdcs', 'icdcs', 'icde', 'icde', 'icde', 'icde', 'infocom', 'sigmod', 'sosp', 'vldb']
If you just want the unique values, you can use set() to remove duplicates.
print(set(confs))
# {'vldb', 'sigmod', 'icdcs', 'sosp', 'icde', 'infocom'}
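Note that a set does not keep the original order, as the output above shows. If the order of first appearance matters, one possible sketch (assuming Python 3.7+, where dicts preserve insertion order) is to deduplicate with dict.fromkeys. The name unique_in_order is only illustrative.

```python
# Same values as the confs list built above, repeated here so the snippet runs on its own.
confs = ['icdcs', 'icdcs', 'icde', 'icde', 'icde', 'icde',
         'infocom', 'sigmod', 'sosp', 'vldb']

# dict keys are unique and keep insertion order, so this drops
# duplicates while preserving the order of first appearance.
unique_in_order = list(dict.fromkeys(confs))
print(unique_in_order)
# ['icdcs', 'icde', 'infocom', 'sigmod', 'sosp', 'vldb']
```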
I see a bunch of the other answers are splitting on |, which is fine, but this does create more items in the list to iterate through than seems necessary given your input. Splitting on conf/ guarantees that each item has something of value. You just take the first part of each and you're on your way.
Using regex: re.findall with a lookbehind and a lookahead.
Ex:
import re
s='conf/icdcs/BarbaraGS86|conf/icdcs/ShethL86|conf/icde/BhargavaMRS89|conf/icde/BhargavaNS88|conf/icde/BhargavaR88|conf/icde/ElmagarmidH88|conf/infocom/BadalM84|conf/sigmod/Skeen81|conf/sosp/PresottoM83|conf/vldb/Gray81|journals/cacm/EswarranGLT76|journals/cacm/Lamport78|journals/computer/Alexandridis86|journals/computer/Goguen86|journals/computer/KartashevK86|journals/csur/BernsteinG81|journals/csur/DavidsonG85|journals/csur/Kohler81|journals/jacm/Papadimitriou79b|journals/tc/Avizinis76|journals/tc/Garcia-Molina82|journals/tocs/BirrelN84|journals/tocs/CheritonZ85|journals/tocs/Reed83|journals/tods/Herlihy87|journals/tods/KungR81|journals/tse/BhargavaR89|journals/tse/BlackHJLC87|journals/tse/Randell75|journals/tse/SkeenS83'
start = 'conf/'
end = '/'
print(re.findall(r"(?<={}).*?(?={})".format(re.escape(start),re.escape(end)), s))
Output:
['icdcs', 'icdcs', 'icde', 'icde', 'icde', 'icde', 'infocom', 'sigmod', 'sosp', 'vldb']
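The lookbehind pattern above matches conf/ wherever it appears in the string. If you would rather only match records that begin with conf/, a possible variant is to anchor on the start of the string or a preceding | and use a capture group. This is a sketch on a shortened sample; the names sample, pattern and venues are made up for illustration.

```python
import re

# Shortened sample in the same format as the string in the question.
sample = 'conf/icdcs/ShethL86|conf/icde/BhargavaNS88|journals/cacm/Lamport78'

# (?:^|\|)  -> 'conf/' must sit at the start of a record
# ([^/|]+)  -> capture everything up to the next slash
pattern = re.compile(r'(?:^|\|)conf/([^/|]+)/')

# With a single capture group, findall returns just the captured text.
venues = pattern.findall(sample)
print(venues)
# ['icdcs', 'icde']
```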
Just use split:
prefix = 'conf/'
substrings = [p.split('/')[1] for p in s.split('|') if p.startswith(prefix)]
print(substrings)
Your code only extracts once because you keep only one element of the list returned by split(start):
s.split(start)
['', 'icdcs/BarbaraGS86|', 'icdcs/ShethL86|', 'icde/BhargavaMRS89|', 'icde/BhargavaNS88|', 'icde/BhargavaR88|', 'icde/ElmagarmidH88|', 'infocom/BadalM84|', 'sigmod/Skeen81|', 'sosp/PresottoM83|', 'vldb/Gray81|journals/cacm/EswarranGLT76|journals/cacm/Lamport78|journals/computer/Alexandridis86|journals/computer/Goguen86|journals/computer/KartashevK86|journals/csur/BernsteinG81|journals/csur/DavidsonG85|journals/csur/Kohler81|journals/jacm/Papadimitriou79b|journals/tc/Avizinis76|journals/tc/Garcia-Molina82|journals/tocs/BirrelN84|journals/tocs/CheritonZ85|journals/tocs/Reed83|journals/tods/Herlihy87|journals/tods/KungR81|journals/tse/BhargavaR89|journals/tse/BlackHJLC87|journals/tse/Randell75|journals/tse/SkeenS83']
By choosing just split(start)[1], you are only getting 'icdcs/BarbaraGS86|'. So your final logic is sound; you just need to pick out all the rest of the results. A list comprehension should work perfectly:
[x.split(end)[0] for x in s.split(start) if x]
This will iterate over all of your results. However, the last chunk still carries all the trailing journals/ entries, which don't have conf in them. You could drop that chunk with a list slice like
# [:-1] keeps everything except the last chunk (which here also holds vldb/Gray81)
s.split('conf/')[:-1]
Or you could explicitly filter them out like:
[x for x in s.split('|') if x.startswith('conf/')]
I think the latter is a bit more robust and more readable, since for a general application of this logic, I wouldn't guarantee the location of bad results, and the slice could remove things you actually want.
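To make the robustness point concrete, here is a small sketch on a made-up string (mixed is hypothetical, not from the question) where the conf/ entries are not grouped at the front. The startswith filter still picks the right records, while a positional slice such as [:-1] keeps a journals chunk and loses a conf one.

```python
# Hypothetical input: conf/ and journals/ records are interleaved.
mixed = 'journals/cacm/Lamport78|conf/vldb/Gray81|journals/tse/Randell75|conf/sosp/PresottoM83'

# Filtering on the prefix does not depend on where the records sit.
filtered = [x for x in mixed.split('|') if x.startswith('conf/')]
print(filtered)
# ['conf/vldb/Gray81', 'conf/sosp/PresottoM83']

# Slicing by position keeps the leading journals chunk
# and drops the final conf/sosp record.
sliced = mixed.split('conf/')[:-1]
print(sliced)
# ['journals/cacm/Lamport78|', 'vldb/Gray81|journals/tse/Randell75|']
```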
So in total, your function could look like:
def find_between(s, start, end):
    for x in s.split('|'):
        if x.startswith(start):
            # yield here will allow you to iterate
            # over the function
            yield x.split(start)[1].split(end)[0]
s='conf/icdcs/BarbaraGS86|conf/icdcs/ShethL86|conf/icde/BhargavaMRS89|conf/icde/BhargavaNS88|conf/icde/BhargavaR88|conf/icde/ElmagarmidH88|conf/infocom/BadalM84|conf/sigmod/Skeen81|conf/sosp/PresottoM83|conf/vldb/Gray81|journals/cacm/EswarranGLT76|journals/cacm/Lamport78|journals/computer/Alexandridis86|journals/computer/Goguen86|journals/computer/KartashevK86|journals/csur/BernsteinG81|journals/csur/DavidsonG85|journals/csur/Kohler81|journals/jacm/Papadimitriou79b|journals/tc/Avizinis76|journals/tc/Garcia-Molina82|journals/tocs/BirrelN84|journals/tocs/CheritonZ85|journals/tocs/Reed83|journals/tods/Herlihy87|journals/tods/KungR81|journals/tse/BhargavaR89|journals/tse/BlackHJLC87|journals/tse/Randell75|journals/tse/SkeenS83'
start = 'conf/'
end = '/'
a = [x for x in find_between(s, start, end)]
# ['icdcs', 'icdcs', 'icde', 'icde', 'icde', 'icde', 'infocom', 'sigmod', 'sosp', 'vldb']
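Because the function is a generator, you don't have to build the full list if you don't need it. The sketch below is a slightly generalized variant, not the code above: the sep parameter and the sample string are assumptions added for illustration.

```python
def find_between(s, start, end, sep='|'):
    """Yield the text between `start` and `end` for each record that begins with `start`.

    `sep` (the record separator) is a hypothetical extra parameter.
    """
    for record in s.split(sep):
        if record.startswith(start):
            # Strip the prefix, then keep what precedes the next `end`.
            yield record[len(start):].split(end)[0]

# Shortened sample in the same format as the question's string.
sample = 'conf/icdcs/ShethL86|conf/icde/BhargavaNS88|conf/icde/BhargavaR88|journals/cacm/Lamport78'

# Take only the first match without scanning the rest.
first = next(find_between(sample, 'conf/', '/'))
print(first)
# icdcs

# Feed the generator straight into set()/sorted() for unique, sorted values.
unique_sorted = sorted(set(find_between(sample, 'conf/', '/')))
print(unique_sorted)
# ['icdcs', 'icde']
```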