简体   繁体   中英

Python split before a certain character

I have following string:

BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6

I am trying to split it in a way I would get back the following dict / other data structure:

BUCKET1 -> /dir1/dir2/, BUCKET1 -> /dir3/dir4/, BUCKET2 -> /dir5/dir6/

I can somehow split it if I only have one BUCKET, not multiple, like this:

res.split(res.split(':', 1)[0].replace('.', '').upper()) -> it's not perfect 

Input: ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/

Output: [(ADRIAN, /dir1/dir11), (DANIEL, /dir2/), (CULEA, /dir3/), (ADRIAN, /dir5/), (ADRIAN, /dir6/)


As per Wiktor Stribiżew comments, the following regex does the job:

 r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"

Use re.findall() function:

s = "ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/"
result = re.findall(r'(\w+):([^:]+\/)', s)

print(result)

The output:

[('ADRIAN', '/dir1/dir11/'), ('DANIEL', '/dir2/'), ('ADI_BUCKET', '/dir3/'), ('CULEA', '/dir4/'), ('ADRIAN', '/dir5/'), ('ADRIAN', '/dir6/')]

If you're experienced, I'd recommend learning Regex just as the others have suggested. However, if you're looking for an alternative, here's a way of doing such without Regex. It also produces the output you're looking for.

string = input("Enter:") #Put your own input here.

tempList = string.replace("BUCKET",':').split(":")
outputList = []
for i in range(1,len(tempList)-1,2):
    someTuple = ("BUCKET"+tempList[i],tempList[i+1])
    outputList.append(someTuple)

print(outputList) #Put your own output here.

This will produce:

[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]

This code is hopefully easier to understand and manipulate if you're unfamiliar with Regex, although I'd still personally recommend Regex to solve this if you're familiar with how to use it.

Use regex instead?

impore re
test = 'BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6'

output = re.findall(r'(?P<bucket>[A-Z0-9]+):(?P<path>[/a-z0-9]+)', test)
print(output)

Which gives

[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]

It appears you have a list of predefined "buckets" that you want to use as boundaries for the records inside the string.

That means, the easiest way to match these key-value pairs is by matching one of the buckets, then a colon and then any chars not starting a sequence of chars equal to those bucket names .

You may use

r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"

Compile with re.S / re.DOTALL if your values span across multiple lines. See the regex demo .

Details :

  • (BUCKET1|BUCKET2) - capture group one that matches and stores in .group(1) any of the bucket names
  • : - a colon
  • (.*?) - any 0+ chars, as few as possible (as *? is a lazy quantifier), up to the first occurrence of (but not inlcuding)...
  • (?=(?:BUCKET1|BUCKET2)|$) - any of the bucket names or end of string.

Build it dynamically while escaping bucket names (just to play it safe in case those names contain * or + or other special chars):

import re
buckets = ['BUCKET1','BUCKET2']
rx = r"({0}):(.*?)(?=(?:{0})|$)".format("|".join([re.escape(bucket) for bucket in buckets]))
print(rx)
s = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"
print(re.findall(rx, s))
# => (BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)
     [('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]

See the online Python demo .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM