
Python split before a certain character

I have the following string:

BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6

I am trying to split it so that I get back the following dict (or another data structure):

BUCKET1 -> /dir1/dir2/, BUCKET1 -> /dir3/dir4/, BUCKET2 -> /dir5/dir6/

I can more or less split it when there is only one BUCKET rather than several, like this:

res.split(res.split(':', 1)[0].replace('.', '').upper()) -> it's not perfect 

Input: ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/

Output: [(ADRIAN, /dir1/dir11), (DANIEL, /dir2/), (CULEA, /dir3/), (ADRIAN, /dir5/), (ADRIAN, /dir6/)]


As per Wiktor Stribiżew's comments, the following regex does the job:

 r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"

Use the re.findall() function:

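import re

s = "ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/"
# (\w+) captures the name before the colon; ([^:]+\/) captures the path up to its last slash
result = re.findall(r'(\w+):([^:]+\/)', s)

print(result)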

The output:

[('ADRIAN', '/dir1/dir11/'), ('DANIEL', '/dir2/'), ('ADI_BUCKET', '/dir3/'), ('CULEA', '/dir4/'), ('ADRIAN', '/dir5/'), ('ADRIAN', '/dir6/')]
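The question asks for a dict, but the same name can appear more than once, so a plain dict would keep only the last path for each name. As a minimal sketch (the name paths_by_bucket is made up for illustration), the pairs returned by re.findall() could be grouped into lists like this:

```python
import re
from collections import defaultdict

s = "ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/"

# Same pattern as above; each match is a (name, path) tuple.
pairs = re.findall(r'(\w+):([^:]+\/)', s)

# Collect every path under its name instead of overwriting earlier ones.
paths_by_bucket = defaultdict(list)
for name, path in pairs:
    paths_by_bucket[name].append(path)

print(dict(paths_by_bucket))
```

Whether a list per name is the right shape depends on how the result is consumed; the list of tuples may be enough if order matters more than lookup.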

If you're experienced, I'd recommend learning regex, as the others have suggested. However, if you're looking for an alternative, here's a way to do it without regex. It also produces the output you're looking for.

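string = input("Enter: ")  # e.g. BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6

# Turn every "BUCKET" into ":" so that one split on ":" separates
# the bucket suffixes from the paths: ['', '1', '/dir1/dir2/', '1', ...]
tempList = string.replace("BUCKET", ":").split(":")
outputList = []
# Odd indexes hold the bucket suffix; the next index holds its path.
for i in range(1, len(tempList) - 1, 2):
    someTuple = ("BUCKET" + tempList[i], tempList[i + 1])
    outputList.append(someTuple)

print(outputList)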

This will produce:

[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]

This code is hopefully easier to understand and modify if you're unfamiliar with regex, although I'd still recommend regex for this problem if you know how to use it.

Use regex instead?

import re
test = 'BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6'

output = re.findall(r'(?P<bucket>[A-Z0-9]+):(?P<path>[/a-z0-9]+)', test)
print(output)

Which gives:

[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
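Note that re.findall() returns plain tuples, so the group names bucket and path in the pattern above are not visible in the result. As a rough sketch of one way to keep them, re.finditer() with groupdict() yields one dict per record (the variable names pattern and records are illustrative only):

```python
import re

test = 'BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6'

# Same pattern as above: upper-case/digit names, lower-case/digit/slash paths.
pattern = re.compile(r'(?P<bucket>[A-Z0-9]+):(?P<path>[/a-z0-9]+)')

# groupdict() maps each named group to the text it matched.
records = [m.groupdict() for m in pattern.finditer(test)]
print(records)
```

This pattern relies on names being upper case and paths being lower case. A name containing an underscore, such as ADI_BUCKET from the question, would be cut down to BUCKET, so the character classes may need widening for real data.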

It appears you have a list of predefined "buckets" that you want to use as boundaries for the records inside the string.

That means the easiest way to match these key-value pairs is to match one of the bucket names, then a colon, and then any characters up to the point where another bucket name starts.

You may use:

r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"

Compile with re.S / re.DOTALL if your values span multiple lines. See the regex demo.
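To illustrate why the flag matters, here is a small sketch with a made-up input in which the first value contains a newline. Without re.S the dot cannot cross the line break, so the first record is not matched at all:

```python
import re

rx = r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"

# Hypothetical input: the value of BUCKET1 spans two lines.
s = "BUCKET1:/dir1/\ndir2/BUCKET2:/dir5/dir6"

without_flag = re.findall(rx, s)           # "." stops at the newline
with_flag = re.findall(rx, s, flags=re.S)  # "." also matches the newline

print(without_flag)
print(with_flag)
```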

Details:

  • (BUCKET1|BUCKET2) - capturing group 1, which matches any of the bucket names and stores it in .group(1)
  • : - a colon
  • (.*?) - capturing group 2: any 0+ chars, as few as possible (as *? is a lazy quantifier), up to the first occurrence of (but not including)...
  • (?=(?:BUCKET1|BUCKET2)|$) - any of the bucket names or the end of the string.

Build it dynamically while escaping the bucket names (just to play it safe in case those names contain * or + or other special chars):

import re
buckets = ['BUCKET1','BUCKET2']
rx = r"({0}):(.*?)(?=(?:{0})|$)".format("|".join([re.escape(bucket) for bucket in buckets]))
print(rx)
s = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"
print(re.findall(rx, s))
# => (BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)
# => [('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]

See the online Python demo.
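One possible refinement, offered only as a sketch: the lookahead above stops at any occurrence of a bucket name, so a path that happens to contain one (for example a directory called BUCKET2_old) would be cut short. Requiring the colon inside the lookahead avoids that. The helper name split_buckets and the sample path are assumptions made for this illustration, not part of the original answer:

```python
import re

def split_buckets(s, buckets):
    """Return (bucket, path) pairs; a new record starts only at 'NAME:'."""
    names = "|".join(re.escape(b) for b in buckets)
    # The lookahead now requires the colon after the bucket name.
    rx = r"({0}):(.*?)(?=(?:{0}):|$)".format(names)
    return re.findall(rx, s, flags=re.S)

buckets = ['BUCKET1', 'BUCKET2']

# Input from the question: same result as the pattern above.
print(split_buckets("BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6", buckets))

# Hypothetical input where a directory name contains a bucket name.
print(split_buckets("BUCKET1:/dir1/BUCKET2_old/BUCKET2:/dir5/", buckets))
```

This still assumes that a bucket name followed by a colon never occurs inside a path; if it can, the input format itself is ambiguous and no pattern will recover the records reliably.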
