Extract a string and a number from a string that can be in multiple formats, using regex in Python?
I am trying to parse a string in a particular format with a regex to get details out of it. The string can be in one of two formats:
First format
One way is to have foldername-version.tgz. Here foldername can be any string in any format; it can contain one or more - characters, or anything else.
For example:
FolderName as hello and Version as 1234
FolderName as world and Version as 12345
FolderName as hello-21234 and Version as 12345
FolderName as hello-21234-a and Version as 12345
Second format
The other way is to have foldername-version-environment.tgz. In this case too, foldername can be any string in any format. The environment string can only be dev, stage, or prod and nothing else, so I need to check that as well.
For example:
FolderName as hello and Version as 1234
FolderName as world and Version as 12345
FolderName as hello-21234 and Version as 12345
FolderName as hello-21234-a and Version as 12345
Problem Statement
So, given the two formats above, I need to extract FolderName and Version from my string. I tried the regex below, but it doesn't work on strings in the second format, and I want my code to work on both formats.
import re

# Sample string, which can be in the first or the second format
exampleString = "hello-21234-12345-prod.tgz"
build_found = re.search(r'[\d.-]+.tgz', exampleString)
version = build_found.group().replace(".tgz", "")
folderName = exampleString.split(version)[0]
What am I doing wrong here?
I would use:
import re

inp = "some text hello-21234-a-12345.tgz some more text"
parts = re.findall(r'\b([^\s-]+(?:-[^-]+)*)-(\d+)(?:-[^-]+)*\.\w+\b', inp)
print("FolderName: " + parts[0][0])
print("Version: " + parts[0][1])
This prints:
FolderName: hello-21234-a
Version: 12345
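As a quick check, the same pattern can be wrapped in a small helper and run against both formats from the question. This is only a sketch: the name extract_parts is made up for illustration, and note that this pattern does not restrict the environment to dev, stage, or prod.

```python
import re

# Pattern from the answer above: folder name, numeric version,
# optional trailing hyphenated parts, then the file extension.
PATTERN = re.compile(r'\b([^\s-]+(?:-[^-]+)*)-(\d+)(?:-[^-]+)*\.\w+\b')

def extract_parts(text):
    """Hypothetical helper: return (folder_name, version) of the first match, or None."""
    matches = PATTERN.findall(text)
    return matches[0] if matches else None

for name in ("hello-1234.tgz", "hello-21234-a-12345.tgz", "hello-21234-12345-prod.tgz"):
    print(name, "->", extract_parts(name))
# hello-1234.tgz -> ('hello', '1234')
# hello-21234-a-12345.tgz -> ('hello-21234-a', '12345')
# hello-21234-12345-prod.tgz -> ('hello-21234', '12345')
```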
You need a regular expression that captures the components you're looking for within the string, then use .groups() to extract the captures. This worked in my testing:
re.search(r'^(.+)-(\d+)\D*$', exampleString)
Example in IPython:
In [1]: import re
In [2]: s1 = 'hello-21234-12345-prod.tgz'
In [3]: s2 = 'hello-1234.tgz'
In [4]: re.search(r'^(.+)-(\d+)\D*$', s1).groups()
Out[4]: ('hello-21234', '12345')
In [5]: re.search(r'^(.+)-(\d+)\D*$', s2).groups()
Out[5]: ('hello', '1234')
The trick is the capture groups, written as (...), within the regular expression r'^(.+)-(\d+)\D*$'. There are two groups, and it is actually easier to decode the pattern by looking at the second capture group first, then the first.
The second part of the regex, r'(\d+)\D*$', matches the final series of \d digits. You know it is the final series of digits because the \D*$ part matches and swallows all non-digit characters up to the end of the string.
The first part of the regex, r'^(.+)-', matches everything before the second part. Its group captures everything up to, but not including, the "-" character in front of the version, which gives you the FolderName.
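To make the two parts concrete, here is a small sketch that runs the second part on its own and then the full pattern. The sample string is made up for illustration and combines a multi-hyphen folder name with an environment.

```python
import re

s = 'hello-21234-a-12345-prod.tgz'

# Second part alone: the last run of digits, followed only by non-digits up to the end.
version = re.search(r'(\d+)\D*$', s).group(1)

# Full pattern: the greedy (.+) backtracks to the hyphen directly in front of that last run.
folder_name, version_again = re.search(r'^(.+)-(\d+)\D*$', s).groups()

print(version)        # 12345
print(folder_name)    # hello-21234-a
print(version_again)  # 12345
```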
Note that you'll need something a bit more complex if there are any digit characters in your environment or in the file extension (for example, if you're using bzip2 compression).
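One possible way to handle that case is to spell out the allowed endings instead of relying on \D*. The sketch below assumes a set of extensions (tgz, tar.gz, tar.bz2) that is not in the question, which only mentions .tgz; adjust it to whatever your files actually use.

```python
import re

# Assumption for illustration: the accepted extensions listed in the last line.
PATTERN = re.compile(
    r'^(.+)-(\d+)'                  # folder name, then the numeric version
    r'(?:-(?:dev|stage|prod))?'     # optional environment from the question
    r'\.(?:tgz|tar\.gz|tar\.bz2)$'  # explicit extension instead of \D*
)

# The simpler pattern finds nothing here, because of the digit in "bz2".
print(re.search(r'^(.+)-(\d+)\D*$', 'hello-1234.tar.bz2'))  # None

for name in ('hello-1234.tar.bz2', 'hello-21234-12345-prod.tgz'):
    m = PATTERN.search(name)
    print(name, '->', m.groups() if m else None)
# hello-1234.tar.bz2 -> ('hello', '1234')
# hello-21234-12345-prod.tgz -> ('hello-21234', '12345')
```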
Use groups to specify the different sections of the pattern. You can also name them for easier extraction later:
import re

pattern = re.compile(r"(?P<FolderName>.+)-(?P<Version>\d+)(?:-(?P<Env>dev|stage|prod))?\.tgz")

ex = "hello-21234-12345-prod.tgz"  # With environment
m = pattern.match(ex)
print(m.groups())
# ('hello-21234', '12345', 'prod')
print(m.group('FolderName'), m.group('Version'), m.group('Env'))
# hello-21234 12345 prod

ex2 = "hello-21234-1234.tgz"  # No environment
m = pattern.match(ex2)
print(m.groups())
# ('hello-21234', '1234', None)
print(m.group('FolderName'), m.group('Version'), m.group('Env'))
# hello-21234 1234 None
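If the input may not match at all (for example, an environment other than dev, stage, or prod), calling .groups() on the result raises an AttributeError because the match is None. A minimal sketch of a guard follows; the helper name parse_name is hypothetical, and fullmatch is used here so that trailing text after .tgz is rejected as well.

```python
import re

PATTERN = re.compile(
    r"(?P<FolderName>.+)-(?P<Version>\d+)(?:-(?P<Env>dev|stage|prod))?\.tgz"
)

def parse_name(filename):
    """Hypothetical helper: return the named groups as a dict, or None if nothing matches."""
    m = PATTERN.fullmatch(filename)
    return m.groupdict() if m else None

print(parse_name("hello-21234-a-12345.tgz"))
# {'FolderName': 'hello-21234-a', 'Version': '12345', 'Env': None}
print(parse_name("world-12345-stage.tgz"))
# {'FolderName': 'world', 'Version': '12345', 'Env': 'stage'}
print(parse_name("hello-1234-qa.tgz"))
# None
```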
See if this pattern works:
import re
exampleString = 'hello-21234-12345-prod.tgz'
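# The hyphen before the environment sits inside the optional group,
# so names without an environment (first format) match as well.
build_found = re.search(r'([\w-]+)-(\d+)(?:-(dev|stage|prod))?\.tgz', exampleString)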
folder_name = build_found[1]
version = build_found[2]
environment = build_found[3]
print(folder_name)
print(version)
print(environment)
Output:
hello-21234
12345
prod
Surely not the best approach, but here's one idea.
Start by determining whether you have the first or the second case.
-(dev|stage|prod)\.tgz$
This regex determines whether you have case 1 or case 2: it only matches in case 2.
If it's case 1, you can extract the foldername with:
.*-
And you can extract the version with:
-\d+\.tgz$
If it's case 2, you can extract the combined foldername/version number with:
.*-
From there, you can extract the foldername with (again):
.*-
And the version number with:
-\d+
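Put together in Python, the steps above might look like the sketch below. The function name split_name is hypothetical, and the version pattern is slightly adapted (a capture group and a $ anchor) so that only the trailing number is picked up once the suffix has been removed.

```python
import re

ENV_SUFFIX = re.compile(r'-(dev|stage|prod)\.tgz$')

def split_name(filename):
    """Hypothetical helper following the two-case idea; returns (folder_name, version) or None."""
    # Case 2: cut off "-<environment>.tgz"; case 1: cut off ".tgz" only.
    if ENV_SUFFIX.search(filename):
        stem = ENV_SUFFIX.sub('', filename)
    else:
        stem = re.sub(r'\.tgz$', '', filename)
    # ".*-" greedily matches up to the last hyphen; "-(\d+)$" matches the version.
    folder = re.match(r'.*-', stem)
    version = re.search(r'-(\d+)$', stem)
    if folder is None or version is None:
        return None
    return folder.group()[:-1], version.group(1)  # drop the trailing hyphen

print(split_name('hello-1234.tgz'))              # ('hello', '1234')
print(split_name('hello-21234-12345-prod.tgz'))  # ('hello-21234', '12345')
```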