Using regex in Python to extract a string and a number from strings in multiple formats?

Question

I am trying to parse a string in a particular format using regex to get details out of it. The string can be in one of two formats:

First format

One form is foldername-version.tgz. Here foldername can be any string in any format; it can contain one or more additional - characters or anything else.

For example:

hello-1234.tgz: This should give me FolderName as hello and Version as 1234
world-12345.tgz: This should give me FolderName as world and Version as 12345
hello-21234-12345.tgz: This should give me FolderName as hello-21234 and Version as 12345
hello-21234-a-12345.tgz: This should give me FolderName as hello-21234-a and Version as 12345

Second format

The other form is foldername-version-environment.tgz. Here too, foldername can be any string in any format. The environment string can only be dev, stage, or prod and nothing else, so I need to check for that as well.

For example:

hello-1234-dev.tgz: This should give me FolderName as hello and Version as 1234
world-12345-stage.tgz: This should give me FolderName as world and Version as 12345
hello-21234-12345-prod.tgz: This should give me FolderName as hello-21234 and Version as 12345
hello-21234-a-12345-prod.tgz: This should give me FolderName as hello-21234-a and Version as 12345

Problem Statement

So for both of the formats above, I need to extract FolderName and Version from my string. I tried the regex below, but it doesn't work on strings in the second format, and I want my code to work on both formats.

import re

# sample string, which can be in the first or second format
exampleString = "hello-21234-12345-prod.tgz"
build_found = re.search(r'[\d.-]+.tgz', exampleString)
version = build_found.group().replace(".tgz", "")
folderName = exampleString.split(version)[0]

What am I doing wrong here?

Answer 1

I would use:

import re

inp = "some text hello-21234-a-12345.tgz some more text"
# group 1: folder name (may itself contain "-"); group 2: version digits
parts = re.findall(r'\b([^\s-]+(?:-[^-]+)*)-(\d+)(?:-[^-]+)*\.\w+\b', inp)
print("FolderName: " + parts[0][0])
print("Version: " + parts[0][1])

This prints:

FolderName: hello-21234-a
Version: 12345

Answer 2

You need to use a regular expression that captures the components you're looking for within the string, then use .groups() to extract the captures. This worked in my testing:

re.search(r'^(.+)-(\d+)\D*$', exampleString)

Example in ipython:

In [1]: import re

In [2]: s1 = 'hello-21234-12345-prod.tgz'

In [3]: s2 = 'hello-1234.tgz'

In [4]: re.search(r'^(.+)-(\d+)\D*$', s1).groups()
Out[4]: ('hello-21234', '12345')

In [5]: re.search(r'^(.+)-(\d+)\D*$', s2).groups()
Out[5]: ('hello', '1234')

The trick is the capture groups ( (...) ) within the regular expression r'^(.+)-(\d+)\D*$'. There are two groups, and it is actually easier to decode the pattern by looking at the second capture group first, then the first.

The second part of the regex, r'(\d+)\D*$', matches the final series of \d digits. You know it is the final series of digits because the \D*$ part matches and swallows all non-digit characters up to the end of the string.

The first part of the regex, r'^(.+)-', matches everything before the second part. The group captures everything up to, but not including, the "-" that precedes the version, and that gives you the FolderName.

Note that you'll need something a bit more complex if there are any digit characters in your environment or in the file ending (for example, if you're using bzip2 compression).
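As a minimal sketch of that "bit more complex" variant, the pattern below anchors on an explicit extension and an explicit list of environments instead of relying on \D*. The tar.bz2 extension and the helper name split_name are assumptions made for illustration; neither comes from the question.

```python
import re

# Assumption: the only extensions of interest are .tgz and .tar.bz2.
# The environment is limited to dev/stage/prod, as the question requires.
PATTERN = re.compile(r'^(.+)-(\d+)(?:-(?:dev|stage|prod))?\.(?:tgz|tar\.bz2)$')


def split_name(filename):
    """Return (folder_name, version), or None if the name does not fit."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    return m.group(1), m.group(2)


if __name__ == "__main__":
    for name in ("hello-21234-12345-prod.tgz",
                 "hello-1234.tgz",
                 "hello-21234-12345-prod.tar.bz2",
                 "hello-1234-qa.tgz"):
        print(name, "->", split_name(name))
```

Because the extension is spelled out, the digit in bz2 can no longer be mistaken for part of the version, and an unknown environment such as qa makes the whole match fail rather than being silently swallowed.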

Answer 3

Use groups to specify the different sections of the pattern. You can name them for easier extraction later, too:

import re

pattern = re.compile(r"(?P<FolderName>.+)-(?P<Version>\d+)(?:-(?P<Env>dev|stage|prod))?\.tgz")

# ex was not defined in the original snippet; this value is inferred from the output shown
ex = "hello-21234-12345-prod.tgz"  # With environment
m = pattern.match(ex)
print(m.groups())
# ('hello-21234', '12345', 'prod')
print(m.group('FolderName'), m.group('Version'), m.group('Env'))
# hello-21234 12345 prod

ex2 = "hello-21234-1234.tgz"  # No environment
m = pattern.match(ex2)
print(m.groups())
# ('hello-21234', '1234', None)
print(m.group('FolderName'), m.group('Version'), m.group('Env'))
# hello-21234 1234 None
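
To check the named-group pattern against every sample from the question, a small loop like the following may help. The helper name parse and the use of fullmatch (so that any trailing text is rejected) are choices of this sketch, not part of the answer above.

```python
import re

PATTERN = re.compile(
    r"(?P<FolderName>.+)-(?P<Version>\d+)(?:-(?P<Env>dev|stage|prod))?\.tgz"
)


def parse(filename):
    """Return the named groups as a dict, or None if the name does not fit."""
    m = PATTERN.fullmatch(filename)
    return m.groupdict() if m else None


# The eight sample names listed in the question.
samples = [
    "hello-1234.tgz",
    "world-12345.tgz",
    "hello-21234-12345.tgz",
    "hello-21234-a-12345.tgz",
    "hello-1234-dev.tgz",
    "world-12345-stage.tgz",
    "hello-21234-12345-prod.tgz",
    "hello-21234-a-12345-prod.tgz",
]

if __name__ == "__main__":
    for name in samples:
        print(name, "->", parse(name))
```

Env is None for names in the first format, so callers can tell the two formats apart without a second regex.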

Answer 4

See if this pattern works:

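import re

exampleString = 'hello-21234-12345-prod.tgz'
# The environment suffix, including its leading "-", is optional so that
# names in the first format (e.g. hello-1234.tgz) also match.
build_found = re.search(r'([\w-]+)-(\d+)(?:-(dev|stage|prod))?\.tgz$', exampleString)

folder_name = build_found[1]
version = build_found[2]
environment = build_found[3]  # None when the name has no environment

print(folder_name)
print(version)
print(environment)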

Output

hello-21234
12345
prod

Answer 5

Surely not the best approach, but here's one idea.

Start by determining whether you have the first or the second case.

-(dev|stage|prod)\.tgz$

This regex determines which case you have: if it matches, it's case 2; otherwise it's case 1.

If it's case 1, you can extract the foldername (the match includes the trailing -) with:

.*-

And you can extract the version with:

-\d+\.tgz$

If it's case 2, you can extract the combined foldername/version number with:

.*-

From there, you can extract the foldername with (again):

.*-

And the version number with:

-\d+
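
The two-step idea above can be sketched in Python roughly as follows. This is a variation rather than a literal transcription: it strips the matched suffix by slicing instead of re-applying .*-, and the names parse_two_step, ENV_SUFFIX and VERSION_SUFFIX are hypothetical.

```python
import re

# Step 1: does the name end in a known environment (case 2)?
ENV_SUFFIX = re.compile(r'-(dev|stage|prod)\.tgz$')
# Step 2: the version is the run of digits after the last "-" of what remains.
VERSION_SUFFIX = re.compile(r'-(\d+)$')


def parse_two_step(filename):
    """Return (folder_name, version, environment), or None if nothing fits."""
    env_match = ENV_SUFFIX.search(filename)
    if env_match:
        # Case 2: drop "-<env>.tgz" and remember the environment.
        stem = filename[:env_match.start()]
        env = env_match.group(1)
    elif filename.endswith('.tgz'):
        # Case 1: drop only the extension.
        stem = filename[:-len('.tgz')]
        env = None
    else:
        return None

    version_match = VERSION_SUFFIX.search(stem)
    if version_match is None:
        return None

    folder_name = stem[:version_match.start()]
    if not folder_name:
        return None
    return folder_name, version_match.group(1), env


if __name__ == "__main__":
    print(parse_two_step('hello-21234-a-12345-prod.tgz'))
    print(parse_two_step('world-12345.tgz'))
```

Splitting the work this way keeps each regex short, at the cost of a little more code than the single-pattern answers.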

5 answers

Answer 1: 1 (accepted), 2020-09-02 22:37:13
Answer 2: 0, 2020-09-02 22:34:25
Answer 3: 0, 2020-09-02 22:39:17
Answer 4: 0, 2020-09-02 22:39:55
Answer 5: 0, 2020-09-02 22:41:22
