
Extract string and number from a string which is in multiple formats using regex in Python?

I am trying to parse a string in a particular format using regex to get details out of it. My string can be in one of two formats.

First format

One way is to have foldername-version.tgz. Here foldername can be any string in any format. It can contain one or more additional - characters, or anything else.

For example:

  • hello-1234.tgz: This should give me FolderName as hello and Version as 1234
  • world-12345.tgz: This should give me FolderName as world and Version as 12345
  • hello-21234-12345.tgz: This should give me FolderName as hello-21234 and Version as 12345
  • hello-21234-a-12345.tgz: This should give me FolderName as hello-21234-a and Version as 12345

Second format

The other way is to have foldername-version-environment.tgz. In this case, too, foldername can be any string in any format. The environment string can only be dev, stage, or prod and nothing else, so I need to check for that as well.

For example:

  • hello-1234-dev.tgz: This should give me FolderName as hello and Version as 1234
  • world-12345-stage.tgz: This should give me FolderName as world and Version as 12345
  • hello-21234-12345-prod.tgz: This should give me FolderName as hello-21234 and Version as 12345
  • hello-21234-a-12345-prod.tgz: This should give me FolderName as hello-21234-a and Version as 12345

Problem Statement

Given the two formats above, I need to extract FolderName and Version from my string. I tried the regex below, but it doesn't work on strings in the second format, and I want my code to work with both formats.

import re

# Sample string, which can be in either the first or the second format
exampleString = "hello-21234-12345-prod.tgz"
build_found = re.search(r'[\d.-]+.tgz', exampleString)
version = build_found.group().replace(".tgz", "")
folderName = exampleString.split(version)[0]

What am I doing wrong here?

I would use:

import re

inp = "some text hello-21234-a-12345.tgz some more text"
parts = re.findall(r'\b([^\s-]+(?:-[^-]+)*)-(\d+)(?:-[^-]+)*\.\w+\b', inp)
print("FolderName: " + parts[0][0])
print("Version: " + parts[0][1])

This prints:

FolderName: hello-21234-a
Version: 12345

You need to use a regular expression that captures the components you're looking for within the string, then use .groups() to extract the captures. This worked in my testing:

re.search(r'^(.+)-(\d+)\D*$', exampleString)

Example in IPython:

In [1]: import re

In [2]: s1 = 'hello-21234-12345-prod.tgz'

In [3]: s2 = 'hello-1234.tgz'

In [4]: re.search(r'^(.+)-(\d+)\D*$', s1).groups()
Out[4]: ('hello-21234', '12345')

In [5]: re.search(r'^(.+)-(\d+)\D*$', s2).groups()
Out[5]: ('hello', '1234')

The trick is the capture groups ( (...) ) within the regular expression r'^(.+)-(\d+)\D*$'. There are two groups, and it's actually easier to decode the pattern by looking at the second capture group first, then the first.

The second part of the regex, r'(\d+)\D*$', matches the final series of \d digits. You know it is the final series of digits because the \D*$ part matches and swallows all non-digit characters up to the end of the string.

The first part of the regex, r'^(.+)-', matches everything before the second part. It captures everything up to, but not including, the "-" that precedes the version, which gives you the FolderName.

Note that you'll need something a bit more complex if you have any digit characters in your environment or in the file ending (such as if you're using bzip2 compression).
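As a rough sketch of what "a bit more complex" could look like, the pattern below no longer relies on \D*$. It assumes the version is the last "-digits" group that is followed by an optional alphanumeric environment and one or more dot-separated extensions. The name split_loose and the example file names are made up for illustration, and a purely numeric environment would still be ambiguous.

```python
import re

# Assumption: "<folder>-<digits>[-<alphanumeric env>].<ext>[.<ext>...]"
LOOSE = re.compile(
    r"^(?P<folder>.+)-(?P<version>\d+)"  # greedy folder, then the version digits
    r"(?:-[A-Za-z0-9]+)?"                # optional environment, digits allowed
    r"(?:\.[A-Za-z0-9]+)+$"              # one or more extensions, digits allowed
)

def split_loose(filename):
    """Return (folder, version), or None if the name does not fit the assumed shape."""
    m = LOOSE.match(filename)
    return (m.group("folder"), m.group("version")) if m else None

# The \D*$ pattern finds no match because of the "2" in "bz2":
print(re.search(r'^(.+)-(\d+)\D*$', "hello-1234.tar.bz2"))  # None
print(split_loose("hello-1234.tar.bz2"))       # ('hello', '1234')
print(split_loose("hello-1234-env2.tar.bz2"))  # ('hello', '1234')
```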

Use groups to specify the different sections of the pattern. You can name them for easier extraction later, too:

import re

pattern = re.compile(r"(?P<FolderName>.+)-(?P<Version>\d+)(?:-(?P<Env>dev|stage|prod))?\.tgz")

ex = "hello-21234-12345-prod.tgz"
m = pattern.match(ex)
print(m.groups())
# ('hello-21234', '12345', 'prod')
print(m.group('FolderName'), m.group('Version'), m.group('Env'))
# hello-21234 12345 prod

ex2 = "hello-21234-1234.tgz" # No environment
m = pattern.match(ex2)
print(m.groups())
# ('hello-21234', '1234', None)
print(m.group('FolderName'), m.group('Version'), m.group('Env'))
# hello-21234 1234 None
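To check the named-group pattern against every example in the question, it can be wrapped in a small helper. This is only a sketch: the function name parse_archive_name is hypothetical, and using fullmatch (so that trailing text or an unknown environment is rejected) is an assumption about the desired behaviour rather than something the question states.

```python
import re

PATTERN = re.compile(
    r"(?P<FolderName>.+)-(?P<Version>\d+)(?:-(?P<Env>dev|stage|prod))?\.tgz"
)

def parse_archive_name(name):
    """Return a dict with FolderName, Version and Env, or None if nothing matches."""
    m = PATTERN.fullmatch(name)  # the whole string must fit the pattern
    return m.groupdict() if m else None

examples = [
    "hello-1234.tgz",
    "world-12345.tgz",
    "hello-21234-12345.tgz",
    "hello-21234-a-12345.tgz",
    "hello-1234-dev.tgz",
    "world-12345-stage.tgz",
    "hello-21234-12345-prod.tgz",
    "hello-21234-a-12345-prod.tgz",
]

for name in examples:
    print(name, parse_archive_name(name))
```

Because Env is optional, it comes back as None for names in the first format, and a name with any other environment (for example a hypothetical hello-1234-qa.tgz) yields None for the whole result.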

See if this pattern works:

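import re
exampleString = 'hello-21234-12345-prod.tgz'
# The environment, including its leading "-", is optional so that the first format matches too
build_found = re.search(r'([\w-]+)-(\d+)(?:-(dev|stage|prod))?\.tgz$', exampleString)

folder_name = build_found[1]
version = build_found[2]
environment = build_found[3]  # None when the name has no environment

print(folder_name)
print(version)
print(environment)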

Output:

hello-21234
12345
prod

Surely not the best approach, but here's one idea.

Start by determining whether you have the first or second case.

-(dev|stage|prod)\.tgz$

This regex will determine whether you have case 1 or case 2: if it matches, it's case 2; otherwise it's case 1.

If it's case 1, you can extract the foldername with:

.*-

And you can extract the version with:

-\d+\.tgz$

If it's case 2, you can extract the combined foldername/versionnumber with:

.*-

From there, you can extract the foldername with (again):

.*-

And the version number with:

-\d+
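One possible reading of this two-step idea in Python is sketched below. It is not the answerer's code: the helper name split_two_step is invented, and stripping the suffix before splitting the remainder at the last "-digits" is an assumption about how the individual patterns are meant to be combined.

```python
import re

ENV_SUFFIX = re.compile(r"-(dev|stage|prod)\.tgz$")  # case 2
PLAIN_SUFFIX = re.compile(r"\.tgz$")                 # case 1
STEM = re.compile(r"(.*)-(\d+)")                     # foldername, then version

def split_two_step(filename):
    """Return (foldername, version), or None if the name fits neither case."""
    # Step 1: decide which case applies and strip the matching suffix.
    if ENV_SUFFIX.search(filename):
        stem = ENV_SUFFIX.sub("", filename)
    elif PLAIN_SUFFIX.search(filename):
        stem = PLAIN_SUFFIX.sub("", filename)
    else:
        return None
    # Step 2: the greedy (.*) leaves only the last "-digits" for the version.
    m = STEM.fullmatch(stem)
    return m.groups() if m else None

print(split_two_step("hello-21234-a-12345-prod.tgz"))  # ('hello-21234-a', '12345')
print(split_two_step("hello-1234.tgz"))                # ('hello', '1234')
```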
