
Regex to find multiple values between two words

I'm trying to retrieve multiple values that sit between two words delimiting a specific part of the text. The pattern for each value is this:

(^\d\d\d\d)\D+

I've tried many regexes without success.

Below is one attempt that does not work, because it returns only the first value.

Livro[\s\S]*?(^\d\d\d\d)\D+[\s\S]*(?=em moeda corrente)

The text I am applying the regex to is shown below. The values I want to retrieve are the four digits at the start of each row (shown in bold in the original post).

UPDATE: I changed the example because sometimes the first four digits don't have the '/dd' after them.


CERTIDAO DE DIVIDA ATIVA Nr:XXXXXXXXX 6A
Inscrigao Pessoa Receita
5588 39783 03 -1SS VARIAVEL
Dispositivo Legal do Principal 03 - Artigos 55, 57, 58, 59, 63, 64, 151 e 153, no subiteém 14.01 da Lista de Servicos e na Tabela 03, inciso Ill, da Lei Complementar n° 12/1994, com alteragé6es dadas pelas Leis Complementares Municipais n° 56/1997, 116/2000, 196/2002, 217/2003, 270/2006, 314/2008, 320/2008, 399/2011 e 502/2015, 538/2017 e artigo 4° da Lei Complementar n° 4124/2000.
Livro: 14 _ Folha: 17583 a Data: 18/04/201 9 - a Doc.--Receita -Origam do Débite Principal. Corregao. “AcréscimoD.A Multa == dures. Total
2016 03 = =ISS VARIAVEL 36,80 6,47 2,16 4,33 20,33 70,09
2016 03 ISS VARIAVEL 116,00 20,38 6,82 13,64 62,74 219,58
2016 03 ISS VARIAVEL 340,00 59,74 19,99 39,97 179,88 639,58
2016 G3 ISS VARIAVEL 246,40 43,29 14,48 28,97 127,46 466,60
2016/10 O03 ISS VARIAVEL 56,00 9,84 3,29 6,59 28,31 104,03
2016/11 03 ISS VARIAVEL 623,84 109,61 36,67 73,35 308,05 1.161 52
2016/12 03 ISS VARIAVEL 20,40 3,58 4,20 2,40 * 9,83 37,41
TOTAL, em moeda corrente, atualizado até: 23/06/2020 2.682,81 A) Atualizag&o Monetaria: artigos 153, paragrafo 1°, 200, | e 209, todos da Lei Complementar Municipal n° 12/94; artigo 4°, da Lei Complementar Municipal n° 1124/2000.


I'm testing here: https://regex101.com/r/tzgGVT/2 (updated)

Thanks in advance ;-)

As an alternative, you could use the regex module from PyPI, whose \G anchor lets you get the values in bold in capture group 1.

(?:^Livro.*(?:\r?\n(?!\d{4}/\d).*)*\r?\n|\G)(\d{4})/\d+.*\r?\n(?=(?:\d{4}/\d.*\r?\n)*.*?\bem moeda corrente\b)

In parts

  • (?: Non-capturing group
    • ^Livro.*(?:\r?\n(?!\d{4}/\d).*)*\r?\n Match the line that starts with Livro, followed by any lines that do not start with 4 digits, / and a digit
    • | Or
    • \G Assert the position at the end of the previous match
  • ) Close the non-capturing group
  • (\d{4}) Capture group 1, match 4 digits
  • /\d+.*\r?\n Match / and 1+ digits followed by the rest of the line
  • (?= Positive lookahead, assert that what is on the right is
    • (?:\d{4}/\d.*\r?\n)* Repeat 0+ times matching a line that starts with 4 digits, / and a digit
    • .*?\bem moeda corrente\b Match a line that contains em moeda corrente
  • ) Close the positive lookahead

Regex demo | Python demo

Example code

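import regex  # third-party module from PyPI; the standard re module has no \G anchor

# s is assumed to hold the text shown in the question
pattern = r"(?:^Livro.*(?:\r?\n(?!\d{4}/\d).*)*\r?\n|\G)(\d{4})/\d+.*\r?\n(?=(?:\d{4}/\d.*\r?\n)*.*?\bem moeda corrente\b)"

# MULTILINE lets ^ match at the start of every line
print(regex.findall(pattern, s, regex.MULTILINE))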

Output

['2016', '2016', '2016', '2016', '2016', '2016', '2016']
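For readers limited to the standard library, whose re module does not support \G, the "continue where the previous match ended" idea can be approximated with Pattern.match(text, pos), which only succeeds when the match begins exactly at pos. The sketch below is a simplification and not the pattern above: it assumes the rows follow the Livro line directly and it stops at the first line that is not a row, instead of checking for em moeda corrente with a lookahead. The sample text and the name contiguous_years are made up for illustration.

```python
import re

# Shortened, made-up sample in the same shape as the question's text.
text = (
    "Livro: 14 _ Folha: 17583\n"
    "2016/6 03 ISS VARIAVEL 36,80\n"
    "2016/7 03 ISS VARIAVEL 116,00\n"
    "TOTAL, em moeda corrente\n"
    "2020/1 not part of the table\n"
)

header = re.compile(r"^Livro.*\n", re.MULTILINE)  # the line that opens the block
row = re.compile(r"(\d{4})/\d+.*\n")              # one table row: year/month ...


def contiguous_years(text):
    """Collect the year of every row that directly follows the Livro line."""
    start = header.search(text)
    if start is None:
        return []
    years = []
    pos = start.end()
    while True:
        # match() anchors at pos, so a gap between rows ends the loop,
        # which is the role \G plays in the regex-module pattern.
        m = row.match(text, pos)
        if m is None:
            break
        years.append(m.group(1))
        pos = m.end()
    return years


print(contiguous_years(text))  # ['2016', '2016']
```

The row after the TOTAL line is skipped because the loop has already stopped at the first non-row line.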

I think the complete pattern you are looking for is a year/month, where the year is four digits and the month is one or two digits and no more, so it is followed by a space. In regular expression form:

import re

found = re.findall(r'(\d\d\d\d)/\d\d? ', text)
print(found)

Outputs:

['2016', '2016', '2016', '2016', '2016', '2016', '2016']
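To see why the trailing space in that pattern matters, here is a small sketch on a made-up, shortened sample (the text variable and its contents are assumptions for illustration, not the full certificate). Without the space, a law reference such as 4124/2000 is picked up as if it were a year/month pair.

```python
import re

# Made-up sample: one law reference plus two table rows.
text = (
    "da Lei Complementar n° 4124/2000.\n"
    "Livro: 14 Folha: 17583\n"
    "2016/6 03 ISS VARIAVEL 36,80\n"
    "2016/10 03 ISS VARIAVEL 56,00\n"
    "TOTAL, em moeda corrente\n"
)

# No trailing space: "4124/20" satisfies the pattern, so 4124 is captured.
loose = re.findall(r"(\d\d\d\d)/\d\d?", text)

# Trailing space: the month must end after one or two digits.
strict = re.findall(r"(\d\d\d\d)/\d\d? ", text)

print(loose)   # ['4124', '2016', '2016']
print(strict)  # ['2016', '2016']
```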

Or, if you want the entire line on which the expression matches, just leave out the parentheses:

found = re.findall(r'\d\d\d\d/\d\d? .*', text)
for line in found: print(line)

Outputs:

2016/6 03 = =ISS VARIAVEL 36,80 6,47 2,16 4,33 20,33 70,09
2016/7 03 ISS VARIAVEL 116,00 20,38 6,82 13,64 62,74 219,58
2016/8 03 ISS VARIAVEL 340,00 59,74 19,99 39,97 179,88 639,58
2016/9 G3 ISS VARIAVEL 246,40 43,29 14,48 28,97 127,46 466,60
2016/10 O03 ISS VARIAVEL 56,00 9,84 3,29 6,59 28,31 104,03
2016/11 03 ISS VARIAVEL 623,84 109,61 36,67 73,35 308,05 1.161 52
2016/12 03 ISS VARIAVEL 20,40 3,58 4,20 2,40 * 9,83 37,41

Or split the text in two at a marker string (e.g. 'Livro') and search the second part for 4 digits immediately after a newline character, that is, at the beginning of a line:

parts = text.split('Livro')
found = re.findall(r'\n(\d\d\d\d)', parts[1])
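A slightly fuller sketch of the same split idea, using only the standard library, is given below. It assumes both markers appear once in the given order, cuts the text at Livro and again at em moeda corrente, and then looks for four digits at the start of each line in between, so it also covers rows without a '/dd' part. The function name years_between and the shortened sample are hypothetical.

```python
import re

# Made-up sample: rows with and without the "/dd" part, plus noise outside the block.
text = (
    "Lei Complementar n° 4124/2000.\n"
    "Livro: 14 _ Folha: 17583\n"
    "2016 03 ISS VARIAVEL 36,80\n"
    "2016/10 03 ISS VARIAVEL 56,00\n"
    "TOTAL, em moeda corrente, atualizado até: 23/06/2020\n"
    "2020 outro texto\n"
)


def years_between(text, start="Livro", end="em moeda corrente"):
    """Return the 4-digit values that begin a line between the two markers."""
    _, found_start, after = text.partition(start)
    if not found_start:
        return []
    section, found_end, _ = after.partition(end)
    if not found_end:
        return []
    # ^ with MULTILINE anchors at every line start; \b rejects longer numbers.
    return re.findall(r"^(\d{4})\b", section, re.MULTILINE)


print(years_between(text))  # ['2016', '2016']
```

Cutting the text first keeps the regex itself short, at the cost of handling only the first occurrence of each marker.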

You need to use the findall() method of the re module. I think the following pattern is the one you're looking for: r'\n(\d+)\/(?=[\s\S]+ em moeda corrente)'

>>> re.findall(r'\n(\d+)\/(?=[\s\S]+ em moeda corrente)', text)
['2016', '2016', '2016', '2016', '2016', '2016', '2016']

Try it out at: https://regex101.com/r/RoOB0t/2
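One property of the lookahead approach, shown here on a made-up sample, is that it bounds the matches only on the right: anything of the form newline, digits, slash that appears before em moeda corrente is returned, including lines above the Livro header. The sample text and the 1999/5 line are assumptions added for illustration.

```python
import re

# Made-up sample with a digits/ line on each side of the block of interest.
text = (
    "CERTIDAO DE DIVIDA ATIVA\n"
    "1999/5 appears before the Livro header\n"
    "Livro: 14 _ Folha: 17583\n"
    "2016/6 03 ISS VARIAVEL 36,80\n"
    "2016/7 03 ISS VARIAVEL 116,00\n"
    "TOTAL, em moeda corrente\n"
    "2020/1 appears after the end marker\n"
)

# The lookahead only requires that " em moeda corrente" occurs somewhere later.
pattern = r"\n(\d+)/(?=[\s\S]+ em moeda corrente)"
found = re.findall(pattern, text)

print(found)  # ['1999', '2016', '2016']
```

If lines like that can occur above the header in real input, cutting the text at Livro first avoids the extra match.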
