
How to return a list of dictionaries from the match of regex.findall?

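I am working with several hundred documents, and I am writing a function that finds specific words together with their values and returns a list of dictionaries.

I am looking for one specific piece of information (a "city" and the number that refers to it). However, some documents contain a single city, while others may contain twenty or even a hundred, so I need something very generic.

A sample of the text (the parentheses are messed up like this):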

text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'

or

text2 = 'About medium-sized cities such as City: Eger (population was: 32,352). However etc etc'

With this regex I found the string I am looking for:

p = regex.compile(r'(?<=City).(.*?)(?=However)')
m = p.findall(text)

It returns the whole stretch of text as a single-item list:

[' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). ']

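This is where I am stuck, and I don't know how to proceed. Should I use regex.findall or regex.finditer?

Since the number of "cities" varies from document to document, I would like to get a list of dictionaries back. If I run it on text2, I would get: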

d = [{'cities': 'Eger', 'population': '32,352'}] 

If I run it on the first text:

d = [{'cities': 'Szeged', 'population': '104,867'}, {'cities': 'Miskolc', 'population': '109,841'}]

Thank you all for your help!

You can run re.finditer() on the matched text with a regex that uses named capturing groups (named after your keys), and call x.groupdict() on each match to get the dictionary of results:

import re
text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'
p = re.compile(r'City:\s*(.*?)However')
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')
m = p.search(text)
if m:
    print([x.groupdict() for x in p2.finditer(m.group(1))])

# => [{'population': '1,590,316', 'city': 'Budapest'}, {'population': '115,399', 'city': 'Debrecen'}, {'population': '104,867', 'city': 'Szeged'}, {'population': '109,841', 'city': 'Miskolc'}]

See the Python 3 demo online.
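Since the question asks for a function that works across many documents and uses 'cities' as the key, the two-step approach above could be wrapped roughly as follows. This is only a sketch: the name `extract_cities` and the constants `SECTION` and `ENTRY` are made up for illustration, and it assumes every document marks the list with "City:" and ends it with "However", as in the samples.

```python
import re

# Outer pattern: the text between "City:" and "However" (same as p above).
SECTION = re.compile(r'City:\s*(.*?)However')
# Inner pattern: one "Name (... 1,234" entry. The group names become the
# dictionary keys, so the first group is called "cities" to match the question.
ENTRY = re.compile(r'(?P<cities>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')

def extract_cities(text):
    """Hypothetical helper: return one dict per city entry found in text."""
    section = SECTION.search(text)
    if section is None:
        # No "City: ... However" stretch in this document.
        return []
    return [m.groupdict() for m in ENTRY.finditer(section.group(1))]

text2 = 'About medium-sized cities such as City: Eger (population was: 32,352). However etc etc'
print(extract_cities(text2))
# => [{'cities': 'Eger', 'population': '32,352'}]
print(extract_cities('a document without that section'))
# => []
```

Returning an empty list when the section is missing keeps the return type the same for every document, so the caller does not need a special case.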

The second regex, p2, is

(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)

See the regex demo.

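Here,

  • (?P<city>\w+) - group "city": one or more word characters
  • \s*\( - zero or more whitespace characters and a (
  • [^()\d]* - zero or more characters other than (, ) and digits
  • (?P<population>\d[\d,]*) - group "population": a digit followed by zero or more digits and/or commas.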
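The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is just an alternative spelling of p2, and the one-city sample string is made up for the demonstration.

```python
import re

p2 = re.compile(r'''
    (?P<city>\w+)               # group "city": one or more word characters
    \s*\(                       # optional whitespace, then a literal opening parenthesis
    [^()\d]*                    # anything except parentheses and digits
    (?P<population>\d[\d,]*)    # group "population": a digit, then digits and/or commas
''', re.VERBOSE)

m = p2.search('Szeged (population was: 104,867)')
print(m.groupdict())
# => {'city': 'Szeged', 'population': '104,867'}
```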

You could try running the p2 regex on the whole original string (see the demo), but it may overmatch.
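To illustrate the overmatching risk, here is a small sketch with a made-up sentence (the `sample` string is not from the question) that contains an unrelated parenthesis with a number in it. Run on the whole string, p2 picks that up as if it were a city. Restricted to the "City: ... However" stretch, it does not.

```python
import re

p = re.compile(r'City:\s*(.*?)However')
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')

# Made-up sample: the first parenthesis is not a city entry.
sample = 'The state (founded around 895) grew. City: Eger (population was: 32,352). However etc'

# p2 on the whole string: also matches "state (founded around 895".
whole = [m.groupdict() for m in p2.finditer(sample)]

# p2 only on the text captured by p: just the real entry.
section = p.search(sample)
scoped = [m.groupdict() for m in p2.finditer(section.group(1))] if section else []

print(whole)
# => [{'city': 'state', 'population': '895'}, {'city': 'Eger', 'population': '32,352'}]
print(scoped)
# => [{'city': 'Eger', 'population': '32,352'}]
```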

@Wiktor, that is a great answer. Since I spent some time on this too, I am posting my answer anyway.

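import re

# The single-item list returned by the findall call in the question
d = [' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). ']
oo = []
# Every city entry ends with ")", so splitting on it gives one chunk per city
for i in d[0].split(")"):
    jj = re.search("[0-9,]+", i)  # first run of digits/commas: the population
    kk, *xx = i.split()           # first whitespace-separated token: the city name
    if jj:                        # the trailing ". " chunk has no number and is skipped
        oo.append({"cities": kk, "population": jj.group()})
print(oo)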

#Result--> [{'cities': 'Budapest', 'population': '1,590,316'}, {'cities': 'Debrecen', 'population': '115,399'}, {'cities': 'Szeged', 'population': '104,867'}, {'cities': 'Miskolc', 'population': '109,841'}]
