How to return a list of dictionaries from the match of regex.findall?

I'm working with several hundred documents, and I'm writing a function that finds specific words and their values and returns a list of dictionaries.

I'm looking for one specific piece of information ('city' and the number that refers to it). However, some documents have one city and others might have twenty or even one hundred, so I need something very generic.

A text example (the parentheses are messed up like this):

text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'

or

text2 = 'About medium-sized cities such as City: Eger (population was: 32,352). However etc etc'

Using regex, I found the string that I'm looking for:

p = regex.compile(r'(?<=City).(.*?)(?=However)')
m = p.findall(text)

This returns the whole matched segment as a one-element list:

[' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). ']

Now, this is where I'm stuck, and I don't know how to proceed. Should I use regex.findall or regex.finditer?

Considering that the number of 'cities' varies across the documents, I would like to get a list of dictionaries back. If I run it on text2, I would get:

d = [{'cities': 'Eger', 'population': '32,352'}] 

If I run it on the first text:

d = [{'cities': 'Szeged', 'population': '104,867'}, {'cities': 'Miskolc', 'population': '109,841'}]

I really appreciate any help, guys!

You may use re.finditer with a regex that has named capturing groups (named after your keys) on the matched text, then call x.groupdict() on each match to get a dictionary of results:

import re
text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'
p = re.compile(r'City:\s*(.*?)However')
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')
m = p.search(text)
if m:
    print([x.groupdict() for x in p2.finditer(m.group(1))])

# => [{'population': '1,590,316', 'city': 'Budapest'}, {'population': '115,399', 'city': 'Debrecen'}, {'population': '104,867', 'city': 'Szeged'}, {'population': '109,841', 'city': 'Miskolc'}]

See the Python 3 demo online.
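The two-step search above can be wrapped in a function that returns the list of dictionaries the question asks for. This is only a sketch: the name `extract_cities` and the constants `SEGMENT` and `ENTRY` are hypothetical, and the first group is renamed to `cities` (an assumption) so the keys match the question's desired output. A text with no "City: ... However" segment yields an empty list.

```python
import re

# Outer pattern from the answer: the segment between "City:" and "However".
SEGMENT = re.compile(r'City:\s*(.*?)However')
# Inner pattern from the answer, with the first group renamed to "cities"
# (an assumption, to match the keys requested in the question).
ENTRY = re.compile(r'(?P<cities>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')


def extract_cities(text):
    """Return one dict per 'Name (population was: N)' entry, or [] if none."""
    segment = SEGMENT.search(text)
    if segment is None:
        return []
    return [m.groupdict() for m in ENTRY.finditer(segment.group(1))]


text2 = 'About medium-sized cities such as City: Eger (population was: 32,352). However etc etc'
print(extract_cities(text2))                # [{'cities': 'Eger', 'population': '32,352'}]
print(extract_cities('no city list here'))  # []
```

On the findall-versus-finditer question: re.findall with the same two-group pattern returns plain tuples such as ('Eger', '32,352'), so finditer plus groupdict() saves the step of attaching the keys afterwards. Note that `\w+` captures a single word, so a multi-word city name would only yield its last word.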

The second regex, p2, is

(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)

See the regex demo.

Here,

  • (?P<city>\w+) - Group "city": one or more word chars
  • \s*\( - zero or more whitespace chars and a literal (
  • [^()\d]* - zero or more chars other than (, ) and digits
  • (?P<population>\d[\d,]*) - Group "population": a digit followed by zero or more digits and/or commas
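The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows `#` comments. The sketch below is the same p2 pattern, only reformatted; the names `p2_verbose` and `sample` are made up for illustration.

```python
import re

# Same pattern as p2, spelled out with re.VERBOSE so each part carries a comment.
p2_verbose = re.compile(r'''
    (?P<city>\w+)               # group "city": one or more word characters
    \s*\(                       # optional whitespace, then a literal "("
    [^()\d]*                    # anything except parentheses and digits
    (?P<population>\d[\d,]*)    # group "population": a digit, then digits or commas
''', re.VERBOSE)

sample = 'Eger (population was: 32,352)'
match = p2_verbose.search(sample)
print(match.groupdict())  # {'city': 'Eger', 'population': '32,352'}
```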

You might try to run the p2 regex on the whole original string (see demo), but it may overmatch: any other word followed by a parenthesis that contains a number would be picked up too.
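To make the overmatching risk concrete, here is a small sketch using a made-up document (`doc` is hypothetical, not taken from the question) in which an unrelated parenthesis appears before the city list. Running p2 over the whole string picks up a bogus entry, while restricting it to the "City: ... However" segment does not.

```python
import re

p = re.compile(r'City:\s*(.*?)However')
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')

# Hypothetical document: the first parenthesis has nothing to do with the city list.
doc = 'The bridge (opened in 1849) is famous. City: Eger (population was: 32,352). However etc'

# p2 on the whole document versus p2 on the extracted segment only.
whole = [m.groupdict() for m in p2.finditer(doc)]
segment = p.search(doc)
scoped = [m.groupdict() for m in p2.finditer(segment.group(1))] if segment else []

print(whole)   # [{'city': 'bridge', 'population': '1849'}, {'city': 'Eger', 'population': '32,352'}]
print(scoped)  # [{'city': 'Eger', 'population': '32,352'}]
```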

A very good answer by @Wiktor. Since I spent some time on this, I am posting my answer as well.

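import re

d = [' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). ']
oo = []
# Each entry ends with ")", so splitting on it yields one fragment per city.
for i in d[0].split(")"):
    jj = re.search("[0-9,]+", i)  # first run of digits/commas is the population
    words = i.split()             # first word of the fragment is the city name
    # Skip fragments with no number (e.g. the trailing ". ") or no words at all.
    if jj and words:
        oo.append({"cities": words[0], "population": jj.group()})
print(oo)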

#Result--> [{'cities': 'Budapest', 'population': '1,590,316'}, {'cities': 'Debrecen', 'population': '115,399'}, {'cities': 'Szeged', 'population': '104,867'}, {'cities': 'Miskolc', 'population': '109,841'}]
