
Extract Keys and Values from text using regular expressions

I have a large number of strings that I need to parse. These strings contain information stored as key-value pairs.

Sample input text:

Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim: ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim: ad minima veniam, *31.12.2012, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur

Key information:

  • A key starts either at the beginning of the string or after \.
  • A key always ends with :
  • The key is immediately followed by a value
  • This value continues until the next key or until the last symbol in the string
  • There are multiple key-value pairs, which I don't know in advance

Expected Output

{
    "Nemo enim": "ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem",
    
    "Ut enim": "ad minima veniam, *31.12.2012, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur. Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur"
}

The regex that I have so far is ([üöä\w\s]*)\: (.*?)\. Suffice it to say, it doesn't produce the expected output.
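To see where this attempt falls short, here is a minimal sketch that runs it with Python's re.findall. The `text` string is a shortened stand-in for the sample input and the variable names are illustrative assumptions, not part of the original question.

```python
import re

# Shortened stand-in for the sample input (an assumption for illustration).
text = ("Intro text here. Nemo enim: ipsam voluptatem. Neque porro quisquam. "
        "Ut enim: ad minima veniam, *31.12.2012, quis nostrum? Quis autem vel eum")

# The attempt from the question, unchanged.
attempt = re.compile(r"([üöä\w\s]*)\: (.*?)\.")
matches = attempt.findall(text)
for key, value in matches:
    print(repr(key), "->", repr(value))
```

On this stand-in the lazy (.*?)\. stops at the first period it meets, so the first value loses its second sentence and the second value is cut inside the date.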

This regex ([^:.]+):\s*([^:]+)(?=\.\s+|$) does the job.
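A minimal sketch of how this pattern might be applied in Python, assuming a shortened stand-in for the sample text. The `text`, `pattern` and `pairs` names are illustrative and not part of the original answer.

```python
import re

# Shortened stand-in for the sample input (an assumption for illustration).
text = ("Intro text here. Nemo enim: ipsam voluptatem. Neque porro quisquam. "
        "Ut enim: ad minima veniam, *31.12.2012, quis nostrum? Quis autem vel eum")

pattern = re.compile(r"([^:.]+):\s*([^:]+)(?=\.\s+|$)")

# Group 1 picks up the space that follows the preceding period, so strip it.
pairs = {key.strip(): value for key, value in pattern.findall(text)}
for key, value in pairs.items():
    print(f"{key!r}: {value!r}")
```

Because the value group is [^:]+, this approach relies on values never containing a colon themselves.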

Demo & explanation

You can match the following regular expression, which saves the keys and values to capture groups 1 and 2.

r'(?<![^.]) *([^.]+?:) *((?:(?!\. ).)+)'

Start your engine! | Python code

Python's regex engine performs the following operations.

(?<![^.])    : negative lookbehind asserts current location is not
               preceded by a character other than '.'
\ *          : match 0+ spaces
(            : begin capture group 1
  [^.]+?     : match 1+ characters other than '.', lazily
  :          : match ':'
)            : end capture group 1
\ *          : match 0+ spaces
(            : begin capture group 2
  (?:        : begin non-capture group
    (?!\. )  : negative lookahead asserts current position is not
               followed by a period followed by a space
    .        : match any character other than a line terminator
  )+         : end non-capture group and execute 1+ times
)            : end capture group 2
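The breakdown above can be tried out with a short sketch like the following. The `text` string is a shortened stand-in for the sample input, and the variable names are assumptions made for illustration.

```python
import re

# Shortened stand-in for the sample input (an assumption for illustration).
text = ("Intro text here. Nemo enim: ipsam voluptatem. Neque porro quisquam. "
        "Ut enim: ad minima veniam, *31.12.2012, quis nostrum? Quis autem vel eum")

pattern = re.compile(r"(?<![^.]) *([^.]+?:) *((?:(?!\. ).)+)")

# Collect (group 1, group 2) for every match in the text.
found = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]
for key, value in found:
    print(repr(key), "->", repr(value))
```

On this stand-in, group 1 keeps the trailing colon, and group 2 ends at the first period followed by a space, so a value made of several sentences is captured only up to the end of its first sentence.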

This uses the tempered greedy token technique, which matches a series of individual characters that do not begin an unwanted string. For example, if the string were "concatenate", (?:(?!cat).)+ would match the first three letters but not the second 'c', so the match would be 'con'.
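The "concatenate" example can be checked directly. This is a small sketch in which the `tempered` and `match` names are illustrative.

```python
import re

# Each character is consumed only if "cat" does not start at that position.
tempered = re.compile(r"(?:(?!cat).)+")

match = tempered.match("concatenate")
print(match.group())  # con
```

The match stops before the second 'c' because the negative lookahead (?!cat) fails there.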

Just for fun, here's a Python, non-regex solution:

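latin = """[the sample input text]"""
# Put a marker before every ':' and split on it, so each piece after the
# first starts with the colon that ends a key.
new_lat = latin.replace(":", "xxx:").split("xxx")
for curr_ind, l in enumerate(new_lat):
    if ":" in l:
        prev = new_lat[curr_ind - 1]
        # The key is whatever follows the last '. ' in the previous piece
        # (or the whole piece when the key opens the string).
        prev_brek = prev.rfind('. ')
        stub = prev[prev_brek + 2:] if prev_brek != -1 else prev
        # The value ends at the last '. ' in this piece; the final piece has
        # no following key, so it is kept whole.
        cur_brek = l.rfind('. ') if curr_ind < len(new_lat) - 1 else len(l)
        new_l = stub + l[:cur_brek]
        print(new_l)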

Output is the two text blocks starting from the key.
