简体   繁体   中英

Extract Keys and Values from text using regular expressions

I have a big number of strings that I need to parse. These strings contain information that is put in key-value pairs.

Sample input text:

Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim: ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim: ad minima veniam, *31.12.2012, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur

Key information:

  • A key starts either from the beginning of the string or after \.
  • A key ends always with :
  • The key is immediately followed by a value
  • This value continues until the next key or until the last symbol in the string
  • There are a multiple of key-value pairs, which I don't know

Expected Output

{
    "Nemo enim": "ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem",
    
    "Ut enim": "ad minima veniam, *31.12.2012, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur. Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur"
}

The regex that I have so far is ([üöä\w\s]*)\: (.*?)\. . Suffice it to say it doesn't provide the expected output.

This regex ([^:.]+):\s*([^:]+)(?=\.\s+|$) does the job.

Demo & explanation

You can match the following regular expression, which saves the keys and values to capture groups 1 and 2.

r'(?<![^.]) *([^.]+?:) *((?:(?!\. ).)+)'

Start your engine! | Python code

Python's regex engine performs the following operations.

(?<![^.])    : negative lookbehind asserts current location is not
               preceded by a character other than '.'
\ *          : match 0+ spaces
(            : begin capture group 1
  [^.]+?     : match 1+ characters other than '.', lazily
  :          : match ':'
)            : end capture group 1
\ *          : match 0+ spaces
(            : begin capture group 2
  (?:        : begin non-capture group
    (?!\. )  : negative lookahead asserts current position is not
               followed by a period followed by a space
    .        : match any character other than a line terminator
  )+         : end non-capture group and execute 1+ times
)            : end capture group 2

This uses the tempered greedy token technique, which matches a series of individual characters that do not begin an unwanted string. For example, if the string were "concatenate" , (?:(?:.cat).)+ would match the first three letters but not the second 'c' , so the match would be 'con' .

Just for fun, here's a python, non-regex solution:

latin = """[the sample input text]"""
new_lat = latin.replace(":","xxx:").split('xxx')
for l in new_lat:
    if ":" in l:        
        curr_ind = new_lat.index(l)
        cur_brek = l.rfind('. ')
        prev_brek = new_lat[curr_ind-1].rfind('. ')
        stub = new_lat[curr_ind-1][prev_brek+2:]
        new_l = stub+l[:cur_brek]
        print(new_l)

Output is the two text blocks starting from the key.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM