Extract Keys and Values from text using regular expressions
I have a large number of strings that I need to parse. These strings contain information stored as key-value pairs.
Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim: ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim: ad minima veniam, *31.12.2012, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur
Each key follows a period (\.) and ends with a colon (:). The expected output is:
{
"Nemo enim": "ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem",
"Ut enim": "ad minima veniam, *31.12.2012, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur. Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur"
}
The regular expression I have so far is ([üöä\w\s]*)\: (.*?)\. but it does not produce the expected output.
This regular expression does the job: ([^:.]+):\s*([^:]+)(?=\.\s+|$)
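As a quick check, that pattern can be run with Python's re.findall. This is only a minimal sketch on a shortened, made-up sample with the same shape as the input; the names text, pattern and pairs are illustrative and not part of the original post. Group 1 begins right after the preceding period, so each key carries a leading space, which is stripped here.

```python
import re

# Hypothetical stand-in for the sample input: "Key: value" pairs,
# each key following a period.
text = ("Intro sentence. Key one: value one, 31.12.2012. More value. "
        "Key two: value two")

pattern = re.compile(r'([^:.]+):\s*([^:]+)(?=\.\s+|$)')

# findall returns one (key, value) tuple per match. Keys keep the space
# that follows the preceding period, so strip them.
pairs = {key.strip(): value for key, value in pattern.findall(text)}
print(pairs)
```

Because group 2 is [^:]+, a value that itself contains a colon would not be captured whole, so this sketch assumes colons appear only after keys.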
You can match the following regular expression, which saves the keys and values to capture groups 1 and 2.
r'(?<![^.]) *([^.]+?:) *((?:(?!\. ).)+)'
Python's regex engine performs the following operations.
(?<![^.])    : negative lookbehind asserts the current location is not
               preceded by a character other than '.'
\ *          : match 0+ spaces (the space is escaped here for visibility)
(            : begin capture group 1
  [^.]+?     : match 1+ characters other than '.', lazily
  :          : match ':'
)            : end capture group 1
\ *          : match 0+ spaces
(            : begin capture group 2
  (?:        : begin non-capture group
    (?!\. )  : negative lookahead asserts the current position is not
               followed by a period followed by a space
    .        : match any character other than a line terminator
  )+         : end non-capture group and execute 1+ times
)            : end capture group 2
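The breakdown above can be tried out with re.findall. The sketch below uses a shortened, made-up sample rather than the full input, and the names text, pattern and matches are illustrative assumptions.

```python
import re

# Hypothetical stand-in for the sample input. The date contains periods
# that are not followed by a space, so they do not end the value.
text = "Intro sentence. Key one: value one, 31.12.2012 more. Key two: value two"

pattern = re.compile(r'(?<![^.]) *([^.]+?:) *((?:(?!\. ).)+)')

# Each match yields (group 1, group 2); group 1 includes the trailing ':'.
matches = pattern.findall(text)
for key, value in matches:
    print(repr(key), repr(value))
```

With this pattern a value ends at the first period that is followed by a space, so a value spanning several sentences is cut after its first sentence.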
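This uses the tempered greedy token technique, which matches a run of single characters, none of which begins an unwanted string. For example, given the string "concatenate", (?:(?!cat).)+ matches the first three letters but not the second 'c', so the match is 'con'.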
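The "concatenate" example can be checked directly, assuming the token is written with a negative lookahead as (?:(?!cat).)+. The names tempered and first are illustrative only.

```python
import re

# Tempered greedy token: consume one character at a time, but only while
# the text at the current position does not start with "cat".
tempered = re.compile(r'(?:(?!cat).)+')

first = tempered.match("concatenate").group()
print(first)  # con
```

The lookahead is tested before every single character, which is why the match stops just before the second 'c' instead of running to the end of the string.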
Just for fun, here is a non-regex solution in Python:
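latin = """[the sample input text]"""
# Put a marker before every colon and split on it, so every piece after the
# first starts with ':' and its key is the tail of the piece before it.
new_lat = latin.replace(":", "xxx:").split("xxx")
for curr_ind, l in enumerate(new_lat):
    if ":" in l:
        prev = new_lat[curr_ind - 1]
        # The key is the text after the last '. ' of the previous piece.
        stub = prev[prev.rfind(". ") + 2:] if ". " in prev else prev
        # The value ends at the last '. ' (the next key follows it); the
        # final piece has no following key, so it is kept whole.
        cur_brek = l.rfind(". ")
        if cur_brek == -1 or curr_ind == len(new_lat) - 1:
            cur_brek = len(l)
        print(stub + l[:cur_brek])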
The output is two blocks of text, each starting with its key.