Extracting data using regex in python

Question

I have a string variable whose data looks something like this:

a:15:{s:6:"status";s:6:"Active";s:9:"checkdate";s:8:"20130807";s:11:"companyname";s:4:"test";s:11:"validdomain";s:19:"test";s:7:"md5hash";s:32:"501yd361fe10644ea1184412c3e89dce";s:7:"regdate";s:10:"2013-08-06";s:14:"registeredname";s:10:"TestName";s:9:"serviceid";s:1:"8";s:11:"nextduedate";s:10:"0000-00-00";s:12:"billingcycle";s:8:"OneTime";s:7:"validip";s:15:"xxx.xxx.xxx.xxx";s:14:"validdirectory";s:5:"/root";s:11:"productname";s:20:"SomeProduct";s:5:"email";s:19:"testmail@test.com";s:9:"productid";s:1:"1";}

I am trying to extract the quoted data into a dictionary as a key-value pair like so:

{"status":"Active","checkdate":20130807,.............}

I tried extracting it using the following:

tempkeyresults = re.findall('"(.*?)"([^"]+)</\\1>', localdata, flags=re.IGNORECASE)

I'm quite new to regex and I assume what I am trying to query translates to " find and extract all data between " and " and extract it before the next "... " However, this returns and empty string([]). Could someone tell me where I am wrong?

Thanks in advance

Answer 1

How about this?

>>> import re
>>> s = 'a:15:{s:6:"status";s:6:"Active";s:9:"checkdate";s:8:"20130807";s:11:"companyname";s:4:"test";s:11:"validdomain";s:19:"test";s:7:"md5hash";s:32:"501yd361fe10644ea1184412c3e89dce";s:7:"regdate";s:10:"2013-08-06";s:14:"registeredname";s:10:"TestName";s:9:"serviceid";s:1:"8";s:11:"nextduedate";s:10:"0000-00-00";s:12:"billingcycle";s:8:"OneTime";s:7:"validip";s:15:"xxx.xxx.xxx.xxx";s:14:"validdirectory";s:5:"/root";s:11:"productname";s:20:"SomeProduct";s:5:"email";s:19:"testmail@test.com";s:9:"productid";s:1:"1";}'
>>> results = re.findall('"(\w+)"', s)
>>> dict(zip(*[iter(results)] * 2))
{'status': 'Active', 'companyname': 'test', 'validdomain': 'test', 'md5hash': '501yd361fe10644ea1184412c3e89dce', 'regdate': 'registeredname', 'TestName': 'serviceid', 'email': 'productid', 'billingcycle': 'OneTime', 'validip': 'validdirectory', '8': 'nextduedate', 'productname': 'SomeProduct', 'checkdate': '20130807'}

\\w means "any word character" (letters, numbers, regardless of case, and underscore (_))
+ means 1 or more.
dict(zip(*[iter(results)] * 2)) is very well explained in this answer

Answer 2

This one, find all the words surrounding by quotes and then slices the list to mapping:

>>> res = re.findall('"(\w+)"', s)
>>> i = iter(res)
>>> dict(zip(*[i]*2))
{'status': 'Active', 'companyname': 'test', 'validdomain': 'test', 'md5hash': '501yd361fe10644ea1184412c3e89dce', 'regdate': 'registeredname', 'TestName': 'serviceid', 'email': 'productid', 'billingcycle': 'OneTime', 'validip': 'validdirectory', '8': 'nextduedate', 'productname': 'SomeProduct', 'checkdate': '20130807'}

Or use this one. This will use regex to find all the pairs(adjacent two):

>>> res = re.findall('"(\w+)"(?:.*?)"(\w+)"', s)
>>> res
[('status', 'Active'), ('checkdate', '20130807'), ('companyname', 'test'), ('validdomain', 'test'), ('md5hash', '501yd361fe10644ea1184412c3e89dce'), ('regdate', 'registeredname'), ('TestName', 'serviceid'), ('8', 'nextduedate'), ('billingcycle', 'OneTime'), ('validip', 'validdirectory'), ('productname', 'SomeProduct'), ('email', 'productid')]
>>> dict(res)
{'status': 'Active', 'companyname': 'test', 'validdomain': 'test', 'md5hash': '501yd361fe10644ea1184412c3e89dce', 'regdate': 'registeredname', 'TestName': 'serviceid', 'email': 'productid', 'billingcycle': 'OneTime', 'validip': 'validdirectory', '8': 'nextduedate', 'productname': 'SomeProduct', 'checkdate': '20130807'}

Answer 3

You can do it without regular expressions.

parts = s.split('"')[1::2] # get all quoted text in a list
keys, values = parts[::2], parts[1::2] # take even and odd items (keys, values)
results = dict(zip(keys, values)) # turn it into a dict

results:

{'status': 'Active', 'companyname': 'test', 'validdomain': 'test', 'productid': '1', 'md5hash': '501yd361fe10644ea1184412c3e89dce', 'regdate': '2013-08-06', 'registeredname': 'TestName', 'email': 'testmail@test.com', 'serviceid': '8', 'nextduedate': '0000-00-00', 'billingcycle': 'OneTime', 'validip': 'xxx.xxx.xxx.xxx', 'productname': 'SomeProduct', 'checkdate': '20130807', 'validdirectory': '/root'}

Extracting data using regex in python

Question

3 answers

solution1
2 2013-08-07 07:43:41

solution2
1 ACCPTED 2013-08-07 07:44:47

solution3
1 2013-08-07 08:13:08

Extracting data using regex in python

Question

3 answers

solution1 2 2013-08-07 07:43:41

solution2 1 ACCPTED 2013-08-07 07:44:47

solution3 1 2013-08-07 08:13:08

solution1
2 2013-08-07 07:43:41

solution2
1 ACCPTED 2013-08-07 07:44:47

solution3
1 2013-08-07 08:13:08