I am looking to extract a list of tuples from the following string:
text='''Consumer Price Index:
+0.2% in Sep 2020
Unemployment Rate:
+7.9% in Sep 2020
Producer Price Index:
+0.4% in Sep 2020
Employment Cost Index:
+0.5% in 2nd Qtr of 2020
Productivity:
+10.1% in 2nd Qtr of 2020
Import Price Index:
+0.3% in Sep 2020
Export Price Index:
+0.6% in Sep 2020'''
I am using 'import re' for the process.
The output should be something like: [('Consumer Price Index', '+0.2%', 'Sep 2020'), ...]
I want to use a re.findall call that produces the above output. So far I have this:
re.findall(r"(:\Z)\s+(%\Z+)(\Ain )", text)
Here I am trying to match the characters before ':', then the characters before '%', and then the characters after 'in'.
I'm really just clueless on how to continue. Any help would be appreciated. Thanks!
You can use
re.findall(r'(\S.*):\n\s*(\+?\d[\d.]*%)\s+in\s+(.*)', text)
# => [('Consumer Price Index', '+0.2%', 'Sep 2020'), ('Unemployment Rate', '+7.9%', 'Sep 2020'), ('Producer Price Index', '+0.4%', 'Sep 2020'), ('Employment Cost Index', '+0.5%', '2nd Qtr of 2020'), ('Productivity', '+10.1%', '2nd Qtr of 2020'), ('Import Price Index', '+0.3%', 'Sep 2020'), ('Export Price Index', '+0.6%', 'Sep 2020')]
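For reference, the call above can be run as a self-contained script. This is only a sketch: the pattern is the same one, split across lines with re.VERBOSE so each piece can carry a comment, and the names pattern and rows are chosen here for illustration.

```python
import re

text = '''Consumer Price Index:
+0.2% in Sep 2020
Unemployment Rate:
+7.9% in Sep 2020
Producer Price Index:
+0.4% in Sep 2020
Employment Cost Index:
+0.5% in 2nd Qtr of 2020
Productivity:
+10.1% in 2nd Qtr of 2020
Import Price Index:
+0.3% in Sep 2020
Export Price Index:
+0.6% in Sep 2020'''

# Same pattern as in the answer, laid out with re.VERBOSE.
# In verbose mode literal whitespace in the pattern is ignored,
# which is fine here because the pattern only uses \s and \n escapes.
pattern = re.compile(r"""
    (\S.*):\n         # group 1: the label, up to the colon ending the line
    \s*               # optional leading whitespace on the value line
    (\+?\d[\d.]*%)    # group 2: optional plus sign, digits or dots, percent sign
    \s+in\s+          # the word in, surrounded by whitespace
    (.*)              # group 3: the rest of the line (the period)
""", re.VERBOSE)

rows = pattern.findall(text)
print(rows[0])  # ('Consumer Price Index', '+0.2%', 'Sep 2020')
```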
Details
(\S.*) - Group 1: a non-whitespace char followed by zero or more chars other than line break chars, as many as possible
: - a colon
\n - a newline
\s* - zero or more whitespace chars
(\+?\d[\d.]*%) - Group 2: an optional +, a digit, zero or more digits or dots, and a %
\s+in\s+ - the word in surrounded by one or more whitespace chars
(.*) - Group 3: zero or more chars other than line break chars, as many as possible

Regex is not a good way to approach this. It gets hard to read and maintain very quickly. This can be done much more cleanly with Python's string methods:
list_of_lines = [
    line.strip()                  # remove leading and trailing whitespace
    for line in text.split("\n")  # split the text into lines
    if line.strip()               # skip empty and whitespace-only lines
]
list_of_lines is now:
['Consumer Price Index:', '+0.2% in Sep 2020', 'Unemployment Rate:', '+7.9% in Sep 2020', 'Producer Price Index:', '+0.4% in Sep 2020', 'Employment Cost Index:', '+0.5% in 2nd Qtr of 2020', 'Productivity:', '+10.1% in 2nd Qtr of 2020', 'Import Price Index:', '+0.3% in Sep 2020', 'Export Price Index:', '+0.6% in Sep 2020']
Now all we have to do is build tuples from pairs of elements of this list.
def pairwise(iterable):
    "s -> (s0, s1), (s2, s3), (s4, s5), ..."
    a = iter(iterable)
    return zip(a, a)
(from here )
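The trick is that both arguments to zip are the same iterator, so every tuple consumes two consecutive items. A small sketch on a made-up list (the sample values are only for illustration) shows this, and also that an unpaired trailing element is silently dropped:

```python
def pairwise(iterable):
    "s -> (s0, s1), (s2, s3), (s4, s5), ..."
    a = iter(iterable)
    return zip(a, a)

# zip returns a lazy iterator in Python 3, so wrap it in list() to see the pairs.
pairs = list(pairwise(["a", 1, "b", 2, "c"]))
print(pairs)  # [('a', 1), ('b', 2)]  (the unpaired 'c' is dropped)
```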
Now we can get our desired output:
print(list(pairwise(list_of_lines)))
[('Consumer Price Index:', '+0.2% in Sep 2020'), ('Unemployment Rate:', '+7.9% in Sep 2020'), ('Producer Price Index:', '+0.4% in Sep 2020'), ('Employment Cost Index:', '+0.5% in 2nd Qtr of 2020'), ('Productivity:', '+10.1% in 2nd Qtr of 2020'), ('Import Price Index:', '+0.3% in Sep 2020'), ('Export Price Index:', '+0.6% in Sep 2020')]
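Note that these pairs still carry the trailing colon and keep the value and the period in one string, so they are not yet the three-element tuples the question asks for. One possible way to finish the job with string methods is sketched below; it assumes every value line contains the separator " in " exactly as in the sample, and the names lines and result are chosen here for illustration.

```python
text = '''Consumer Price Index:
+0.2% in Sep 2020
Unemployment Rate:
+7.9% in Sep 2020
Producer Price Index:
+0.4% in Sep 2020
Employment Cost Index:
+0.5% in 2nd Qtr of 2020
Productivity:
+10.1% in 2nd Qtr of 2020
Import Price Index:
+0.3% in Sep 2020
Export Price Index:
+0.6% in Sep 2020'''

# Strip each line and skip blank ones.
lines = [line.strip() for line in text.splitlines() if line.strip()]

# Pair consecutive lines (label, value) by zipping one iterator with itself.
it = iter(lines)
result = []
for label, value in zip(it, it):
    # Assumption: the value line looks like "<amount> in <period>".
    amount, _, period = value.partition(" in ")
    result.append((label.rstrip(":"), amount, period))

print(result[0])  # ('Consumer Price Index', '+0.2%', 'Sep 2020')
```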