Cleaning spreadsheet using regex

Question

I would like to remove all but the statistic in every entry of the following:

#ChangeColumnFullTimeGraduatesEmployedAtGraduation:74.3%    #ChangeColumnAverageStartingSalaryAndBonus:$134,360 3.4 #ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:81.4%  #ChangeColumnPeerAssessmentScoreOutOf5.:4.3
#ChangeColumnFullTimeGraduatesEmployedAtGraduation:82.0%    #ChangeColumnAverageStartingSalaryAndBonus:$127,368 3.29    #ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:89.8%  #ChangeColumnPeerAssessmentScoreOutOf5.:4.1
#ChangeColumnFullTimeGraduatesEmployedAtGraduation:80.7%    #ChangeColumnAverageStartingSalaryAndBonus:$123,177 3.4 #ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:92.5%  #ChangeColumnPeerAssessmentScoreOutOf5.:4.0

I've been trying to use regular expressions (regex). Since the desired final output consists of nothing more than numbers and a percent sign or dollar sign, this is what I cobbled together:

import re
import csv

with(open('sheet.csv','rU')) as f:

    for row in f:
        re.sub([^0-9\$\%],'',row)

which raises a syntax error on this line:

re.sub([^0-9\$\%],'',row)

Answer 1

Regexes are parsed from strings, so pass the pattern to re.sub as a string, i.e.

>>> re.sub(r'[^0-9\$\%]','',row)
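As a minimal sketch of that fix (the sample row is copied from the question, written here with single spaces between fields; the name `cleaned` is only illustrative): the pattern goes in a raw string, and because re.sub returns a new string rather than modifying `row`, the result has to be captured.

```python
import re

# First data row from the question, separated by single spaces here.
row = ('#ChangeColumnFullTimeGraduatesEmployedAtGraduation:74.3% '
       '#ChangeColumnAverageStartingSalaryAndBonus:$134,360 3.4 '
       '#ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:81.4% '
       '#ChangeColumnPeerAssessmentScoreOutOf5.:4.3')

# The pattern must be a string; a raw string keeps the backslashes literal.
cleaned = re.sub(r'[^0-9\$\%]', '', row)

# Everything except digits, '$' and '%' is deleted, including the separators
# and the decimal points, so the values run together.
print(cleaned)  # 743%$13436034814%543
```

The call now runs, but the values are fused into one string, so the substitution on its own is probably not the result you are after.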

Alternatively, you may want to split instead:

>>> [c for c in re.split(r'[^0-9\$\%\.]',row) if c]
['74.3%', '$134', '360', '3.4', '81.4%', '5.', '4.3']

That is still not quite right, because your column labels contain digits (note the stray '5.' in the output above). If your input looks exactly like your example, something like this might work better:

>>> [c for c in re.split(r'#[^:]+:|[\s,]+', row) if c]
['74.3%', '$134', '360', '3.4', '81.4%', '4.3']
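Note that both split-based variants break '$134,360' at the comma. A minimal sketch of one alternative, assuming every label has the form '#Label:' and the fields are separated by whitespace as in the example: delete the labels first, then split on whitespace only. The names `LABEL` and `extract_stats` are hypothetical helpers, not part of the original answer.

```python
import re

# A '#', the label text (anything but a colon), and the trailing colon.
LABEL = re.compile(r'#[^:]+:')

def extract_stats(row):
    """Drop the '#Label:' prefixes, then split what is left on whitespace."""
    return LABEL.sub('', row).split()

# First data row from the question, with a mix of spaces and a tab between fields.
row = ('#ChangeColumnFullTimeGraduatesEmployedAtGraduation:74.3%    '
       '#ChangeColumnAverageStartingSalaryAndBonus:$134,360 3.4\t'
       '#ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:81.4%  '
       '#ChangeColumnPeerAssessmentScoreOutOf5.:4.3\n')

print(extract_stats(row))
# ['74.3%', '$134,360', '3.4', '81.4%', '4.3']
```

To clean the whole file, the same function could be called on each `row` in the `for row in f` loop from the question, collecting or writing out the returned lists.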

solution1
4 ACCEPTED 2013-07-25 21:01:53
