Cleaning spreadsheet using regex
I would like to remove everything except the statistics from every entry of the following:
#ChangeColumnFullTimeGraduatesEmployedAtGraduation:74.3% #ChangeColumnAverageStartingSalaryAndBonus:$134,360 3.4 #ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:81.4% #ChangeColumnPeerAssessmentScoreOutOf5.:4.3
#ChangeColumnFullTimeGraduatesEmployedAtGraduation:82.0% #ChangeColumnAverageStartingSalaryAndBonus:$127,368 3.29 #ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:89.8% #ChangeColumnPeerAssessmentScoreOutOf5.:4.1
#ChangeColumnFullTimeGraduatesEmployedAtGraduation:80.7% #ChangeColumnAverageStartingSalaryAndBonus:$123,177 3.4 #ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:92.5% #ChangeColumnPeerAssessmentScoreOutOf5.:4.0
I've been trying to use regular expressions (regex). Since the desired final output consists of nothing more than numbers plus a percent sign or dollar sign, this is what I cobbled together:
import re
import csv
with(open('sheet.csv','rU')) as f:
    for row in f:
        re.sub([^0-9\$\%],'',row)
which returns a syntax error on this line:
re.sub([^0-9\$\%],'',row)
Regexes are parsed from strings, so pass the pattern to re.sub as a string, i.e.
>>> re.sub(r'[^0-9\$\%]','',row)
or maybe you want to split instead:
>>> [c for c in re.split(r'[^0-9\$\%\.]',row) if c]
['74.3%', '$134', '360', '3.4', '81.4%', '5.', '4.3']
That is actually still not correct, because your column labels contain digits (the '5.' in the output comes from a label). If your input looks exactly like your example, something like this might work better:
>>> [c for c in re.split(r'#[^:]+:|[ ,]',row) if c]
['74.3%', '$134', '360', '3.4', '81.4%', '4.3']
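Note that splitting on the comma still breaks the salary into '$134' and '360'. If you would rather keep it in one piece, a minimal sketch is to delete the labels and then split on whitespace only. This assumes every label starts with '#' and ends at the first ':', as in your example. The names extract_stats and clean_file are made up for illustration, and clean_file simply treats the file as plain text, one entry per line:

```python
import re

# One column label: from '#' up to and including the first ':'.
LABEL = re.compile(r'#[^:]+:')

def extract_stats(row):
    """Return the values of one row with the '#...:' labels removed."""
    # Once the labels are gone the values are separated by whitespace only,
    # so a plain split() keeps '$134,360' together.
    return LABEL.sub('', row).split()

def clean_file(path):
    """Yield the extracted values for each non-empty line of a text file."""
    # Plain text mode; the 'rU' mode from the question was removed in Python 3.11.
    with open(path) as f:
        for line in f:
            if line.strip():
                yield extract_stats(line)

# First row of the example data from the question.
sample = (
    "#ChangeColumnFullTimeGraduatesEmployedAtGraduation:74.3% "
    "#ChangeColumnAverageStartingSalaryAndBonus:$134,360 3.4 "
    "#ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:81.4% "
    "#ChangeColumnPeerAssessmentScoreOutOf5.:4.3"
)

print(extract_stats(sample))
# ['74.3%', '$134,360', '3.4', '81.4%', '4.3']
```

With your file this would be used as `for stats in clean_file('sheet.csv'): ...`. If the real data contains a ':' inside a value or a '#' outside a label, the pattern would need adjusting.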