[英]Find and replace acronyms by full province names in SAS
我需要通过以下方式替换 SAS 数据集中的字符串:
OTTAWA ON
应替换为OTTAWA ONTARIO
WHATEVER QC
应替换为WHATEVER QUEBEC
等等。但是, HOUSE ON THE HILL
不应成为HOUSE ONTARIO THE HILL
。
也就是说,我想用ONTARIO
替换所有ON
实例,但前提是ON
作为字符串中的最后一个单词存在
您可以使用正则表达式来执行此操作。 根据您的描述,我认为以下应该有效。
myString = prxchange("s/(.*)( ON)$/$1 ONTARIO/",-1,strip(myString));
myString = prxchange("s/(.*)( QC)$/$1 QUEBEC/",-1,strip(myString));
使用单独的控制数据集来维护您想要的替换(邮政编码 -> 省)。
假设您仅对“令牌”(CA 邮政编码)作为最终“单词”执行转换,控制数据、数据和转换的示例如下:
data O_Canada(label="Our home and native land");
length postal $2 province $26 ;
input postal& province&; * suffix & means data fields separated by >1 space;
datalines;
ON Ontario
QC Quebec
NS Nova Scotia
NB New Brunswick
MB Manitoba
BC British Columbia
PE Prince Edward Island
SK Saskatchewan
AB Alberta
NL Newfoundland and Labrador
;
data cities(label='Some popular places');
length place $100;
input place $CHAR50.;
datalines;
CALGARY AB
VANCOUVER BC
WINNIPEG MB
MONCTON NB
ST. JONHS NL
HALIFAX NS
TORONTO ON
MONTREAL QC
SAKATOON SK
CHARLOTTETOWN PE
WHITEHORSE YT
YELLOWKNIFE NT
IQALUIT NU
GOLDMINE YUKON
;
data cities;
modify cities;
if _n_ = 1 then do;
length postal $3 province $26; * postal 1 bigger so scanned postal will not always match;
declare hash provinces(dataset:'O_Canada');
provinces.defineKey('postal');
provinces.defineData('province');
provinces.defineDone();
call missing(postal, province);
drop postal province;
end;
postal = scan(place,-1,' ');
if provinces.find() eq 0 then do;
* this inline replacement presumes all postal codes are 2 characters;
* -1 from length will replace starting from found postal;
substr(place,length(place)-1) = province; * inline replacement;
replace;
end;
run;
scan(myString, -1)
返回myString
中的最后一个单词, trim(myString)
删除尾随空格,因此在数据步骤中,这可以完成工作:
cutString = substr(myString, length(myString) - 2);
select scan(myString, -1)
when 'ON' myString = cutString || 'ONTARIO';
when 'QC' myString = cutString || 'QUEBEC';
end;
或在 SQL
select case scan(myString, -1)
when 'ON' then trim(myString) || 'TARIO'
when 'QC' then substr(myString, length() - 2) || 'QUEBEC'
else myString end as myString
from YOU_KNOW_BETTER_THAN_I_DO;
data GEOGRAPHY;
file datalines truncover;
informat geo $2. graphy $32.;
input geo $ graphy $;
datalines;
ON ONTARIO
QC QUEBEC
;
proc sql;
select whatever_you_want,
case graphy
when '' then myString
else substr(myString, length(myString) - length(geo)) || graphy
end as myString
from HAVE left joion GEOGRAPHY on scan(myString, -1) eq geo;
quit;
@Sonny,我认为正则表达式非常好。 还有@astel,还有另一种容易理解的方法:
data test;
InText = 'HOUSE ON THE HILL';
output;
InText = 'OTTAWA ON';
output;
run;
data _null_;
set test;
if cats(reverse(InText)) =: 'NO ' then OutText = tranwrd(InText,' ON',' ONTARIO');
put Intext = @30 OutText = ;
run;
output 将是
InText=HOUSE ON THE HILL OutText=
InText=OTTAWA ON OutText=OTTAWA ONTARIO
反转变量,以便您可以轻松判断新变量是否以NO
开头,这意味着原始变量以ON
结尾。 然后使用tranwrd()
函数进行替换工作。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.