
Find and replace acronyms by full province names in SAS

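I need to replace strings in a SAS data set as follows:

  • OTTAWA ON should be replaced by OTTAWA ONTARIO
  • WHATEVER QC should be replaced by WHATEVER QUEBEC

and so on. However, HOUSE ON THE HILL should not become HOUSE ONTARIO THE HILL.

That is, I want to replace every instance of ON with ONTARIO, but only when ON is the last word in the string.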

You can use regular expressions to do this. Based on your description, I think the following should work.

myString = prxchange("s/(.*)( ON)$/$1 ONTARIO/",-1,strip(myString));
myString = prxchange("s/(.*)( QC)$/$1 QUEBEC/",-1,strip(myString));
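
To try the two statements on a few values, a minimal self-contained sketch could look like the following. The data set name demo, the variable width of 50, and the sample rows are assumptions for illustration. The width matters because the province name is longer than the abbreviation, so a variable that is too narrow would truncate the result.

```sas
data demo;
  length myString $50;              /* assumed width; must hold the expanded name */
  infile datalines truncover;
  input myString $char50.;
  /* strip() removes trailing blanks so that $ matches right after the code */
  myString = prxchange('s/(.*)( ON)$/$1 ONTARIO/', -1, strip(myString));
  myString = prxchange('s/(.*)( QC)$/$1 QUEBEC/', -1, strip(myString));
  put myString=;                    /* write each result to the log */
  datalines;
OTTAWA ON
WHATEVER QC
HOUSE ON THE HILL
;
```

The first two rows should come out as OTTAWA ONTARIO and WHATEVER QUEBEC, while HOUSE ON THE HILL is left unchanged because its ON is not at the end of the string.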

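Use a separate control data set to maintain the replacements you want (postal code -> province).

  • Load the control data into a hash
  • Process the data, scanning out the last "word"
  • If the word is a key in the hash, replace the word with the province value.

Assuming you only transform a "token" (a CA postal code) when it is the final "word", an example of the control data, the data, and the transformation follows: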

data O_Canada(label="Our home and native land");
length postal $2 province $26 ;
input postal& province&;     * suffix & means data fields separated by >1 space;
datalines;
ON  Ontario
QC  Quebec
NS  Nova Scotia
NB  New Brunswick
MB  Manitoba
BC  British Columbia
PE  Prince Edward Island
SK  Saskatchewan
AB  Alberta
NL  Newfoundland and Labrador
;

data cities(label='Some popular places');
length place $100;
input place $CHAR50.;
datalines;
CALGARY AB
VANCOUVER BC
WINNIPEG MB
MONCTON NB
ST. JOHN'S NL
HALIFAX NS
TORONTO ON
MONTREAL QC
SASKATOON SK
CHARLOTTETOWN PE
WHITEHORSE YT
YELLOWKNIFE NT
IQALUIT NU
GOLDMINE YUKON
;

data cities;
  modify cities;

  if _n_ = 1 then do;
    length postal $3 province $26;  * postal is 1 wider than the codes so a longer last word is not truncated into a false match;
    declare hash provinces(dataset:'O_Canada');
    provinces.defineKey('postal');
    provinces.defineData('province');
    provinces.defineDone();
    call missing(postal, province);
    drop postal province;
  end;

  postal = scan(place,-1,' ');
  if provinces.find() eq 0 then do;

    * this inline replacement presumes all postal codes are 2 characters;
    * -1 from length will replace starting from found postal;
    
    substr(place,length(place)-1) = province;  * inline replacement;

    replace;
  end; 
run;


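scan(myString, -1) returns the last word in myString, and trim() removes trailing blanks, so in a data step this does the job:

/* everything before the 2-character code; trim() below also drops the separating blank */
cutString = substr(myString, 1, length(myString) - 2);
select (scan(myString, -1));
    when ('ON') myString = trim(cutString) || ' ONTARIO';
    when ('QC') myString = trim(cutString) || ' QUEBEC';
    otherwise;  /* leave every other value unchanged */
end;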

Or in SQL:

select case scan(myString, -1) 
            when 'ON' then trim(myString) || 'TARIO' 
            when 'QC' then substr(myString, 1, length(myString) - 2) || 'QUEBEC'
            else myString end as myString 
from YOU_KNOW_BETTER_THAN_I_DO;

Or with a lookup table:

data GEOGRAPHY;
  infile datalines truncover;
  informat geo $2. graphy $32.;
  input geo $ graphy $;
  datalines;
ON ONTARIO
QC QUEBEC
;
proc sql;
  select whatever_you_want, 
         case graphy 
              when '' then myString
              /* keep everything up to and including the blank before the code */
              else substr(myString, 1, length(myString) - length(geo)) || graphy 
         end as myString
  from HAVE left join GEOGRAPHY on scan(myString, -1) eq geo;
quit;

@Sonny, I think the regular expression is very nice. And @astel, there is another way that is easy to understand:

data test;
  InText = 'HOUSE ON THE HILL';
  output;
  InText = 'OTTAWA ON';
  output;
run;

data _null_;
  set test;

  if cats(reverse(InText)) =: 'NO ' then OutText = tranwrd(InText,' ON',' ONTARIO');
  put Intext = @30 OutText = ;
run;

The output will be

InText=HOUSE ON THE HILL     OutText=
InText=OTTAWA ON             OutText=OTTAWA ONTARIO

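Reverse the variable so that you can easily tell whether the reversed value starts with NO, which means the original value ends with ON. Then use the tranwrd() function to do the replacement.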
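
One thing to watch with this approach: tranwrd() replaces every occurrence of ' ON', not only the last one, so a value that ends in ON and also contains ' ON' earlier would be changed in both places. A minimal sketch that rewrites only the final word is given here. The variable widths and the made-up value FORT ONEIDA ON are assumptions for illustration, and the code presumes the value is longer than the two-character code.

```sas
data _null_;
  length InText $40 OutText $60;    /* assumed widths, room for the longer name */
  do InText = 'HOUSE ON THE HILL', 'FORT ONEIDA ON', 'OTTAWA ON';
    OutText = InText;               /* default: keep the value as it is */
    /* only the final word is examined, so an earlier " ON" is left alone */
    if scan(InText, -1, ' ') = 'ON' then
      OutText = substr(InText, 1, length(InText) - 2) || 'ONTARIO';
    put InText= @30 OutText=;
  end;
run;
```

With this variant FORT ONEIDA ON should become FORT ONEIDA ONTARIO, whereas the tranwrd() call would also alter the " ON" at the start of ONEIDA.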
