
How can I find and replace specific text in a SAS data set?

I have a data set with 400 observations of 4-digit codes, which I would like to pad with a space on both sides.

ex. Dataset 
obs code
1   1111 
2   1112
3   3333
.
.
.
400 5999

How can I go through another large data set and replace every occurrence of any of the 400 padded codes with a single space (" ")?

ex. Large Dataset
obs text 
1   abcdef 1111 abcdef
2   abcdef 1111 abcdef 1112 8888
3   abcdef 1111 abcdef 11128888
... 

Data set that I want:

ex. New Data set
obs text
1   abcdef   abcdef
2   abcdef   abcdef   8888
3   abcdef   abcdef 11128888
...

Note: I'm only looking to replace 4-digit codes that are padded on both sides by a space. So in obs 3, 1112 won't be replaced.

I've tried the following proc sql statement, but it only finds and replaces the first match instead of all the matches.

proc sql;  
    select   
    *,  
    tranwrd(large_dataset.text, trim(small_dataset.code), ' ') as new_text  
from large_dataset  
    left join small_dataset  
    on findw(large_dataset.text, trim(small_dataset.code))
;
quit;

You could just use a DO loop to scan through the small dataset of codes for each record in the large dataset. If you want to use the TRANWRD() function then you will need to add extra space characters.

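data want ;
  set have ;
  length code $4 ;
  /* Loop over every code; stop early once TEXT is blank */
  do i=1 to nobs while (text ne ' ');
    /* POINT= gives direct access, so CODES can be re-read for each record */
    set codes(keep=code) nobs=nobs point=i ;
    /* Prepend a blank so a code at the start of TEXT matches, then drop it */
    text = substr(tranwrd(' '||text,' '||code||' ',' '),2);
  end;
  drop code;
run;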

The DO loop reads the records from your CODES list. Using the POINT= option on the SET statement lets you read the file multiple times. The WHILE clause stops the loop once the TEXT string is empty, since there is nothing left to replace at that point.
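To see why the extra space characters matter, here is a minimal sketch comparing TRANWRD() with and without the prepended blank. The dataset name `demo` and the variables `plain` and `padded` are made up for illustration. It assumes TEXT has trailing blanks, as any value shorter than the variable's declared length does.

```sas
data demo;
  length text $40 code $4;
  text = '1111 abcdef 1112';
  code = '1111';
  /* No blank precedes the code at position 1, so ' 1111 ' is never found */
  plain  = tranwrd(text, ' '||code||' ', ' ');
  /* Prepend a blank so the leading code matches, then drop that blank again */
  padded = substr(tranwrd(' '||text, ' '||code||' ', ' '), 2);
  put plain= / padded=;
run;
```

In this sketch `plain` should keep the leading 1111 while `padded` should not. A code at the end of TEXT matches without extra work because the variable is blank-padded on the right.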

If your list of codes is small enough and you can get the right regular expression, you might try the PRXCHANGE() function instead. You can use an SQL step to build the codes into a '|'-separated list that you can use in the regular expression.

proc sql noprint ;
  select code into :codelist separated by '|'
  from codes
;
quit;

data want ;
  set have ;
  text=prxchange("s/\b(&codelist)\b/ /",-1,text);
run;
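Note that \b also matches next to punctuation, so a code written as (1112) would be replaced too. If only blank-delimited codes should change, a lookaround variant might work. This is a sketch, assuming the PRX functions accept the Perl (?<!...) and (?!...) assertions. The `_demo` and `want_strict` dataset names and the sample rows are hypothetical.

```sas
/* Hypothetical demo data, not from the original thread */
data codes_demo;
  input code $4.;
  datalines;
1111
1112
;
run;

data have_demo;
  infile datalines truncover;
  input text $60.;
  datalines;
abcdef 1111 abcdef (1112) 11128888
;
run;

proc sql noprint;
  select code into :codelist separated by '|'
  from codes_demo;
quit;

data want_strict;
  set have_demo;
  /* (?<!\S) : no non-blank character immediately before the code */
  /* (?!\S)  : no non-blank character immediately after the code  */
  text = prxchange("s/(?<!\S)(&codelist)(?!\S)/ /", -1, text);
run;
```

Here only the blank-delimited 1111 should be replaced; (1112) and 11128888 are expected to stay as they are.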

There might be more efficient ways of doing this, but this seems to work fairly well:

/*Create test datasets*/
data codes;
input code;
cards;
1111 
1112
3333
5999
;
run;

data big_dataset;
infile cards truncover;
input text $100.;
cards;
abcdef 1111 abcdef
abcdef 1111 abcdef 1112 8888
abcdef 1111 abcdef 11128888
;
run;

/*Get the number of codes to use for array definition*/
data _null_;
    set codes(obs = 1) nobs = nobs;
    /* SYMPUTX strips blanks and avoids the numeric-to-character conversion note */
    call symputx('ncodes',nobs);
run;

%put ncodes = &ncodes;

data want;
    set big_dataset;
    /*Define and populate array with padded codes*/ 
    array codes{&ncodes} $6 _temporary_;
    if _n_ = 1 then do i = 1 to &ncodes;    
        set codes;
        codes[i] = cat(' ',put(code,4.),' '); 
    end;
    do i = 1 to &ncodes;
        text = tranwrd(text,codes[i],' ');
    end;
    drop i code;
run;

I expect a solution using prxchange is also possible, but I'm not sure how much work it is to construct a regex that matches all of your codes compared to just substituting them one by one.

Taking Tom's solution and putting the code lookup into a hash table. That way the codes dataset is read only once, and the actual lookup is quite fast. If the large dataset is really large, this will make a huge difference.

data want ;
  if _n_ = 1 then do;
    length code $4 ;
    declare hash h(dataset:"codes (keep=code)") ; 
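    h.defineKey("code") ;
    /* Declared explicitly so the iterator is sure to return CODE as a data item */
    h.defineData("code") ;
    h.defineDone() ;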
    call missing (code);
    declare hiter hiter('h') ;
  end;
  set big_dataset ;

  rc = hiter.first() ;
  do while (rc = 0 and text ne ' ') ;
    text = substr(tranwrd(' '||text,' '||code||' ',' '),2) ;
    rc = hiter.next() ;
  end ;
  drop code rc ;
run;

Use an array and a regular expression:

proc transpose data=codes out=temp;
var code;
run;

data want;
if _n_=1 then  set temp;
array var col:;
set big_dataset;
do over var;
text = prxchange(cats('s/\b',var,'\b//'),-1,text);
end;
drop col:;
run;
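The loop above builds a separate pattern for every code on every row. One possible variation is to join the transposed codes into a single alternation and compile it once with PRXPARSE(). This is a sketch only: the `_demo` dataset names, the sample rows, and the variable `rx` are assumptions, and the codes are read as character values.

```sas
/* Hypothetical demo data, not from the original thread */
data codes_demo;
  input code $4.;
  datalines;
1111
1112
;
run;

data big_demo;
  infile datalines truncover;
  input text $60.;
  datalines;
abcdef 1111 abcdef 1112 8888
abcdef 1111 abcdef 11128888
;
run;

proc transpose data=codes_demo out=temp_demo(drop=_name_);
  var code;
run;

data want_demo;
  if _n_ = 1 then do;
    set temp_demo;
    /* Build s/\b(1111|1112)\b// once and keep the compiled pattern id */
    rx = prxparse(cats('s/\b(', catx('|', of col:), ')\b//'));
  end;
  retain rx;
  set big_demo;
  text = prxchange(rx, -1, text);
  drop col: rx;
run;
```

As in the answer above, each matched code is deleted rather than replaced with a blank, so the blanks that surrounded it remain.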
