Optimize SAS proc sql

I am joining two tables with a mini Cartesian join so that every business within the same city and state is paired up, and then using some fuzzy logic to match on business name and street name. The input table has ~3 million records and the output table has ~25 million records, so the query takes an extremely long time to run. I have created indexes on all the columns used in the join and on all the columns used in the WHERE clause.

My next thought was to replace the city/state names with integers, but I'd be adding processing time to create those tables. Does anyone have any other thoughts on decreasing the processing time?

proc sql;
create index output_stname on tbl._output (output_stname);
create index output_namevar on tbl._output (output_namevar);
create index key on tbl._output (key);
create index city on tbl._output (city);
create index state on tbl._output (state);

create index input_stname on tbl._input (input_stname);
create index input_namevar on tbl._input (input_namevar);
create index key_input on tbl._input (key_input);
create index city_input on tbl._input (city_input);
create index state_input on tbl._input (state_input);
quit;

proc sql;
create table tbl._level2 as
select distinct
key_input,
name_input,
address_input,
city_input,
state_input,
zip_input,
key,
business_nm1,
address,
city,
state,
zip,
'2 - Street Name & Business Name Match' as matchtype

from tbl._input a
left join tbl._output b on a.city_input=b.city and a.state_input=b.state
where
    compged(a.input_stname, b.output_stname) <= 50
    and compged(a.input_namevar, b.output_namevar) <= 50
    /* skip name values of one or two characters (same test as the original CASE expression) */
    and length(strip(a.input_namevar)) > 2
    and length(strip(b.output_namevar)) > 2
;
quit;

I would start with a composite index on the output table:

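proc sql;
    /* a composite index name cannot be the same as a column name, so it is not called output_stname */
    create index state_city_names on tbl._output (state, city, output_stname, output_namevar);
quit;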

This should speed up the join. However, the select distinct is still suspicious: it adds a de-duplication pass over the whole result set, so it is generally better to structure the query so that select distinct is not needed.

I would suggest not processing this with SQL. The SQL optimizer can't optimize this query well because of the COMPGED calls and the CASE expression: it has no way of knowing how often those conditions will be true, and COMPGED is very expensive to evaluate. As such, you're going to get a very slow process in any event.
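One way to trim the COMPGED cost itself, whichever approach is used, is the function's optional third argument, cutoff: when the true distance exceeds the cutoff, COMPGED returns the cutoff value instead. A minimal sketch, using made-up street names that are not from the question's data:

```sas
/* Hypothetical sample values, for illustration only */
data _null_;
    a = 'MAIN STREET';
    b = 'MAINE STREET';
    full   = compged(a, b);       /* full generalized edit distance */
    capped = compged(a, b, 51);   /* same value, but never larger than 51 */
    put full= capped=;
run;
```

With a cutoff of 51, a test such as compged(a.input_stname, b.output_stname, 51) <= 50 keeps the same rows as the original comparison, because any distance above 50 comes back as 51 and still fails the test. How much time this saves on the real data is something to measure rather than assume.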

Most likely, a hash solution is best. It's hard to say without looking at the data (for example, how many city/state pairs are there: a huge number of unique ones, or a relatively small number?). But a hash solution will likely be faster, particularly as it avoids the index creation step, provided that either the output table or the input table fits into a hash object in memory.
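A rough sketch of what such a hash lookup could look like, using the table and column names from the question. It assumes the smaller table (tbl._input, ~3 million rows) fits in memory, that city/state have the same type and length on both tables, and it writes to a hypothetical table name, tbl._level2_hash. It is not tested against the real data.

```sas
data tbl._level2_hash;   /* hypothetical output name */
    length matchtype $40;

    /* Put the input table's variables in the PDV without reading any rows */
    if 0 then set tbl._input(keep=key_input name_input address_input city_input
                                  state_input zip_input input_stname input_namevar);

    if _n_ = 1 then do;
        /* multidata:'y' keeps every input row for a given city/state pair */
        declare hash h(dataset: "tbl._input(where=(length(strip(input_namevar)) > 2))",
                       multidata: 'y');
        h.defineKey('city_input', 'state_input');
        h.defineData('key_input', 'name_input', 'address_input', 'city_input',
                     'state_input', 'zip_input', 'input_stname', 'input_namevar');
        h.defineDone();
    end;

    /* Stream the large table once, in whatever order it is stored */
    set tbl._output(keep=key business_nm1 address city state zip
                         output_stname output_namevar);

    /* Same short-name exclusion as the CASE expression in the question */
    if length(strip(output_namevar)) > 2;

    /* Walk every input row that shares this city/state */
    rc = h.find(key: city, key: state);
    do while (rc = 0);
        if compged(input_stname, output_stname) <= 50 then
            if compged(input_namevar, output_namevar) <= 50 then do;
                matchtype = '2 - Street Name & Business Name Match';
                output;
            end;
        rc = h.find_next();
    end;

    drop rc input_stname input_namevar output_stname output_namevar;
run;
```

The nested IF evaluates the second COMPGED only when the street names already match, and no index is needed on either table. This sketch does not reproduce the select distinct; if exact duplicates matter, a PROC SORT with NODUPKEY on the result is one option.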
