
SAS Enterprise Guide / SQL Performance

I'm looking for a little guidance on a SAS/SQL performance issue I'm having. In SAS Enterprise Guide, I've created a program that creates a table. This table has about 90k rows:

proc sql;
  CREATE TABLE test AS
  SELECT id, SUM(myField) AS myFieldTotal
  FROM table1
  GROUP BY id;
quit;

I have a much larger table with millions of rows. Each row has an id. I want to sum values on this table, using only the ids present in the 'test' table. I tried this:

proc sql;
  CREATE TABLE test2 AS
  SELECT big.id, SUM(big.myOtherField) AS myOtherFieldTotal
  FROM big
  INNER JOIN test
    ON test.id = big.id
  GROUP BY big.id;
quit;

The problem I'm having is that the second query, against the big table with millions of records, takes forever to run. I thought the inner join on the subset of ids would help (and maybe it does), but I wanted to make sure I was doing everything I could to speed it up.

I don't have any way to get information on the indexing of the underlying database. I'm more interested in getting the opinion of someone who has more SQL and SAS experience than me.

From what you show in your question, you are joining two SAS data sets, not two database objects. In any case, you can speed up the processing by defining indexes on the JOIN columns used in each table. Assuming you have permission to do so, here are examples:

proc sql;
   create index id on big(id);
   create index id on test(id);
quit;

Of course, you should probably check the table definition before doing that. You can use the DESCRIBE TABLE statement, which writes the table's structure (including any indexes already defined) to the SAS log:

proc sql;
   describe table big;
quit;

Indexes improve access performance at the cost of disk space and update maintenance. Once created, the indexes will be a permanent part of the SAS data set and will be automatically updated if you use SQL INSERT or DELETE statements. But be aware that the indexes will be deleted if you recreate the data set with a simple data step.
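As a rough sketch of the point about indexes being lost on a rebuild: if the big table is recreated with a DATA step, the index can be requested again with the INDEX= data set option, and the MSGLEVEL=I system option asks SAS to note in the log whether an index was actually used. The name big_source is hypothetical and stands in for wherever the rebuilt data comes from.

```sas
/* Ask SAS to write index-usage notes to the log. */
options msglevel=i;

/* Rebuilding a data set with a DATA step drops its indexes, so
   request the index again through the INDEX= data set option.
   work.big_source is a hypothetical source table. */
data work.big (index=(id));
   set work.big_source;
run;

/* Rerun the join and check the log for notes about index usage. */
proc sql;
   create table work.test2 as
   select big.id, sum(big.myOtherField) as myOtherFieldTotal
   from work.big as big
   inner join work.test as test
      on test.id = big.id
   group by big.id;
quit;
```

Whether SAS chooses the index depends on its own cost estimate, so the log notes are the place to confirm it.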

On the other hand, if these tables really are in an external database (Oracle, for example), you have a different challenge. If that's the case, I'd ask a new question and provide a complete example of the SAS code you are using (including any libname statements).

If you are working with non-SAS data, i.e., data that resides in a SQL database (or a NoSQL database, for that matter), you will often see significant performance improvements from pass-through SQL or, if it is supported and you have the licenses for it, in-database processing.

One important point about proc sql vs. pass-through SQL: when proc sql cannot hand the query to the database, it copies the source data into SAS before doing the work, whereas pass-through just requests the result set from the source data provider. In short, a table with 5 million rows can take a lot longer to use with proc sql (even if you are only interested in about 1% of the data) than if you only have to pull that 1% of the data across the network using the pass-through mechanism.
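For illustration, explicit pass-through for the query in the question might look roughly like the sketch below. It assumes that both big and test live in the same Oracle database (the question does not say so) and that SAS/ACCESS Interface to Oracle is licensed. The macro variables db_user, db_pass and db_path are hypothetical placeholders for your own connection details.

```sas
proc sql;
   /* Connection values are placeholders, not real credentials. */
   connect to oracle as ora
      (user=&db_user password=&db_pass path="&db_path");

   /* The inner query is sent to Oracle as written and runs there;
      only the aggregated result set comes back to SAS. */
   create table work.test2 as
   select *
   from connection to ora
      (
         select big.id, sum(big.myOtherField) as myOtherFieldTotal
         from big
         inner join test
            on test.id = big.id
         group by big.id
      );

   disconnect from ora;
quit;
```

If test exists only as a SAS data set, the join cannot be pushed down this way; the ids would first have to be loaded into the database or supplied to the query some other way.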
