Writing Efficient Queries in SAS Using Proc sql with Teradata

EDIT: Here is a more complete set of code that shows exactly what's going on per the answer below.

libname output '/data/files/jeff';
%let DateStart = '01Jan2013'd;
%let DateEnd = '01Jun2013'd;
proc sql;
CREATE TABLE output.id AS (
  SELECT DISTINCT id
  FROM mydb.sale_volume AS sv
  WHERE sv.category IN ('a', 'b', 'c') AND
    sv.trans_date BETWEEN &DateStart AND &DateEnd
);
CREATE TABLE output.sums AS (
  SELECT sv.id, SUM(sv.sales) AS total_sales
  FROM mydb.sale_volume AS sv
  INNER JOIN output.id AS ids
    ON ids.id = sv.id
  WHERE sv.trans_date BETWEEN &DateStart AND &DateEnd
  GROUP BY sv.id
);
quit;

The goal is simply to query the table for some IDs based on category membership. Then I sum these members' activity across all categories.

The above approach is far slower than:

  1. Running the first query to get the subset
  2. Running a second query that sums every ID
  3. Running a third query that inner joins the two result sets.

If I'm understanding correctly, it may be more efficient to make sure that all of my code is completely passed through rather than cross-loading.


After posting a question yesterday, a member suggested I might benefit from asking a separate question on performance that was more specific to my situation.

I'm using SAS Enterprise Guide to write some programs/data queries. I don't have permissions to modify the underlying data, which is stored in Teradata.

My basic problem is writing efficient SQL queries in this environment. For example, I query a large table (with tens of millions of records) for a small subset of IDs. Then, I use this subset to query the larger table again:

proc sql;
CREATE TABLE subset AS (
  SELECT
    id
  FROM
    bigTable
  WHERE
    someValue = x AND
    date BETWEEN a AND b
);
quit;

This works in a matter of seconds and returns 90k IDs. Next, I want to query this set of IDs against the big table, and problems ensue. I want to sum values over time for the IDs:

proc sql;
CREATE TABLE subset_data AS (
  SELECT
    bigTable.id,
    SUM(bigTable.value) AS total
  FROM
    bigTable
  INNER JOIN subset
    ON subset.id = bigTable.id
  WHERE
    bigTable.date BETWEEN a AND b
  GROUP BY
    bigTable.id
);
quit;

For whatever reason, this takes a really long time. The difference is that the first query filters on 'someValue'. The second looks at all activity, regardless of what's in 'someValue'. For example, I could flag every customer who orders a pizza. Then I would look at every purchase for all customers who ordered pizza.

I'm not overly familiar with SAS so I'm looking for any advice on how to do this more efficiently or speed things up. I'm open to any thoughts or suggestions and please let me know if I can offer more detail. I guess I'm just surprised the second query takes so long to process.

The most critical thing to understand when using SAS to access data in Teradata (or any other external database for that matter) is that the SAS software prepares SQL and submits it to the database. The idea is to try and relieve you (the user) from all the database specific details. SAS does this using a concept called "implicit pass-through", which just means that SAS does the translation from SAS code into DBMS code. Among the many things that occur is data type conversion: SAS only has two (and only two) data types, numeric and character.
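One way to see what implicit pass-through actually sends to the database is SAS/ACCESS tracing. The following is only a minimal sketch: it assumes a Teradata libref named TDATA has already been assigned, and both TDATA and bigTable are placeholder names.

```sas
/* Write the SQL that SAS/ACCESS generates for the DBMS to the SAS log. */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc sql;
   /* TDATA and bigTable are placeholder names, not a real system. */
   select count(*) as n_rows
   from TDATA.bigTable;
quit;

/* Turn tracing off again when finished. */
options sastrace=off;
```

If the log shows only a plain SELECT of the raw rows rather than the aggregate, that suggests SAS is pulling the data back and doing the work itself.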

SAS deals with translating things for you but it can be confusing. For example, I've seen "lazy" database tables defined with VARCHAR(400) columns having values that never exceed some smaller length (like a column for a person's name). In the database this isn't much of a problem, but since SAS does not have a VARCHAR data type, it creates a variable 400 characters wide for each row. Even with data set compression, this can really make the resulting SAS dataset unnecessarily large.
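If you know the real maximum length, one possible workaround is to set the column length yourself when you create the SAS dataset. This is a sketch only: the TDATA libref, the customers table, and the 50-byte length are assumptions for illustration, and any value longer than the chosen length would be truncated.

```sas
proc sql;
   create table work.customers_small as
   select customer_id,
          /* store the name in 50 bytes instead of the 400 implied by VARCHAR(400) */
          customer_name length=50 format=$50.
   from TDATA.customers;
quit;
```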

The alternative way is to use "explicit pass-through", where you write native queries using the actual syntax of the DBMS in question. These queries execute entirely on the DBMS and return results back to SAS (which still does the data type conversion for you). For example, here is a "pass-through" query that performs a join of two tables and creates a SAS dataset as a result:

proc sql;
   connect to teradata (user=userid password=password mode=teradata);
   create table mydata as
   select * from connection to teradata (
      select a.customer_id
           , a.customer_name
           , b.last_payment_date
           , b.last_payment_amt
      from base.customers a
      join base.invoices b
      on a.customer_id=b.customer_id
      where b.bill_month = date '2013-07-01'
        and b.paid_flag = 'N'
      );
quit;

Notice that everything inside the pair of parentheses is native Teradata SQL and that the join operation itself is running inside the database.
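Applied to the code in the question's EDIT, the same idea might look roughly like the sketch below. It assumes that mydb is also the name of the Teradata database (in the question it is a SAS libref, so the real database name may differ), and it replaces the intermediate output.id table with an IN subquery so that nothing has to be loaded from SAS into Teradata.

```sas
proc sql;
   connect to teradata (user=userid password=password mode=teradata);
   create table output.sums as
   select * from connection to teradata (
      select sv.id
           , sum(sv.sales) as total_sales
      from mydb.sale_volume sv
      where sv.trans_date between date '2013-01-01' and date '2013-06-01'
        and sv.id in (
              /* ids with at least one sale in the categories of interest */
              select id
              from mydb.sale_volume
              where category in ('a', 'b', 'c')
                and trans_date between date '2013-01-01' and date '2013-06-01'
            )
      group by sv.id
      );
   disconnect from teradata;
quit;
```

Only the summed rows come back to SAS, one per id.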

The example code you have shown in your question is NOT a complete, working example of a SAS/Teradata program. To better assist, you need to show the real program, including any library references. For example, suppose your real program looks like this:

proc sql;
   CREATE TABLE subset_data AS
   SELECT bigTable.id,
          SUM(bigTable.value) AS total
   FROM   TDATA.bigTable bigTable
   JOIN   TDATA.subset subset
   ON     subset.id = bigTable.id
   WHERE  bigTable.date BETWEEN a AND b
   GROUP BY bigTable.id
   ;
quit;

That would indicate a previously assigned LIBNAME statement through which SAS was connecting to Teradata. Whether SAS is even able to pass the complete query to Teradata depends heavily on the syntax of that WHERE clause. (Your example doesn't show what "a" and "b" refer to.) It is very possible that the only way SAS can perform the join is to drag both tables back into a local work session and perform the join on your SAS server.
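For reference, a complete version of that program might look like the sketch below. The server name, credentials, and database are placeholders, and the macro variables a and b are assumed to hold SAS date constants, as in the question's EDIT.

```sas
/* Placeholder credentials, server, and database names. */
libname tdata teradata user=userid password=password
        server=tdserv database=base;

/* Assumed meaning of "a" and "b": the start and end of the date window. */
%let a = '01Jan2013'd;
%let b = '01Jun2013'd;

proc sql;
   create table subset_data as
   select bigTable.id,
          sum(bigTable.value) as total
   from   tdata.bigTable bigTable
   join   tdata.subset subset
   on     subset.id = bigTable.id
   where  bigTable.date between &a and &b
   group by bigTable.id;
quit;
```

Because both tables sit behind the same libref here, SAS at least has the chance to hand the whole join to Teradata. If subset lived in WORK instead, the join could not run entirely in the database.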

One thing I can strongly suggest is that you try to convince your Teradata administrators to allow you to create "driver" tables in some utility database. The idea is that you would create a relatively small table inside Teradata containing the IDs you want to extract, then use that table to perform explicit joins. I'm sure you would need a bit more formal database training to do that (like how to define a proper index and how to "collect statistics"), but with that knowledge and ability, your work will just fly.
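As a rough sketch of the driver-table idea: utildb is a hypothetical utility database where you have been granted CREATE TABLE rights, the INTEGER type for id is a guess, and the table and column names are borrowed from the question's EDIT.

```sas
proc sql;
   connect to teradata (user=userid password=password mode=teradata);

   /* Small driver table holding just the ids of interest. */
   execute (
      create table utildb.driver_ids (id integer not null)
      unique primary index (id)
   ) by teradata;

   execute (
      insert into utildb.driver_ids
      select distinct id
      from mydb.sale_volume
      where category in ('a', 'b', 'c')
        and trans_date between date '2013-01-01' and date '2013-06-01'
   ) by teradata;

   /* Give the optimizer row counts and value distribution for the join. */
   execute (
      collect statistics on utildb.driver_ids column (id)
   ) by teradata;

   create table work.sums as
   select * from connection to teradata (
      select sv.id, sum(sv.sales) as total_sales
      from mydb.sale_volume sv
      join utildb.driver_ids d
        on d.id = sv.id
      where sv.trans_date between date '2013-01-01' and date '2013-06-01'
      group by sv.id
   );

   disconnect from teradata;
quit;
```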

I could go on and on but I'll stop here. I use SAS with Teradata extensively every day against what I'm told is one of the largest Teradata environments on the planet. I enjoy programming in both.

You imply an assumption that the 90k records in your first query are all unique ids. Is that definite?

I ask because the implication from your second query is that they're not unique.
- One id can have multiple values over time, and have different somevalues

If the ids are not unique in the first dataset, you need to GROUP BY id or use DISTINCT in the first query.

Imagine that the 90k rows consist of 30k unique ids, and so have an average of 3 rows per id.

And then imagine those 30k unique ids actually have 9 records each in your time window, including rows where somevalue <> x.

You will then get 3 x 9 = 27 joined records back per id.

And as those two numbers grow, the number of records in your second query grows as their product.
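A quick way to check for this, sketched against the subset table from the question, is to compare the row count with the distinct id count:

```sas
proc sql;
   /* If n_rows is larger than n_ids, the subset contains duplicate ids. */
   select count(*)           as n_rows,
          count(distinct id) as n_ids
   from subset;
quit;
```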


Alternative Query

If that's not the problem, an alternative query (which is not ideal, but possible) would be...

SELECT
  bigTable.id,
  SUM(bigTable.value) AS total
FROM
  bigTable
WHERE
  bigTable.date BETWEEN a AND b
GROUP BY
  bigTable.id
HAVING
  MAX(CASE WHEN bigTable.somevalue = x THEN 1 ELSE 0 END) = 1

If ID is unique and a single value, then you can try constructing a format.

Create a dataset that looks like this:

fmtname, start, label

where fmtname is the same for all records and is a legal format name (begins and ends with a letter, contains alphanumeric or _); start is the ID value; and label is a 1. Then add one row with the same value for fmtname, a blank start, a label of 0, and another variable, hlo='o' (for 'other'). Then import into proc format using the CNTLIN option, and you now have a 1/0 value conversion.

Here's a brief example using SASHELP.CLASS. ID here is name, but it can be numeric or character, whichever is right for your use.

data for_fmt;
set sashelp.class;
retain fmtname '$IDF'; *Format name is up to you.  Should have $ if ID is character, no $ if numeric;
start=name; *this would be your ID variable - the look up;
label='1';
output;
if _n_ = 1 then do;
  hlo='o';
  call missing(start);
  label='0';
  output;
end;
run;
proc format cntlin=for_fmt;
quit;

Now instead of doing a join, you can do your query 'normally' but with an additional where clause of and put(id,$IDF.)='1'. This won't be optimized with an index or anything, but it may be faster than the join. (It may also not be faster; it depends on how the SQL optimizer is working.)
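To see the lookup in action, here is a small sketch that assumes the $IDF format above has already been created. The test_ids dataset is made up for illustration: Alfred appears in SASHELP.CLASS, so he should map to '1', while Zelda does not and should fall through to the 'other' label '0'.

```sas
/* Hypothetical test data: one name that is in SASHELP.CLASS, one that is not. */
data test_ids;
   length name $8;
   input name $;
   datalines;
Alfred
Zelda
;
run;

proc sql;
   /* Show the 1/0 flag the format returns for each name. */
   select name,
          put(name, $IDF.) as in_list
   from test_ids;

   /* Use the flag as a filter instead of a join. */
   select name
   from test_ids
   where put(name, $IDF.) = '1';
quit;
```

Keep in mind that a user-defined format lives in SAS, so a filter like this would generally be evaluated on the SAS side rather than inside Teradata.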

If the id is unique you might add a UNIQUE PRIMARY INDEX(id) to that table, otherwise it defaults to a non-unique PI. Knowing about uniqueness helps the optimizer to produce a better plan.

Without more info like an Explain (just put EXPLAIN in front of the SELECT) it's hard to tell how this can be improved.
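From SAS, one way to get at the plan is to send the EXPLAIN through explicit pass-through, since Teradata returns the plan as rows of text. This is a sketch under that assumption, with placeholder credentials and the table names from the question's EDIT:

```sas
proc sql;
   connect to teradata (user=userid password=password mode=teradata);
   /* The plan text comes back as an ordinary result set. */
   select * from connection to teradata (
      explain
      select sv.id, sum(sv.sales) as total_sales
      from mydb.sale_volume sv
      where sv.trans_date between date '2013-01-01' and date '2013-06-01'
      group by sv.id
   );
   disconnect from teradata;
quit;
```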

One alternate solution is to use SAS procedures. I don't know what your actual SQL is doing, but if you're just doing frequencies (or something else that can be done in a PROC), you could do:

proc sql;
create view blah as select ... (your join);
quit;

proc freq data=blah;
tables id/out=summary(rename=(count=total) keep=id count);
run;

Or any number of other options (PROC MEANS, PROC TABULATE, etc.). That may be faster than doing the sum in SQL (depending on some details, such as how your data is organized, what you're actually doing, and how much memory you have available). It has the added benefit that SAS might choose to do this in-database, if you create the view in the database, which might be faster. (In fact, if you just run the freq off the base table and then join the results to the smaller table, it's possible that would be even faster.)
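Since the question is about summing a value rather than counting rows, a PROC SUMMARY version may be closer to what is needed. The sketch below uses SASHELP.CLASS so it runs anywhere; sex stands in for id and weight for the value being summed.

```sas
proc summary data=sashelp.class nway;
   class sex;     /* stands in for id */
   var weight;    /* stands in for the value being summed */
   /* One output row per class level, with the sum stored in total. */
   output out=summary(drop=_type_ _freq_) sum=total;
run;
```

Against your own data you would point data= at the view or table and swap in the real id and value variables.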
