Fill range table from internal table without duplicates

I wonder what's usually faster:

  • Filter out duplicates and then do the select
    or
  • Do the select directly with duplicates

I think it may be the first one, but I don't know: how do I nicely and efficiently integrate the deletion of duplicates into my code?

DATA:
  lt_itab       TYPE TABLE OF string,
  lt_range_itab TYPE RANGE OF string
  .

* Populating itab with duplicates
* Can the following somehow become a neat one-liner? This is ugly!
APPEND '1'  TO lt_itab.
APPEND '2'  TO lt_itab.
APPEND '2'  TO lt_itab.
APPEND '3'  TO lt_itab.
APPEND '4'  TO lt_itab.
APPEND '4'  TO lt_itab.

*Populating range table from itab
*Should one remove the duplicates here for a performance boost in the upcoming select?
*If so - how?
*-------------------------------------------------------
lt_range_itab = VALUE #(
  FOR <ls_itab> IN lt_itab
  ( sign = 'I'
    option = 'EQ'
    low = <ls_itab> )
).

*...or is such a select usually faster than the time it takes to remove the duplicates?
*-------------------------------------------------------
*SELECT *
*  FROM       anyTable
*  INTO TABLE @DATA(lt_new_data)
*  WHERE      anyProperty NOT IN @lt_range_itab.

So many questions... :-)

Edit: I revised this answer after the discussions in the answers.

What's usually faster?

Good databases will select fast even if there are some duplicates in the range. Oracle's optimizer, for example, removes duplicates on its own. SAP HANA, in comparison, may get slower, but its dictionary-based architecture will usually keep the slowdown at a negligible level. So generally, I see no need to remove duplicates before each and every query.

However, things may go awry if the optimizer is sub-optimal and there is a large number of duplicates. So if you expect duplicates, it may be better to stay on the safe side and remove them before the query.

Also note that range tables have a length restriction. They are translated to a SQL statement with an IN clause, and SQL statement strings have a maximum number of characters. Duplicate removal may thus be necessary to get the query working at all.

Longer ranges can be converted to FOR ALL ENTRIES, which packetizes the query and allows for much longer ranges. However, that statement form leads to multiple roundtrips to the database, and will definitely suffer from duplicates in the query.
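As a rough sketch of that conversion, the duplicates are removed from the driver table before it is handed to FOR ALL ENTRIES. The names zany_table, any_property and ty_key are hypothetical stand-ins, and the example assumes a positive (equality) filter on a fixed-length character column, so it is not a drop-in replacement for the NOT IN query in the question.

```abap
" Hypothetical names: zany_table and its column any_property stand in for
" the question's anyTable / anyProperty; ty_key is an assumed key type.
TYPES ty_key TYPE c LENGTH 10.
DATA lt_keys TYPE STANDARD TABLE OF ty_key WITH EMPTY KEY.

lt_keys = VALUE #( ( '1' ) ( '2' ) ( '2' ) ( '3' ) ( '4' ) ( '4' ) ).

" Remove duplicates first, so that no value is sent to the database twice.
SORT lt_keys BY table_line.
DELETE ADJACENT DUPLICATES FROM lt_keys COMPARING table_line.

" An empty driver table would make FOR ALL ENTRIES ignore the WHERE
" condition and read every row, hence the guard.
IF lt_keys IS NOT INITIAL.
  SELECT *
    FROM zany_table
    FOR ALL ENTRIES IN @lt_keys
    WHERE any_property = @lt_keys-table_line
    INTO TABLE @DATA(lt_new_data).
ENDIF.
```

The explicit BY table_line and COMPARING table_line are used because the table is declared with an empty key, where a plain SORT would have nothing to sort by.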

Can the following somehow become a neat one-liner?

DATA(lt_itab) = VALUE string_table( ( `1` ) ( `2` ) ( `2` ) ( `3` ) ( `4` ) ( `4` ) ).

Or right away, as suggested by Sandra below:

SELECT ... WHERE anyProperty NOT IN ('1','2','3','4')

If so - how?

SORT lt_range_itab.
DELETE ADJACENT DUPLICATES FROM lt_range_itab.
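An alternative is to remove the duplicates from the source table and only then build the range table, so the range never contains them in the first place. The following is a minimal sketch of that idea using the question's variable names; lt_unique is a helper name introduced here for illustration.

```abap
DATA(lt_itab) = VALUE string_table( ( `1` ) ( `2` ) ( `2` ) ( `3` ) ( `4` ) ( `4` ) ).

" Work on a copy (lt_unique is an illustrative name) so that lt_itab
" keeps its original content and order.
DATA(lt_unique) = lt_itab.
SORT lt_unique BY table_line.
DELETE ADJACENT DUPLICATES FROM lt_unique COMPARING table_line.

" Build the range table from the deduplicated values only.
DATA lt_range_itab TYPE RANGE OF string.
lt_range_itab = VALUE #(
  FOR lv_value IN lt_unique
  ( sign   = 'I'
    option = 'EQ'
    low    = lv_value )
).
```

DELETE ADJACENT DUPLICATES only removes neighbouring duplicates, which is why the SORT has to come first.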

Last but not least, note that anyProperty IN @lt_range_itab may be considerably faster than the reversed NOT IN variant. Databases tend to keep positive indexes, which respond best to positive queries. If you have the possibility, e.g. because you're filtering a field with a fixed value list, it may be worthwhile to reverse the filter before sending it to the database.
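Reversing the filter could look roughly like the sketch below. It assumes the complete list of possible values is known; lt_all_values, lt_excluded and lt_positive_range are hypothetical names, and the value list itself is made up for illustration.

```abap
" Assumption: the field can only take the values listed in lt_all_values.
DATA(lt_all_values) = VALUE string_table( ( `1` ) ( `2` ) ( `3` ) ( `4` ) ( `5` ) ( `6` ) ).
DATA(lt_excluded)   = VALUE string_table( ( `1` ) ( `2` ) ( `3` ) ( `4` ) ).

" Collect every allowed value that is not excluded, turning the
" NOT IN filter into a positive IN filter.
DATA lt_positive_range TYPE RANGE OF string.
LOOP AT lt_all_values INTO DATA(lv_value).
  IF NOT line_exists( lt_excluded[ table_line = lv_value ] ).
    APPEND VALUE #( sign = 'I' option = 'EQ' low = lv_value ) TO lt_positive_range.
  ENDIF.
ENDLOOP.
```

If every value ends up excluded, lt_positive_range stays empty, and an empty range table in an IN condition does not restrict the selection at all, so that case needs separate handling before the SELECT.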

The first one is faster of course, because database operations on physical disks are much slower than in-memory operations.

Whether the impact on performance is noticeable is another question; it depends on the number of duplicate selections and on the volume of data.

One well-known example is the SELECT ... FOR ALL ENTRIES construct, which can have a big impact on performance if the duplicates are not removed, because ABAP internally converts it into several SELECT statements, and so the same data may be selected several times (the duplicate rows are removed on the ABAP side afterwards).

In brief, unless you are sure the data volume is small, make sure there are no duplicates before a SELECT.
