Optimizing a query that returns a lot of records: a way to avoid hundreds of joins. Is it a smart solution?

I am not very experienced with SQL and I have the following doubt about how to optimize a query. I am using MySQL.

I have this DB schema:

[image: DB schema diagram]

And this is the query that returns the last price (the row with the latest date in the Market_Commodity_Price_Series table) of a specific commodity in a specific market.

It contains a lot of joins to retrieve all the related information:

SELECT MCPS.id AS series_id,
        MD_CD.market_details_id AS market_id,
        MD_CD.commodity_details_id AS commodity_id,
        MD.market_name AS market_name,
        MCPS.price_date AS price_date,
        MCPS.avg_price AS avg_price,
        CU.ISO_4217_cod AS currency, 
        MU.unit_name AS measure_unit, 
        CD.commodity_name_en,
        CN.commodity_name 
FROM Market_Commodity_Price_Series AS MCPS
INNER JOIN MeasureUnit AS MU ON MCPS.measure_unit_id = MU.id
INNER JOIN Currency AS CU ON MCPS.currency_id = CU.id
INNER JOIN MarketDetails_CommodityDetails AS MD_CD ON MCPS.market_commodity_details_id = MD_CD.id
INNER JOIN MarketDetails AS MD ON MD_CD.market_details_id = MD.id
INNER JOIN CommodityDetails AS CD ON MD_CD.commodity_details_id = CD.id
INNER JOIN CommodityName AS CN ON CD.id = CN.commodity_details_id
INNER JOIN Languages AS LN ON CN.language_id  = LN.id
WHERE MD.id = 4
AND CD.id = 4 
AND LN.id=1
ORDER BY price_date DESC LIMIT 1

My doubt is: with the previous query I am extracting all the records related to a specific commodity in a specific market from the Market_Commodity_Price_Series table, doing a lot of joins, ordering these records by the price_date field, and limiting the result to the last one.

I think that it could be expensive because I can have a lot of records (the Market_Commodity_Price_Series table contains daily information).

This query works, but I think it can be done in a smarter way.

So I thought that I could do something like this:

1) Select the record related to the last price of a specific commodity in a specific market using a query like this:

SELECT measure_unit_id, 
        currency_id, 
        market_commodity_details_id, 
        MAX(price_date) price_date
FROM Market_Commodity_Price_Series  AS MCPS 
INNER JOIN MarketDetails_CommodityDetails AS MD_CD ON MCPS.market_commodity_details_id = MD_CD.id
WHERE MD_CD.market_details_id = 4
AND MD_CD.commodity_details_id = 4
GROUP BY measure_unit_id, currency_id, market_commodity_details_id

that returns the single record related to this information:

measure_unit_id      currency_id          market_commodity_details_id price_date
--------------------------------------------------------------------------------
1                    2                    24                          05/10/2017

2) Use this output like a table (I don't know the exact name, maybe a view?) and join this "table" to the other required information in the MeasureUnit, Currency, MarketDetails, CommodityDetails, CommodityName and Languages tables.

I think that it could be better because in this way I am using MAX(price_date) to extract only the record related to the latest price from Market_Commodity_Price_Series, instead of obtaining all the records, ordering them and limiting to the latest one.

Furthermore, most of the JOIN operations are done on the single record returned by the previous query and not on all the records returned by the first version of my query (potentially hundreds or thousands).

Could this be a smart solution?

If yes, what is the correct syntax to join the output of this query (considering it as a table) with the other tables?

JOINs -- particularly on primary keys -- are not necessarily expensive. It looks like your joins are following the data model.

I wouldn't start optimizing the query without understanding its performance characteristics. How long does it take to run? How many records are being sorted to get the most recent?
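One way to answer the second question is to count the rows that the ORDER BY has to consider for this market/commodity pair. This is a minimal sketch that reuses the tables and the ids (4 and 4) from the question; the alias rows_to_sort is only an illustrative name.

```sql
-- Count the price rows that the original query has to sort
-- for market 4 / commodity 4 (ids taken from the question).
SELECT COUNT(*) AS rows_to_sort
FROM Market_Commodity_Price_Series AS MCPS
INNER JOIN MarketDetails_CommodityDetails AS MD_CD
        ON MCPS.market_commodity_details_id = MD_CD.id
WHERE MD_CD.market_details_id = 4
  AND MD_CD.commodity_details_id = 4;
```

If this number is in the hundreds or low thousands, sorting it is unlikely to be the bottleneck.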

Your WHERE clause appears to be limiting the data considerably. You can also set up an index to help with the WHERE clause -- however, because the fields come from different tables, it can be tricky to use indexes for all of them.

You have a complicated data model that is a bit difficult to follow. It seems possible that you are getting a Cartesian product due to multiple n-m relationships. If so, that can have a big impact on performance, and pre-aggregating the data along each dimension is the way to go.

However, I wouldn't start optimizing the query without understanding how the current one behaves.
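As for the syntax asked about in the question: a pre-aggregated result can be joined like a table by wrapping it in parentheses and giving it an alias (a derived table). The sketch below is one possible shape, not a tested replacement. It assumes there is at most one price row per market_commodity_details_id and price_date; if that does not hold, it returns more than one row. The aliases latest, S, P and last_price_date are made up for illustration.

```sql
SELECT MCPS.id AS series_id,
       MD_CD.market_details_id AS market_id,
       MD_CD.commodity_details_id AS commodity_id,
       MD.market_name AS market_name,
       MCPS.price_date AS price_date,
       MCPS.avg_price AS avg_price,
       CU.ISO_4217_cod AS currency,
       MU.unit_name AS measure_unit,
       CD.commodity_name_en,
       CN.commodity_name
FROM (
    -- Derived table: only the latest date for the chosen pair.
    SELECT S.market_commodity_details_id,
           MAX(S.price_date) AS last_price_date
    FROM Market_Commodity_Price_Series AS S
    INNER JOIN MarketDetails_CommodityDetails AS P
            ON S.market_commodity_details_id = P.id
    WHERE P.market_details_id = 4
      AND P.commodity_details_id = 4
    GROUP BY S.market_commodity_details_id
) AS latest
-- Join back to the series table to fetch the full latest row.
INNER JOIN Market_Commodity_Price_Series AS MCPS
        ON MCPS.market_commodity_details_id = latest.market_commodity_details_id
       AND MCPS.price_date = latest.last_price_date
INNER JOIN MeasureUnit AS MU ON MCPS.measure_unit_id = MU.id
INNER JOIN Currency AS CU ON MCPS.currency_id = CU.id
INNER JOIN MarketDetails_CommodityDetails AS MD_CD ON MCPS.market_commodity_details_id = MD_CD.id
INNER JOIN MarketDetails AS MD ON MD_CD.market_details_id = MD.id
INNER JOIN CommodityDetails AS CD ON MD_CD.commodity_details_id = CD.id
INNER JOIN CommodityName AS CN ON CD.id = CN.commodity_details_id
INNER JOIN Languages AS LN ON CN.language_id = LN.id
WHERE LN.id = 1;
```

Unlike the GROUP BY in the question, the derived table groups only by market_commodity_details_id, so a change of currency or measure unit over time does not produce extra groups. Whether this is actually faster than the original query depends on the data and indexes, so it is worth comparing the two before switching.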

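One approach is to make a separate read-model table, as in the CQRS approach, that contains all the required attributes, is used only for selects, and needs no joins. However, you will need to update the read-model table every time one of the other tables changes. Another option is to create a view.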
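The view variant could be sketched as follows. The view name Commodity_Price_View is hypothetical, and the column list is simply the one from the question's query plus language_id so that callers can filter on it. Note that a MySQL view is not stored: it still runs the joins on every read. A real read-model table would have to be populated and kept in sync by the application or by triggers.

```sql
-- Hypothetical view that hides the joins from the caller.
CREATE VIEW Commodity_Price_View AS
SELECT MCPS.id AS series_id,
       MD_CD.market_details_id AS market_id,
       MD_CD.commodity_details_id AS commodity_id,
       MD.market_name AS market_name,
       MCPS.price_date AS price_date,
       MCPS.avg_price AS avg_price,
       CU.ISO_4217_cod AS currency,
       MU.unit_name AS measure_unit,
       CD.commodity_name_en AS commodity_name_en,
       CN.commodity_name AS commodity_name,
       CN.language_id AS language_id
FROM Market_Commodity_Price_Series AS MCPS
INNER JOIN MeasureUnit AS MU ON MCPS.measure_unit_id = MU.id
INNER JOIN Currency AS CU ON MCPS.currency_id = CU.id
INNER JOIN MarketDetails_CommodityDetails AS MD_CD ON MCPS.market_commodity_details_id = MD_CD.id
INNER JOIN MarketDetails AS MD ON MD_CD.market_details_id = MD.id
INNER JOIN CommodityDetails AS CD ON MD_CD.commodity_details_id = CD.id
INNER JOIN CommodityName AS CN ON CD.id = CN.commodity_details_id;

-- Usage: latest price of commodity 4 in market 4, names in language 1.
SELECT *
FROM Commodity_Price_View
WHERE market_id = 4
  AND commodity_id = 4
  AND language_id = 1
ORDER BY price_date DESC
LIMIT 1;
```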

You've done a reasonably good job of writing an efficient query.

You didn't use SELECT *, which can mess up performance in a query with lots of joins, because it generates bloated and redundant intermediate result sets. But your intermediate result set -- the one you apply ORDER BY to -- is not bloated.

Your WHERE col = val clauses mostly mention primary keys of tables (I guess). That's good.

Your big table Market_Commodity_Price_Series could maybe use a compound covering index. Similarly, some other tables may need that kind of index. But that should be the topic of another question.
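As a rough illustration of what such an index might look like, here is a sketch. The index name is invented, and the column order is an assumption based on the question's query: the equality column first, then the sort column, then the remaining columns the query reads from this table. It also assumes the table uses InnoDB, where the primary key (id) is stored in every secondary index.

```sql
-- Hypothetical compound covering index for the "latest price" query.
CREATE INDEX idx_mcps_pair_date
    ON Market_Commodity_Price_Series
       (market_commodity_details_id,  -- filter / join column
        price_date,                   -- ORDER BY column
        avg_price,                    -- remaining columns read by the query
        currency_id,
        measure_unit_id);
```

With the first two columns in that order, MySQL can in principle find the newest row for one market_commodity_details_id by reading the end of the matching index range instead of sorting. Whether the optimizer does so for the full multi-join query is something to confirm with EXPLAIN.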

Your proposed optimization -- ordering an intermediate result set consisting mostly of id values -- would help a lot if you were doing ORDER BY ... LIMIT and using the LIMIT clause to discard most of your results. But you are not doing that.

Without knowing more about your data, it's hard to offer a crisp opinion. But if it were me, I'd use your first query. I'd keep an eye on it (and on your other complex queries) as you go into production. When (not if) performance starts to deteriorate, you can run EXPLAIN and figure out the best way to index your tables. You've done a good job of writing a query that will get your application up and running. Go with it!
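For reference, running EXPLAIN only requires prefixing the statement with the keyword. The sketch below applies it to the first query from the question, unchanged.

```sql
-- Show the execution plan instead of running the query.
EXPLAIN
SELECT MCPS.id AS series_id,
       MD_CD.market_details_id AS market_id,
       MD_CD.commodity_details_id AS commodity_id,
       MD.market_name AS market_name,
       MCPS.price_date AS price_date,
       MCPS.avg_price AS avg_price,
       CU.ISO_4217_cod AS currency,
       MU.unit_name AS measure_unit,
       CD.commodity_name_en,
       CN.commodity_name
FROM Market_Commodity_Price_Series AS MCPS
INNER JOIN MeasureUnit AS MU ON MCPS.measure_unit_id = MU.id
INNER JOIN Currency AS CU ON MCPS.currency_id = CU.id
INNER JOIN MarketDetails_CommodityDetails AS MD_CD ON MCPS.market_commodity_details_id = MD_CD.id
INNER JOIN MarketDetails AS MD ON MD_CD.market_details_id = MD.id
INNER JOIN CommodityDetails AS CD ON MD_CD.commodity_details_id = CD.id
INNER JOIN CommodityName AS CN ON CD.id = CN.commodity_details_id
INNER JOIN Languages AS LN ON CN.language_id = LN.id
WHERE MD.id = 4
  AND CD.id = 4
  AND LN.id = 1
ORDER BY price_date DESC
LIMIT 1;
```

In the output, the type, key and rows columns show how each table is accessed and roughly how many rows are examined. "Using filesort" or "Using temporary" in the Extra column is a hint that the sort is not being served by an index.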
