
Removing “Using temporary; Using filesort” from this MySQL select+join+group by

I have the following query:

select 
    t.Chunk as LeftChunk,
    t.ChunkHash as LeftChunkHash,
    q.Chunk as RightChunk,
    q.ChunkHash as RightChunkHash,
    count(t.ChunkHash) as ChunkCount
from
    chunks as t
    join
    chunks as q
    on
        t.ID = q.ID
group by LeftChunkHash, RightChunkHash

And the following explain table:

id  select_type  table  type  possible_keys  key      key_len  ref          rows      Extra
1   SIMPLE       t      ALL   IDIndex        NULL     NULL     NULL         17796190  Using temporary; Using filesort
1   SIMPLE       q      ref   IDIndex        IDIndex  4        sotero.t.Id  12

Note the "Using temporary; Using filesort" in the Extra column.

When this query runs, I quickly run out of RAM (presumably because of the temp table), then the HDD kicks in and the query grinds to a halt.

I thought it might be an index issue, so I started adding a few that sort of made sense:

Table   Non_unique  Key_name                   Seq_in_index  Column_name  Collation  Cardinality  Sub_part  Packed  Null  Index_type  Comment  Index_comment
chunks  0           PRIMARY                    1             ChunkId      A          17796190     NULL      NULL          BTREE
chunks  1           ChunkHashIndex             1             ChunkHash    A          243783       NULL      NULL          BTREE
chunks  1           IDIndex                    1             Id           A          1483015      NULL      NULL          BTREE
chunks  1           ChunkIndex                 1             Chunk        A          243783       NULL      NULL          BTREE
chunks  1           ChunkTypeIndex             1             ChunkType    A          2            NULL      NULL          BTREE
chunks  1           chunkHashByChunkIDIndex    1             ChunkHash    A          243783       NULL      NULL          BTREE
chunks  1           chunkHashByChunkIDIndex    2             ChunkId      A          17796190     NULL      NULL          BTREE
chunks  1           chunkHashByChunkTypeIndex  1             ChunkHash    A          243783       NULL      NULL          BTREE
chunks  1           chunkHashByChunkTypeIndex  2             ChunkType    A          261708       NULL      NULL          BTREE
chunks  1           chunkHashByIDIndex         1             ChunkHash    A          243783       NULL      NULL          BTREE
chunks  1           chunkHashByIDIndex         2             Id           A          17796190     NULL      NULL          BTREE

But the query still uses the temporary table.

The db engine is MyISAM.

How can I get rid of the "Using temporary; Using filesort" in this query?

Just changing to InnoDB without explaining the underlying cause is not a particularly satisfying answer. Besides, if the solution is just to add the proper index, that's much easier than migrating to another db engine.

I am new to relational databases. So I'm hoping that the solution is something obvious to the experts.

EDIT1:

ID is not the primary key. ChunkID is. There are approximately 40 ChunkIDs for each ID. So adding an additional ID to the table adds about 40 rows. Each unique chunk has a unique chunkHash associated with it.

EDIT2:

Here's the schema:

Field      Type          Null  Key  Default  Extra
ChunkId    int(11)       NO    PRI  NULL
ChunkHash  int(11)       NO    MUL  NULL
Id         int(11)       NO    MUL  NULL
Chunk      varchar(255)  NO    MUL  NULL
ChunkType  varchar(255)  NO    MUL  NULL

EDIT 3:

The end objective of the query is to create a table of word co-occurrences across documents. ChunkIDs are word instances. Each instance is a word that is associated with a particular document (ID). There are about 40 words per document and about 1 million documents. So the resulting table of co-occurrences is highly compressed compared to the full cross-product temporary table that is (apparently) being created. That is, the full cross-product temp table is 1 million * 40 * 40 = 1.6 billion rows, while the compressed resulting table is estimated at about 40 million rows.
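To make the size estimate above concrete, here is a small Python sketch of the arithmetic. It assumes, as the numbers in this edit suggest, that the self-join on ID produces n * n rows for a document with n word instances. The function name `join_rows` is made up for illustration.

```python
def join_rows(instances_per_document):
    """Rows produced by joining chunks to itself on Id.

    A document with n word instances pairs every instance with every
    instance of the same document (itself included), giving n * n rows.
    """
    return sum(n * n for n in instances_per_document)


# Estimate from the question: about 1 million documents, about 40 words each.
estimated_rows = 1_000_000 * 40 * 40

# The same formula on a toy case: one document with 3 instances, one with 2.
toy_rows = join_rows([3, 2])

print(estimated_rows, toy_rows)
```

The join output grows with the square of the document length, which is why the intermediate result is so much larger than the 17.8 million row source table.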

EDIT 4:

Adding postgresql tag to see if any postgresql users can get a better execution plan on that SQL implementation. If that's the case, I'll switch over.

Updated with a query that produces the same results. It won't be any faster though.

Create Index IX_ID On Chunks (ID);

Select
  LeftChunk,
  LeftChunkHash,
  RightChunk,
  RightChunkHash,
  Sum(ChunkCount)
From (
  Select 
    t.Chunk as LeftChunk,
    t.ChunkHash as LeftChunkHash,
    q.Chunk as RightChunk,
    q.ChunkHash as RightChunkHash,
    count(t.ChunkHash) as ChunkCount
  From
    chunks as t
      inner join
    chunks as q
      on t.ID = q.ID
  Group By
    t.ID,
    t.ChunkHash,
    q.ChunkHash 
  ) x
Group By
  LeftChunk,
  LeftChunkHash,
  RightChunk,
  RightChunkHash
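A quick way to check the "same results" claim is to run both shapes of the query against a few rows. The sketch below uses Python's built-in `sqlite3` rather than MySQL, so it says nothing about the MySQL execution plan or speed. The toy rows are invented, and the Chunk columns are left out of the select list to keep the comparison short.

```python
import sqlite3

# Invented toy data: (ChunkId, ChunkHash, Id, Chunk)
ROWS = [
    (1, 10, 1, "a"),
    (2, 10, 1, "a"),
    (3, 20, 1, "b"),
    (4, 10, 2, "a"),
    (5, 30, 2, "c"),
]

# Same shape as the query in the question (hash columns only).
ORIGINAL = """
    select t.ChunkHash, q.ChunkHash, count(t.ChunkHash)
    from chunks as t join chunks as q on t.Id = q.Id
    group by t.ChunkHash, q.ChunkHash
    order by 1, 2
"""

# Same shape as the two-level query above: count per Id first, then sum.
TWO_STAGE = """
    select LeftChunkHash, RightChunkHash, sum(ChunkCount)
    from (
        select t.ChunkHash as LeftChunkHash,
               q.ChunkHash as RightChunkHash,
               count(t.ChunkHash) as ChunkCount
        from chunks as t join chunks as q on t.Id = q.Id
        group by t.Id, t.ChunkHash, q.ChunkHash
    ) x
    group by LeftChunkHash, RightChunkHash
    order by 1, 2
"""


def run(sql):
    """Load the toy rows into a fresh in-memory database and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table chunks ("
        "ChunkId integer primary key, ChunkHash integer, Id integer, Chunk text)"
    )
    conn.executemany("insert into chunks values (?, ?, ?, ?)", ROWS)
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


original_result = run(ORIGINAL)
two_stage_result = run(TWO_STAGE)
print(original_result == two_stage_result)
```

On this data both queries return the same seven (left hash, right hash, count) rows, which is consistent with the note that the rewrite changes the grouping order but not the answer.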

Fiddle with example test data http://sqlfiddle.com/#!3/ea1a5/2

Latest Fiddle, with the problem reformulated as words and documents: http://sqlfiddle.com/#!3/f5aef/12

With the problem reformulated as documents and words, how many documents do you have, how many words, and how many document words?

Also, using the documents and words analogy, would you say your query is: "For all pairs of words that appear in a document together, how often do they appear together in any document? If word A appears n times in a document and word B m times in the same document, then this counts as n * m times in the total."
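The n * m reading above can be spelled out in a few lines of Python. This is only an illustration of that interpretation, not a replacement for the SQL. The function name `cooccurrence` and the sample documents are invented. Like the self-join on ID, it also counts a word paired with itself.

```python
from collections import Counter
from itertools import product


def cooccurrence(documents):
    """Count ordered word pairs; n occurrences of a and m of b add n * m."""
    totals = Counter()
    for words in documents:
        per_document = Counter(words)
        for (a, n), (b, m) in product(per_document.items(), repeat=2):
            totals[(a, b)] += n * m
    return totals


# Invented example: A appears twice and B three times in the first document,
# and each appears once in the second.
documents = [["A", "A", "B", "B", "B"], ["A", "B"]]
counts = cooccurrence(documents)
print(counts[("A", "B")])
```

Here the pair (A, B) gets 2 * 3 from the first document and 1 * 1 from the second.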

How about summarizing the table before the join?

The summary might be:

 select count(*) count,
        Chunk,
        ChunkHash
   from chunks
  group by Chunk, ChunkHash

Then the join would be:

Select r.Chunk as RightChunk,
       r.ChunkHash as RightChunkHash,
       l.Chunk as LeftChunk,
       l.ChunkHash as LeftChunkHash,
       sum(l.Count) + sum(r.Count) as Count
  from (
        select count(*) count,
               Chunk,
               ChunkHash
          from chunks
      group by Chunk, ChunkHash
       ) l
  join (
        select count(*) count,
               Chunk,
               ChunkHash
          from chunks
      group by Chunk, ChunkHash
       ) r on l.Chunk = r.Chunk
 group by r.Chunk, r.ChunkHash, l.Chunk, l.ChunkHash

The thing I'm not sure about is what you're counting, exactly. So my SUM() + SUM() is a guess. You might want SUM() * SUM().

Also, I'm assuming that two Chunk values are equal if and only if their ChunkHash values are equal.
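If the intended count is the n * m one discussed in the comments, one variant of the summarize-first idea keeps the document ID in the summary and multiplies the per-document counts. The sketch below is not the query above: grouping by (Id, ChunkHash), joining on Id, and using `sum(l.n * r.n)` are all assumptions about what the question wants. It runs on Python's built-in `sqlite3` with invented rows, so it only checks that the results agree and makes no claim about MySQL's plan.

```python
import sqlite3

# Invented toy data: (ChunkId, ChunkHash, Id, Chunk)
ROWS = [
    (1, 10, 1, "a"),
    (2, 10, 1, "a"),
    (3, 20, 1, "b"),
    (4, 10, 2, "a"),
    (5, 30, 2, "c"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table chunks ("
    "ChunkId integer primary key, ChunkHash integer, Id integer, Chunk text)"
)
conn.executemany("insert into chunks values (?, ?, ?, ?)", ROWS)

# Direct self-join, same shape as the query in the question.
DIRECT = """
    select t.ChunkHash, q.ChunkHash, count(*)
    from chunks as t join chunks as q on t.Id = q.Id
    group by t.ChunkHash, q.ChunkHash
    order by 1, 2
"""

# Assumed variant: summarize per (Id, ChunkHash) first, then join the
# summaries on Id and multiply the per-document counts.
SUMMARY = "select Id, ChunkHash, count(*) as n from chunks group by Id, ChunkHash"
SUMMARIZED = f"""
    select l.ChunkHash, r.ChunkHash, sum(l.n * r.n)
    from ({SUMMARY}) l join ({SUMMARY}) r on l.Id = r.Id
    group by l.ChunkHash, r.ChunkHash
    order by 1, 2
"""

direct = conn.execute(DIRECT).fetchall()
summarized = conn.execute(SUMMARIZED).fetchall()

# Rows each join has to produce before the final group by.
direct_join_rows = conn.execute(
    "select count(*) from chunks t join chunks q on t.Id = q.Id"
).fetchone()[0]
summarized_join_rows = conn.execute(
    f"select count(*) from ({SUMMARY}) l join ({SUMMARY}) r on l.Id = r.Id"
).fetchone()[0]
conn.close()

print(direct == summarized, direct_join_rows, summarized_join_rows)
```

The summary only shrinks the join when a word repeats within a document. If every word instance in a document is distinct, the summarized join is as large as the direct one.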

I migrated from MySQL to PostgreSQL, and query execution time went from ~1.5 days to ~10 mins.

Here's the PostgreSQL query execution plan:

[image: PostgreSQL query execution plan]

I am no longer using MySQL.
