I have the following query:
select
t.Chunk as LeftChunk,
t.ChunkHash as LeftChunkHash,
q.Chunk as RightChunk,
q.ChunkHash as RightChunkHash,
count(t.ChunkHash) as ChunkCount
from
chunks as t
join
chunks as q
on
t.ID = q.ID
group by LeftChunkHash, RightChunkHash
And the following explain table:
id  select_type  table  type  possible_keys  key      key_len  ref          rows      Extra
1   SIMPLE       t      ALL   IDIndex        NULL     NULL     NULL         17796190  Using temporary; Using filesort
1   SIMPLE       q      ref   IDIndex        IDIndex  4        sotero.t.Id  12
Note the "Using temporary; Using filesort" in the Extra column.
When this query is run, I quickly run out of RAM (presumably b/c of the temp table), and then the HDD kicks in, and the query slows to a halt.
I thought it might be an index issue, so I started adding a few that sort of made sense:
Table   Non_unique  Key_name                   Seq_in_index  Column_name  Collation  Cardinality  Sub_part  Packed  Null  Index_type  Comment  Index_comment
chunks  0           PRIMARY                    1             ChunkId      A          17796190     NULL      NULL          BTREE
chunks  1           ChunkHashIndex             1             ChunkHash    A          243783       NULL      NULL          BTREE
chunks  1           IDIndex                    1             Id           A          1483015      NULL      NULL          BTREE
chunks  1           ChunkIndex                 1             Chunk        A          243783       NULL      NULL          BTREE
chunks  1           ChunkTypeIndex             1             ChunkType    A          2            NULL      NULL          BTREE
chunks  1           chunkHashByChunkIDIndex    1             ChunkHash    A          243783       NULL      NULL          BTREE
chunks  1           chunkHashByChunkIDIndex    2             ChunkId      A          17796190     NULL      NULL          BTREE
chunks  1           chunkHashByChunkTypeIndex  1             ChunkHash    A          243783       NULL      NULL          BTREE
chunks  1           chunkHashByChunkTypeIndex  2             ChunkType    A          261708       NULL      NULL          BTREE
chunks  1           chunkHashByIDIndex         1             ChunkHash    A          243783       NULL      NULL          BTREE
chunks  1           chunkHashByIDIndex         2             Id           A          17796190     NULL      NULL          BTREE
But the query still uses the temporary table.
The db engine is MyISAM.
How can I get rid of the using temporary; using filesort in this query?
Just changing to InnoDB w/o explaining the underlying cause is not a particularly satisfying answer. Besides, if the solution is to just add the proper index, then that's much easier than migrating to another db engine.
I am new to relational databases. So I'm hoping that the solution is something obvious to the experts.
EDIT1:
ID is not the primary key. ChunkID is. There are approximately 40 ChunkIDs for each ID. So adding an additional ID to the table adds about 40 rows. Each unique chunk has a unique chunkHash associated with it.
EDIT2:
Here's the schema:
Field      Type          Null  Key  Default  Extra
ChunkId    int(11)       NO    PRI  NULL
ChunkHash  int(11)       NO    MUL  NULL
Id         int(11)       NO    MUL  NULL
Chunk      varchar(255)  NO    MUL  NULL
ChunkType  varchar(255)  NO    MUL  NULL
EDIT 3:
The end objective of the query is to create a table of word co-occurrences across documents. ChunkIDs are word instances. Each instance is a word that is associated with a particular document (ID). About 40 words present per document. About 1 million documents. So the resulting table of co-occurrences is highly compressed compared to the full cross-product temporary table that is (apparently) being created. That is, the full cross-product temp table is 1 mil * 40 * 40 = 1.6 billion rows. The compressed resulting table is estimated at about 40 million rows.
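The size estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the approximate figures quoted in EDIT 3 (about 1 million documents, about 40 word instances each, an estimated 40 million grouped rows); the variable names are made up for illustration.

```python
# Approximate figures quoted in the question (EDIT 3).
documents = 1_000_000
words_per_document = 40

# Joining chunks to itself on Id pairs every word instance in a document
# with every word instance in the same document (itself included).
joined_rows = documents * words_per_document ** 2

# Estimated size of the grouped co-occurrence table, as stated in the question.
grouped_rows = 40_000_000

print(joined_rows)                  # 1600000000
print(joined_rows // grouped_rows)  # 40
```

So the intermediate join is roughly 40 times larger than the result that is actually wanted, which is consistent with the temporary table being the part that exhausts RAM.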
EDIT 4:
Adding postgresql tag to see if any postgresql users can get a better execution plan on that SQL implementation. If that's the case, I'll switch over.
Updated with a query that produces the same results. It won't be any faster though.
Create Index IX_ID On Chunks (ID);
Select
LeftChunk,
LeftChunkHash,
RightChunk,
RightChunkHash,
Sum(ChunkCount) as ChunkCount
From (
Select
t.Chunk as LeftChunk,
t.ChunkHash as LeftChunkHash,
q.Chunk as RightChunk,
q.ChunkHash as RightChunkHash,
count(t.ChunkHash) as ChunkCount
From
chunks as t
inner join
chunks as q
on t.ID = q.ID
Group By
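t.ID,
-- Chunk columns are listed so the select list is valid under strict GROUP BY rules
t.Chunk,
t.ChunkHash,
q.Chunk,
q.ChunkHash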
) x
Group By
LeftChunk,
LeftChunkHash,
RightChunk,
RightChunkHash
Fiddle with example test data http://sqlfiddle.com/#!3/ea1a5/2
Latest Fiddle, with the problem reformulated as words and documents: http://sqlfiddle.com/#!3/f5aef/12
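If you want to convince yourself that the two-stage query really returns the same rows as the original, here is a small check using Python's built-in sqlite3 module. It is a sketch, not a benchmark: the five sample rows are made up, SQLite stands in for MySQL, and Chunk is added to the GROUP BY lists so the SQL is also valid on stricter engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks ("
    "ChunkId INTEGER PRIMARY KEY, ChunkHash INTEGER, Id INTEGER, Chunk TEXT)"
)
# Hypothetical data: document 1 holds a, b, a; document 2 holds a, b.
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "a"), (2, 2, 1, "b"), (3, 1, 1, "a"),
     (4, 1, 2, "a"), (5, 2, 2, "b")],
)

SINGLE_PASS = """
SELECT t.Chunk, t.ChunkHash, q.Chunk, q.ChunkHash, COUNT(t.ChunkHash)
FROM chunks AS t JOIN chunks AS q ON t.Id = q.Id
GROUP BY t.Chunk, t.ChunkHash, q.Chunk, q.ChunkHash
"""

TWO_STAGE = """
SELECT LeftChunk, LeftChunkHash, RightChunk, RightChunkHash, SUM(ChunkCount)
FROM (
    SELECT t.Chunk AS LeftChunk, t.ChunkHash AS LeftChunkHash,
           q.Chunk AS RightChunk, q.ChunkHash AS RightChunkHash,
           COUNT(t.ChunkHash) AS ChunkCount
    FROM chunks AS t JOIN chunks AS q ON t.Id = q.Id
    GROUP BY t.Id, t.Chunk, t.ChunkHash, q.Chunk, q.ChunkHash
) AS x
GROUP BY LeftChunk, LeftChunkHash, RightChunk, RightChunkHash
"""

# Sort in Python so the comparison does not depend on row order.
single_pass = sorted(conn.execute(SINGLE_PASS).fetchall())
two_stage = sorted(conn.execute(TWO_STAGE).fetchall())

print(single_pass == two_stage)  # True
for row in two_stage:
    print(row)
```

Both queries still build the full per-document cross product, so this only confirms equivalence of the results; it says nothing about speed.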
With the problem reformulated as documents and words, how many documents do you have, how many words, and how many document words?
Also, using the documents and words analogy, would you say your query is "For all pairs of words that appear in a document together, how often do they appear together in any document. If word A appears n times in a document and word B m times in the same document, then this counts as n * m times in the total."
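To make that reading concrete, here is a plain-Python sketch of the counting rule as I understand it. The function name and the sample documents are invented for illustration, and the n * m rule is my interpretation of the query, not something confirmed by the asker.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(documents):
    """Sum n * m over documents for every ordered pair of words,
    where n and m are the per-document counts of each word."""
    totals = Counter()
    for words in documents:
        per_doc = Counter(words)
        # Every ordered pair of distinct words in the document,
        # including a word paired with itself.
        for (left, n), (right, m) in product(per_doc.items(), repeat=2):
            totals[(left, right)] += n * m
    return totals


# Hypothetical documents: A twice and B three times, then A and B once each.
docs = [["A", "A", "B", "B", "B"], ["A", "B"]]
counts = cooccurrence_counts(docs)

print(counts[("A", "B")])  # 2*3 + 1*1 = 7
print(counts[("A", "A")])  # 2*2 + 1*1 = 5
```

Note that a word paired with itself is counted too, which is also what a self-join on the document ID does.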
How about summarizing the table before the join?
The summary might be:
select count(*) count,
Chunk,
ChunkHash
from chunks
group by Chunk, ChunkHash
Then the join would be:
Select r.Chunk as RightChunk,
r.ChunkHash as RightChunkHash,
l.Chunk as LeftChunk,
l.ChunkHash as LeftChunkHash,
sum(l.Count) + sum(r.Count) as Count
from (
select count(*) count,
Chunk,
ChunkHash
from chunks
group by Chunk, ChunkHash
) l
join (
select count(*) count,
Chunk,
ChunkHash
from chunks
group by Chunk, ChunkHash
) r on l.Chunk = r.Chunk
group by r.Chunk, r.ChunkHash, l.Chunk, l.ChunkHash
The thing I'm not sure about is what you're counting, exactly. So my SUM() + SUM() is a guess. You might want SUM() * SUM().
Also, I'm assuming that two Chunk values are equal if and only if ChunkHash values are equal.
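For what it's worth, here is one way to test the "summarize first" idea on toy data with Python's sqlite3 module. Be aware that this is a variant, not the exact query above: I am assuming the summary has to keep Id, that the two summaries are joined on Id, and that the counts are multiplied. The `cnt` alias and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks ("
    "ChunkId INTEGER PRIMARY KEY, ChunkHash INTEGER, Id INTEGER, Chunk TEXT)"
)
# Hypothetical data: document 1 holds x, x, y; document 2 holds x, z.
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
    [(1, 10, 1, "x"), (2, 10, 1, "x"), (3, 20, 1, "y"),
     (4, 10, 2, "x"), (5, 30, 2, "z")],
)

# The self-join from the question, grouped by chunk and hash.
SELF_JOIN = """
SELECT t.Chunk, t.ChunkHash, q.Chunk, q.ChunkHash, COUNT(t.ChunkHash)
FROM chunks AS t JOIN chunks AS q ON t.Id = q.Id
GROUP BY t.Chunk, t.ChunkHash, q.Chunk, q.ChunkHash
"""

# Assumed variant: summarize per (Id, Chunk, ChunkHash), join the summaries
# on Id, and add up the products of the per-document counts.
SUMMARIZED = """
SELECT l.Chunk, l.ChunkHash, r.Chunk, r.ChunkHash, SUM(l.cnt * r.cnt)
FROM (
    SELECT Id, Chunk, ChunkHash, COUNT(*) AS cnt
    FROM chunks GROUP BY Id, Chunk, ChunkHash
) AS l
JOIN (
    SELECT Id, Chunk, ChunkHash, COUNT(*) AS cnt
    FROM chunks GROUP BY Id, Chunk, ChunkHash
) AS r ON l.Id = r.Id
GROUP BY l.Chunk, l.ChunkHash, r.Chunk, r.ChunkHash
"""

self_join = sorted(conn.execute(SELF_JOIN).fetchall())
summarized = sorted(conn.execute(SUMMARIZED).fetchall())

print(self_join == summarized)  # True
```

On this data the two agree. The summarized form only shrinks the join when words repeat within a document; if most words occur once per document, the intermediate result stays about the same size.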
I migrated from MySQL to PostgreSQL, and query execution time went from ~1.5 days to ~10 mins.
Here's the PostgreSQL query execution plan:
I am no longer using MySQL.