I've written a Ruby script to do the following:
In my mind, this seems to be the easiest and most logical way to go about this. This process will need to be configurable and repeated periodically, hence the script. I'm using SQLite because the data will always come in CSV form (no access to the original DB) and it's just easier to offload the processing to an (easily changeable) SQL statement.
The problem is that steps 1 and 2 take such a long time. I've searched for ways to improve the performance of SQLite and implemented some of the suggestions, with limited success.
PRAGMA synchronous = OFF
PRAGMA journal_mode = MEMORY
(I'm not sure whether this helps when using an in-memory DB.) After all of these, I get the following times:
Granted, I'm using a different language from the post mentioned above, and there are differences such as compiled vs. interpreted; however, the insert times are approximately 79,000 vs. 12,000 records/second. That's 6x slower.
I've also tried indexing some (or all) of the fields. This actually has the opposite effect: the indexing takes so long that any improvement in query time is completely overshadowed by the indexing time. Additionally, doing that with an in-memory DB eventually leads to an out-of-memory error due to the extra space required.
Is SQLite3 not the right DB for this amount of data? I've tried the same using MySQL, but its performance was even worse.
Lastly, here's a chopped down version of the code (some irrelevant niceties removed).
require 'csv'
require 'sqlite3'
inputFile = ARGV[0]
outputFile = ARGV[1]
criteria1 = ARGV[2]
criteria2 = ARGV[3]
criteria3 = ARGV[4]
begin
memDb = SQLite3::Database.new ":memory:"
memDb.execute "PRAGMA synchronous = OFF"
memDb.execute "PRAGMA journal_mode = MEMORY"
memDb.execute "DROP TABLE IF EXISTS Area"
memDb.execute "CREATE TABLE IF NOT EXISTS Area (StreetName TEXT, StreetType TEXT, Locality TEXT, State TEXT, PostCode INTEGER, Criteria1 REAL, Criteria2 REAL, Criteria3 REAL)"
insertStmt = memDb.prepare "INSERT INTO Area VALUES(?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8)"
# Read values from file
readCounter = 0
memDb.execute "BEGIN TRANSACTION"
blockReadTime = Time.now
CSV.foreach(inputFile) { |line|
readCounter += 1
break if readCounter > 100000
if readCounter % 10000 == 0
formattedReadCounter = readCounter.to_s.reverse.gsub(/...(?=.)/,'\&,').reverse
print "\rReading line #{formattedReadCounter} (#{Time.now - blockReadTime}s) "
STDOUT.flush
blockReadTime = Time.now
end
# Bound parameters are never parsed as SQL, so quotes must not be doubled here
insertStmt.execute(line[6] || "", line[7] || "", line[9] || "", line[10], line[11], line[12], line[13], line[14])
}
memDb.execute "END TRANSACTION"
insertStmt.close
# Process values
sqlQuery = <<eos
SELECT DISTINCT
'*',
'*',
Locality,
State,
PostCode
FROM
Area
GROUP BY
Locality,
State,
PostCode
HAVING
MAX(Criteria1) <= #{criteria1}
AND
MAX(Criteria2) <= #{criteria2}
AND
MAX(Criteria3) <= #{criteria3}
UNION
SELECT DISTINCT
StreetName,
StreetType,
Locality,
State,
PostCode
FROM
Area
WHERE
Locality NOT IN (
SELECT
Locality
FROM
Area
GROUP BY
Locality
HAVING
MAX(Criteria1) <= #{criteria1}
AND
MAX(Criteria2) <= #{criteria2}
AND
MAX(Criteria3) <= #{criteria3}
)
GROUP BY
StreetName,
StreetType,
Locality,
State,
PostCode
HAVING
MAX(Criteria1) <= #{criteria1}
AND
MAX(Criteria2) <= #{criteria2}
AND
MAX(Criteria3) <= #{criteria3}
eos
statement = memDb.prepare sqlQuery
# Output to CSV
csvFile = CSV.open(outputFile, "wb")
resultSet = statement.execute
resultSet.each { |row| csvFile << row}
csvFile.close
rescue SQLite3::Exception => ex
puts "Exception occurred: #{ex}"
ensure
statement.close if statement
memDb.close if memDb
end
Please feel free to poke fun at my naive Ruby coding - what don't kill me will hopefully make me a stronger coder.
In general, you should try UNION ALL instead of UNION, if possible, so that the two subqueries do not have to be checked for duplicates. However, in this case, SQLite then has to execute the DISTINCT in a separate step. Whether this is faster or not depends on your data.
According to my EXPLAIN QUERY PLAN experiments, the following two indexes should help most with this query:
CREATE INDEX i1 ON Area(Locality, State, PostCode);
CREATE INDEX i2 ON Area(StreetName, StreetType, Locality, State, PostCode);