Improve performance of Ruby script processing CSV
I've written a Ruby script to do the following:

1. Read a large CSV file into an in-memory SQLite database
2. Query the database with a single (configurable) SQL statement
3. Write the results out to another CSV file
In my mind, this seems to be the easiest and most logical way to go about it. The process needs to be configurable and repeated periodically, hence the script. I'm using SQLite because the data will always come in CSV form (no access to the original DB), and it's just easier to offload the processing to an (easily changeable) SQL statement.
The problem is that steps 1 and 2 take such a long time. I've searched for ways to improve the performance of SQLite and have implemented some of those suggestions, with limited success:

PRAGMA synchronous = OFF
PRAGMA journal_mode = MEMORY (not sure if this helps when using an in-memory DB)

After all these, I get the following times:
Granted, I'm using a different language to the post mentioned above, and there are differences such as compiled vs. interpreted; still, the insert rates are approximately 79,000 vs. 12,000 records/second, which is about 6x slower.
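One variant I haven't benchmarked yet is committing in batches instead of using one giant transaction. I don't know whether it makes any difference for an in-memory DB, but as a sketch (`insert_in_batches` is a hypothetical helper; `db` stands for anything that responds to `execute`, such as an `SQLite3::Database`):

```ruby
# Sketch only: commit every batch_size rows rather than holding all
# inserts in a single transaction. On an on-disk DB this bounds journal
# growth; whether it helps an in-memory DB is an open question.
def insert_in_batches(db, rows, batch_size = 10_000)
  rows.each_slice(batch_size) do |batch|
    db.execute "BEGIN TRANSACTION"
    batch.each do |r|
      db.execute("INSERT INTO Area VALUES (?, ?, ?, ?, ?, ?, ?, ?)", r)
    end
    db.execute "END TRANSACTION" # one commit per batch_size rows
  end
end
```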
I've also tried indexing some (or all) of the fields. This actually has the opposite effect: the indexing takes so long that any improvement in query time is completely overshadowed by the indexing time. Additionally, doing that on the in-memory DB eventually leads to an out-of-memory error due to the extra space required.
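If the indexes ever do pay for themselves, one option I could still try is building them once, after the bulk load, rather than maintaining them on every INSERT. A sketch (the helper and index name are made up; `db` again only needs to respond to `execute`):

```ruby
# Sketch: defer index creation until after the load, so the table gets
# one index build at the end instead of an index update per INSERT.
def load_then_index(db, rows)
  db.execute "BEGIN TRANSACTION"
  rows.each do |r|
    db.execute("INSERT INTO Area VALUES (?, ?, ?, ?, ?, ?, ?, ?)", r)
  end
  db.execute "END TRANSACTION"
  # Hypothetical index name; created only after the last row is in.
  db.execute "CREATE INDEX IF NOT EXISTS idx_area_locality " \
             "ON Area(Locality, State, PostCode)"
end
```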
Is SQLite3 not the right DB for this amount of data? I've tried the same thing using MySQL, but its performance was even worse.
Lastly, here's a cut-down version of the code (some irrelevant niceties removed).
require 'csv'
require 'sqlite3'

inputFile = ARGV[0]
outputFile = ARGV[1]
criteria1 = ARGV[2]
criteria2 = ARGV[3]
criteria3 = ARGV[4]

begin
  memDb = SQLite3::Database.new ":memory:"

  memDb.execute "PRAGMA synchronous = OFF"
  memDb.execute "PRAGMA journal_mode = MEMORY"
  memDb.execute "DROP TABLE IF EXISTS Area"
  memDb.execute "CREATE TABLE IF NOT EXISTS Area (StreetName TEXT, StreetType TEXT, Locality TEXT, State TEXT, PostCode INTEGER, Criteria1 REAL, Criteria2 REAL, Criteria3 REAL)"
  insertStmt = memDb.prepare "INSERT INTO Area VALUES(?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8)"

  # Read values from the input file inside a single transaction
  readCounter = 0
  memDb.execute "BEGIN TRANSACTION"
  blockReadTime = Time.now
  CSV.foreach(inputFile) do |line|
    readCounter += 1
    break if readCounter > 100000
    if readCounter % 10000 == 0
      # Add thousands separators for the progress display (1234567 -> 1,234,567)
      formattedReadCounter = readCounter.to_s.reverse.gsub(/...(?=.)/, '\&,').reverse
      print "\rReading line #{formattedReadCounter} (#{Time.now - blockReadTime}s) "
      STDOUT.flush
      blockReadTime = Time.now
    end
    insertStmt.execute((line[6] || "").gsub("'", "''"),
                       (line[7] || "").gsub("'", "''"),
                       (line[9] || "").gsub("'", "''"),
                       line[10], line[11], line[12], line[13], line[14])
  end
  memDb.execute "END TRANSACTION"
  insertStmt.close

  # Process values
  sqlQuery = <<eos
    SELECT DISTINCT '*', '*', Locality, State, PostCode
    FROM Area
    GROUP BY Locality, State, PostCode
    HAVING MAX(Criteria1) <= #{criteria1}
       AND MAX(Criteria2) <= #{criteria2}
       AND MAX(Criteria3) <= #{criteria3}
    UNION
    SELECT DISTINCT StreetName, StreetType, Locality, State, PostCode
    FROM Area
    WHERE Locality NOT IN (
        SELECT Locality
        FROM Area
        GROUP BY Locality
        HAVING MAX(Criteria1) <= #{criteria1}
           AND MAX(Criteria2) <= #{criteria2}
           AND MAX(Criteria3) <= #{criteria3}
    )
    GROUP BY StreetName, StreetType, Locality, State, PostCode
    HAVING MAX(Criteria1) <= #{criteria1}
       AND MAX(Criteria2) <= #{criteria2}
       AND MAX(Criteria3) <= #{criteria3}
eos
  statement = memDb.prepare sqlQuery

  # Output the result set to CSV
  csvFile = CSV.open(outputFile, "wb")
  resultSet = statement.execute
  resultSet.each { |row| csvFile << row }
  csvFile.close
rescue SQLite3::Exception => ex
  puts "Exception occurred: #{ex}"
ensure
  statement.close if statement
  memDb.close if memDb
end
Please feel free to poke fun at my naive Ruby coding; what doesn't kill me will hopefully make me a stronger coder.
In general, you should try UNION ALL instead of UNION, if possible, so that the two subqueries do not have to be checked for duplicates. However, in this case, SQLite then has to execute the DISTINCT in a separate step. Whether this is faster or not depends on your data.
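As a sketch of what that would look like (untested against your data; the `?` placeholders stand in for your three criteria values, bound at execute time instead of interpolated):

```ruby
# Sketch of the UNION ALL variant: an outer SELECT DISTINCT does the
# duplicate removal in one separate step, so the two branches can be
# concatenated with UNION ALL instead of merge-deduplicated by UNION.
sql_union_all = <<~SQL
  SELECT DISTINCT * FROM (
    SELECT '*' AS StreetName, '*' AS StreetType, Locality, State, PostCode
    FROM Area
    GROUP BY Locality, State, PostCode
    HAVING MAX(Criteria1) <= ? AND MAX(Criteria2) <= ? AND MAX(Criteria3) <= ?
    UNION ALL
    SELECT StreetName, StreetType, Locality, State, PostCode
    FROM Area
    WHERE Locality NOT IN (
      SELECT Locality FROM Area
      GROUP BY Locality
      HAVING MAX(Criteria1) <= ? AND MAX(Criteria2) <= ? AND MAX(Criteria3) <= ?
    )
    GROUP BY StreetName, StreetType, Locality, State, PostCode
    HAVING MAX(Criteria1) <= ? AND MAX(Criteria2) <= ? AND MAX(Criteria3) <= ?
  )
SQL
```

You would prepare this and bind the three criteria values to the nine placeholders; whether it beats the plain UNION depends on your data, as above.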
According to my EXPLAIN QUERY PLAN experiments, the following two indexes should help most with this query:
CREATE INDEX i1 ON Area(Locality, State, PostCode);
CREATE INDEX i2 ON Area(StreetName, StreetType, Locality, State, PostCode);