
Improve performance of Ruby script processing CSV

I've written a Ruby script to do the following:

  1. Read a very large (2GB / 12,500,000 lines) CSV into SQLite3
  2. Query the db
  3. Output results to a new CSV

In my mind, this seems to be the easiest and most logical way to go about it. The process needs to be configurable and repeated periodically, hence the script. I'm using SQLite because the data will always arrive in CSV form (I have no access to the original DB), and it's easier to offload the processing to an (easily changeable) SQL statement.

The problem is that steps 1 and 2 take such a long time. I've searched for ways to improve the performance of SQLite and have implemented some of those suggestions, with limited success:

  • In-memory instance of SQLite3
  • Use a transaction (around step 1)
  • Use a prepared statement
  • PRAGMA synchronous = OFF
  • PRAGMA journal_mode = MEMORY (not sure whether this helps when using an in-memory DB)

After all of these, I get the following times:

  • Read time: 17m 28s
  • Query time: 14m 26s
  • Write time: 0m 4s
  • Elapsed time: 31m 58s

Granted, I'm using a different language from the post mentioned above, and there are differences such as compiled vs. interpreted. However, the insert rates are approximately 79,000 vs. 12,000 records/second, which makes mine about 6x slower.
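For what it's worth, the 12,000 figure can be derived directly from the numbers quoted above (12,500,000 lines over a 17m 28s read):

```ruby
# Derive the insert rate from the timings reported above.
total_lines  = 12_500_000
read_seconds = 17 * 60 + 28          # "Read time: 17m 28s"

rate = total_lines / read_seconds    # integer records per second
puts rate                            # => 11927, i.e. roughly 12,000 records/second
```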

I've also tried indexing some (or all) of the fields. This actually has the opposite effect: the indexing takes so long that any improvement in query time is completely overshadowed by the indexing time. Additionally, building indexes on the in-memory DB eventually leads to an out-of-memory error because of the extra space required.

Is SQLite3 not the right DB for this amount of data? I've tried the same thing using MySQL, but its performance was even worse.

Lastly, here's a cut-down version of the code (some irrelevant niceties removed).

require 'csv'
require 'sqlite3'

inputFile = ARGV[0]
outputFile = ARGV[1]
criteria1 = ARGV[2]
criteria2 = ARGV[3]
criteria3 = ARGV[4]

begin
    memDb = SQLite3::Database.new ":memory:"
    memDb.execute "PRAGMA synchronous = OFF"
    memDb.execute "PRAGMA journal_mode = MEMORY"

    memDb.execute "DROP TABLE IF EXISTS Area"
    memDb.execute "CREATE TABLE IF NOT EXISTS Area (StreetName TEXT, StreetType TEXT, Locality TEXT, State TEXT, PostCode INTEGER, Criteria1 REAL, Criteria2 REAL, Criteria3 REAL)" 
    insertStmt = memDb.prepare "INSERT INTO Area VALUES(?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8)"

    # Read values from file
    readCounter = 0
    memDb.execute "BEGIN TRANSACTION"
    blockReadTime = Time.now
    CSV.foreach(inputFile) { |line|

        readCounter += 1
        break if readCounter > 100000 # note: caps the load at 100,000 lines
        if readCounter % 10000 == 0
            formattedReadCounter = readCounter.to_s.reverse.gsub(/...(?=.)/,'\&,').reverse
            print "\rReading line #{formattedReadCounter} (#{Time.now - blockReadTime}s)     " 
            STDOUT.flush
            blockReadTime = Time.now
        end

        # Note: with a prepared statement the parameters are bound, not interpolated,
        # so the manual quote-escaping below is redundant (kept as in the original)
        insertStmt.execute (line[6]||"").gsub("'", "''"), (line[7]||"").gsub("'", "''"), (line[9]||"").gsub("'", "''"), line[10], line[11], line[12], line[13], line[14]
    }
    memDb.execute "END TRANSACTION"
    insertStmt.close

    # Process values
    sqlQuery = <<eos
    SELECT DISTINCT
        '*',
        '*',
        Locality,
        State,
        PostCode
    FROM
        Area
    GROUP BY
        Locality,
        State,
        PostCode
    HAVING
        MAX(Criteria1) <= #{criteria1}
        AND
        MAX(Criteria2) <= #{criteria2}
        AND
        MAX(Criteria3) <= #{criteria3}
    UNION
    SELECT DISTINCT
        StreetName,
        StreetType,
        Locality,
        State,
        PostCode
    FROM
        Area
    WHERE
        Locality NOT IN (
            SELECT
                Locality
            FROM
                Area
            GROUP BY
                Locality
            HAVING
                MAX(Criteria1) <= #{criteria1}
                AND
                MAX(Criteria2) <= #{criteria2}
                AND
                MAX(Criteria3) <= #{criteria3}
            )
    GROUP BY
        StreetName,
        StreetType,
        Locality,
        State,
        PostCode
    HAVING
        MAX(Criteria1) <= #{criteria1}
        AND
        MAX(Criteria2) <= #{criteria2}
        AND
        MAX(Criteria3) <= #{criteria3}
eos
    statement = memDb.prepare sqlQuery

    # Output to CSV
    csvFile = CSV.open(outputFile, "wb")
    resultSet = statement.execute
    resultSet.each { |row|  csvFile << row}
    csvFile.close

rescue SQLite3::Exception => ex
    puts "Exception occurred: #{ex}"
ensure
    statement.close if statement
    memDb.close if memDb
end
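Incidentally, the comma-grouping one-liner used in the progress output can be lifted out and exercised on its own (the helper name `with_commas` is mine, not from the script):

```ruby
# Reverse the digit string, insert a comma after every complete group of
# three digits that is followed by at least one more digit, reverse back.
def with_commas(n)
  n.to_s.reverse.gsub(/...(?=.)/, '\&,').reverse
end

puts with_commas(12500000)   # => 12,500,000
puts with_commas(100)        # => 100
```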

Please feel free to poke fun at my naive Ruby coding; what doesn't kill me will hopefully make me a stronger coder.

In general, you should try UNION ALL instead of UNION, if possible, so that the results of the two subqueries do not have to be checked against each other for duplicates. However, in this case, SQLite then has to execute the DISTINCT in a separate step. Whether this is faster depends on your data.
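A sketch of that variant, using the same table and columns as the question and `?`-style placeholders in place of the script's string interpolation (which, incidentally, would also sidestep quoting issues):

```sql
-- Sketch: UNION ALL concatenates the two branch results without a merge/dedup
-- pass; each branch's own DISTINCT/GROUP BY still deduplicates within it.
SELECT DISTINCT '*', '*', Locality, State, PostCode
FROM Area
GROUP BY Locality, State, PostCode
HAVING MAX(Criteria1) <= ?1 AND MAX(Criteria2) <= ?2 AND MAX(Criteria3) <= ?3
UNION ALL
SELECT DISTINCT StreetName, StreetType, Locality, State, PostCode
FROM Area
WHERE Locality NOT IN (
    SELECT Locality FROM Area
    GROUP BY Locality
    HAVING MAX(Criteria1) <= ?1 AND MAX(Criteria2) <= ?2 AND MAX(Criteria3) <= ?3
)
GROUP BY StreetName, StreetType, Locality, State, PostCode
HAVING MAX(Criteria1) <= ?1 AND MAX(Criteria2) <= ?2 AND MAX(Criteria3) <= ?3
```

Here the two branches cannot produce overlapping rows anyway (the first emits literal '*' columns), so the cross-branch duplicate check that UNION performs is pure overhead.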

According to my EXPLAIN QUERY PLAN experiments, the following two indexes should help most with this query:

CREATE INDEX i1 ON Area(Locality, State, PostCode);
CREATE INDEX i2 ON Area(StreetName, StreetType, Locality, State, PostCode);
