I'm essentially trying to perform an sql left join in F# taking in two csv-files and producing a third. My files are not huge (around 200k rows) but I still experience horrific performance - in fact it is faster using vlookup in xl...
The csv's both have an identifier "column" which holds compatible values but a value present in one csv is not guaranteed to be in the other.
I have tinkered with it, and I suspect the search through one csv for every row in the other is the killer.
EDIT: Replacing Array with Map improves performance a lot. But I guess this can still be improved further
Any ideas for improving this?
Some (pseudo) code:
open FSharp.Data
type DataLeft =
CsvProvider<Sample = "identifier;var1;var2", AssumeMissingValues = true, Schema = "identifier (string), var1, var2", Separators=";", HasHeaders=true, Encoding="UTF-8">
type DataRight =
CsvProvider<Sample = "identifier;var3;var4", AssumeMissingValues = true, Schema = "identifier (string), var3 (float option), var4 (float option)", Separators=";", HasHeaders=true, Encoding="UTF-8">
type Output =
CsvProvider<Sample = "identifier;var1;var2;var3;var4", AssumeMissingValues = true, Schema = "identifier (string), var1, var2, var3 (float option), var4 (float option)", Separators=";", HasHeaders=true, Encoding="UTF-8">
let leftRows = DataLeft.Load(leftPath).Rows
// (slightly) more efficient to convert to array
let rightRows = DataRight.Load(rightPath).Rows |> Seq.toArray
**EDIT: let rightRows = DataRight.Load(rightPath).Rows |> Seq.map (fun row -> (row.Identifier, row)) |> Map.ofSeq**
let getMissingVars (row : DataLeft.Row) =
let id = row.Identifier
let rightRow = rightRows |> Array.tryFind (fun rRow -> rRow.Identifier = id)
**EDIT: let rightRow = rightRows.TryFind(id)**
match rightRow with
| None ->
Output.Row(
id,
row.Var1,
row.Var2,
None,
None)
| Some realRow ->
Output.Row(
id,
row.Var1,
row.Var2,
realRow.Var3,
realRow.Var4)
let rows = leftRows |> Seq.map getMissingVars
let csv = new Output(rows)
csv.Save(path = "outputPath")
Simply creating a dictionary for lookups solved the issue. I gave up trying to improve further, so posting this as answer.
As per edit replace
let rightRows = DataRight.Load(rightPath).Rows |> Seq.toArray
with
let rightRows =
DataRight.Load(rightPath).Rows
|> Seq.map (fun row -> (row.Identifier, row))
|> Map.ofSeq
or some nicer dictionary. Then replace Array.tryFind
with Map.tryFind
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.