
PowerShell on CSV file - looking for string depending on string

I need help with PowerShell scripting on a CSV file.

I've searched but cannot find what I'm looking for (or perhaps I don't know the technical terms). Basically, I have an Excel workbook with a large amount of data (roughly 38 columns x 350,000 rows), and a couple of formulas take hours to calculate.

I was first wondering whether PowerShell could speed up the calculation a bit compared to Excel. The calculations taking most of my time are in fact not that complex (at least at first glance). My data is structured more or less like this:

Ref      Title
-----    --------------------------
A/001    "free_text"
A/002    "free_text A/001 free_text"
...      ...
A/005    "free_text A/004 free_text"
A/006    "free_text"
B/001    "free_text" 
B/002    "free_text"
C/001    "free_text"
C/002    "free_text"
...
C/050    "free_text C/047 free_text"
...      ...
C/103    "free_text"
D/001    "free_text"
D/002    "free_text D/001 free_text"
...      ....

Basically the data is as follows:

  1. The Ref field contains unique values, in {letter}/{incremental value} format.
  2. In some rows, the Title field may call up one of the Ref values. For example, in row 2, the Title calls the A/001 Ref. In the last row, the Title calls the D/001 Ref, etc.
  3. There is no logical pattern defining when a Ref is called up in a Title. This is random.

However, what I'm 100% sure of is the following:

  1. The Ref called in the Title always belongs to the same {letter} block. For example: the string 'C/047' in the Title field can only be found in the block where the Ref {letter} is C.
  2. A row whose Title calls a Ref is always located after (in a lower row than) the row of the Ref it refers to. In other words, I cannot have a row with the following pattern:

    Ref           Title
    ----------    ----------------------------------
    {letter}/i    free_text {letter}/j free_text        with j > i

    → This is not possible.
    → j is always < i

I've used these characteristics in Excel to minimize my lookup arrays. But it still takes an hour to calculate everything.

I've therefore looked into PowerShell and started to 'play' a bit with the CSV, looping with ForEach-Object in the hope of getting quicker results. So far I have basically ended up looping twice over my CSV file.

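$CSV1 = Import-Csv myfile.csv
$CSV2 = Import-Csv myfile.csv

$CSV1 | ForEach-Object {
    # Ref to search for in the Title column
    $TitSearch = $_.Ref
    $CSV2 | ForEach-Object {
        # substring match: the Ref is embedded in the Title's free text
        if ($_.Title -like "*$TitSearch*") {
            # myinstructions
        }
    }
}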

It works, but it is extremely slow. So I then tried the following instead of the inner $CSV2 | ForEach... loop:

$CSV2 | Where-Object { $_.Title -like "*$TitSearch*" } | ForEach-Object Ref

In either case, it takes too long and is not efficient at all. Additionally, neither of these two solutions uses the characteristics above, which could reduce the lookup array, and as already stated, I seem to end up looping twice over the CSV file from beginning to end.

Questions:

  1. Is there a leaner way to do this?
  2. Am I wasting my time with PowerShell?
  3. I thought about creating one file per Ref {letter} block (one file for block A, one for B, etc.). However, I would have about 50,000 blocks to create. Alternatively, I could create them one by one, carry out the analysis, put the results in a new file, and delete them. Would that be quicker?

Note: this is for work, to be used by other colleagues, and Excel and PowerShell are really the only software we may use. I know VBA but ok... In the end, I'm curious whether and how this can be solved in a simple manner using PowerShell.

As far as I can see, your base algorithm does N^2 iterations (350,000^2, roughly 120 billion). There is a standard way to make it efficient: build a hashtable first. A hashtable is a key/value store whose lookups take roughly constant time, so the algorithm's time complexity becomes ~N. PowerShell has a built-in data type for that. In your case the key would be the ref, and the value an array of cell data (assuming your table is something like: ref, title, col1, ..., colN).

$hash = @{}
# it takes 350K steps to build the hashtable
foreach ($row in $table) {
    # list the remaining columns (col1, ..., colN) in the array as well
    $hash.Add($row.ref, @($row.title, $row.col1))
}
# then you can iterate over it again
foreach ($key in $hash.Keys) {
    $key                                # current ref
    $rowData = $hash[$key]              # current row's elements (by index)
    # lookup from another row; $j is the index of the column assumed to hold the referenced ref
    $refRowData = $hash[$rowData[$j]]
}
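
Applied to the Ref/Title layout from the question, the same idea could look roughly like the sketch below. This is only an illustration under assumptions: the inline sample rows stand in for myfile.csv, the names $byRef, $refPattern and $results are made up for the example, and the regex assumes a Ref is always letters, a slash, then digits. It also only picks up the first Ref-shaped token in each Title.

```powershell
# Hypothetical sample rows in the question's layout; in practice use Import-Csv myfile.csv.
$table = @'
Ref,Title
A/001,free_text
A/002,free_text A/001 free_text
A/003,free_text
B/001,free_text
B/002,free_text B/001 free_text
'@ | ConvertFrom-Csv

# Pass 1: index every row by its Ref (one step per row).
$byRef = @{}
foreach ($row in $table) {
    $byRef[$row.Ref] = $row
}

# Pass 2: extract a Ref-shaped token from each Title and look it up directly.
# Assumption: a Ref is one or more letters, a slash, then digits.
$refPattern = [regex]'[A-Za-z]+/\d+'
$results = foreach ($row in $table) {
    $match = $refPattern.Match($row.Title)
    # Only keep tokens that really exist as a Ref in the table.
    if ($match.Success -and $byRef.ContainsKey($match.Value)) {
        [pscustomobject]@{
            Ref      = $row.Ref
            RefersTo = $match.Value
            RefTitle = $byRef[$match.Value].Title
        }
    }
}

$results | Format-Table -AutoSize
```

Because each lookup goes straight to the matching row, the two guarantees from the question (same {letter} block, referenced row located earlier) are not needed to shrink the search range here, although they could still be used as sanity checks.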

So that is the general idea for solving the time issue. To be honest, I don't believe you need to reinvent the wheel and code it yourself. What you need is a relational database. Since you have Excel, you should have MS Access too. Just import your data there, index ref and title, and then all you need is a self join. MS Access is not great, but I'm sure it will handle 350K rows just fine. Ideally you'd get a database on a corporate MS SQL Server (open a ticket, talk to your manager, etc.). It will calculate all that in seconds, and then you can link the output to a spreadsheet as well.


 