PowerShell on CSV file - looking for string depending on string

I need help with PowerShell processing of a CSV file.

I've searched but cannot find what I'm looking for (or perhaps I don't know the right technical terms). Basically, I have an Excel workbook with a large amount of data (roughly 38 columns x 350,000 rows), and a couple of formulas take hours to calculate.

I was first wondering whether PowerShell could speed up the calculation compared to Excel. The calculations taking most of my time are in fact not that complex (at least at first glance). My data is structured more or less like this:

Ref      Title
-----    --------------------------
A/001    "free_text"
A/002    "free_text A/001 free_text"
...      ...
A/005    "free_text A/004 free_text"
A/006    "free_text"
B/001    "free_text" 
B/002    "free_text"
C/001    "free_text"
C/002    "free_text"
...
C/050    "free_text C/047 free_text"
...      ...
C/103    "free_text"
D/001    "free_text"
D/002    "free_text D/001 free_text"
...      ....

Basically the data is as follows:

  1. The Ref field contains unique values, in {letter}/{incremental value} format.
  2. In some rows, the Title field may cite one of the Ref values. For example, in line 2, the Title cites Ref A/001. In the last row, the Title cites Ref D/001, etc.
  3. There is no pattern defining when a Ref is cited in a Title. It is random.
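
The Ref format described above can be picked out of a Title with a regular expression. This is only a minimal sketch: the helper name `Get-CitedRef` is hypothetical, and the pattern assumes a Ref is one or more uppercase letters, a slash, and then digits, which the post does not state exactly.

```powershell
# Assumption: a Ref is one or more uppercase letters, a slash, then digits (e.g. A/001).
$refPattern = [regex]'\b[A-Z]+/\d+\b'

# Hypothetical helper: returns the first Ref-like token in a title, or $null if none.
function Get-CitedRef {
    param([string]$Title)
    $m = $refPattern.Match($Title)
    if ($m.Success) { $m.Value } else { $null }
}

Get-CitedRef 'free_text A/001 free_text'   # A/001
Get-CitedRef 'free_text'                   # no output
```

If the free text can itself contain tokens of this shape, the pattern would need tightening.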

However, what I'm 100% sure of is the following:

  1. The Ref cited in a Title always belongs to the same {letter} block. For example, the string 'C/047' in the Title field can only be found in the block where the Ref {letter} is C.
  2. A row whose Title cites a Ref is always located after (in a lower row than) the row of the Ref it cites. In other words, I cannot have a line with the following pattern:

    Ref           Title
    ----------    ----------------------------------
    {letter}/i    free_text {letter}/j free_text       with j > i

    → This is not possible.
    → j is always < i
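
The two guarantees can be expressed as a small check. This is a sketch under assumptions: the helper name `Test-CitationOrder` is hypothetical, and it assumes, as in the sample data (A/002 cites A/001, C/050 cites C/047), that the cited Ref carries a lower number than the Ref of the row citing it.

```powershell
# Hypothetical helper: $true when $Cited is in the same block as $Ref
# and has a strictly smaller number (assumed from the sample data).
function Test-CitationOrder {
    param([string]$Ref, [string]$Cited)
    $r = $Ref.Split('/')     # e.g. 'C/050' -> 'C', '050'
    $c = $Cited.Split('/')
    ($r[0] -ceq $c[0]) -and ([int]$c[1] -lt [int]$r[1])
}

Test-CitationOrder -Ref 'C/050' -Cited 'C/047'   # True
Test-CitationOrder -Ref 'A/001' -Cited 'A/004'   # False
Test-CitationOrder -Ref 'B/002' -Cited 'A/001'   # False (different block)
```

Running such a check over the file once would confirm whether the data really obeys both rules before relying on them to shrink the search.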

I've used these characteristics in Excel to minimize my lookup arrays, but it still takes an hour to calculate everything.

I therefore looked into PowerShell and started to 'play' a bit with the CSV, looping with ForEach-Object in the hope of getting quicker results. So far I have basically ended up looping twice over my CSV file.

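$CSV1 = Import-Csv myfile.csv
$CSV2 = Import-Csv myfile.csv

$CSV1 | ForEach-Object {
    # Ref of the current row, to be searched for in every Title
    $TitSearch = $_.Ref
    $CSV2 | ForEach-Object {
        # -like with wildcards tests whether the Title contains the Ref
        if ($_.Title -like "*$TitSearch*") {
            # myinstructions
        }
    }
}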

It works, but it's really, really slow. So I then tried the following instead of the inner $CSV2 | ForEach... loop:

$CSV2 | Where-Object { $_.Title -like "*$TitSearch*" } | ForEach-Object Ref

In either case, it takes too long and is not efficient at all. Additionally, neither solution uses the characteristics above, which could reduce the lookup array, and as already stated, I end up looping over the whole CSV file twice, from beginning to end.

Questions:

  1. Is there a leaner way to do this?
  2. Am I wasting my time with PowerShell?
  3. I thought about creating one file per Ref {letter} block (one file for block A, one for B, etc.). However, I would have about 50,000 blocks to create. Alternatively, I could create them one at a time: carry out the analysis, put the results in a new file, and delete the block file. Would that be quicker?

Note: this is for work, to be used by other colleagues, and Excel and PowerShell are really the only software we may use. I know VBA, but in the end I'm curious whether and how this can be solved in a simple manner using PowerShell.

As far as I can see, your base algorithm does N^2 iterations (~120 billion). There is a standard way to make it efficient: build a hashtable first. A hashtable is a key/value store whose lookups take roughly constant time, so the algorithm's time complexity becomes ~N. PowerShell has a built-in data type for that. In your case the key would be the ref, and the value an array of cell data (assuming your table looks something like: ref, title, col1, ..., colN).

# $table is assumed to hold the imported rows, e.g. $table = Import-Csv myfile.csv
$hash = @{}
foreach ($row in $table) {
    # key: ref; value: array of the row's other cells (title, col1, ..., colN)
    $hash.Add($row.ref, @($row.title, $row.col1))   # ... add further columns as needed
}
# building the hashtable takes ~350K steps
# then you can iterate over it again
foreach ($key in $hash.Keys) {
    $key                                # current ref
    $rowData = $hash[$key]              # current row's cells (accessed by index)
    $refRowData = $hash[$rowData[$j]]   # look up another row, assuming the reference sits in column $j
}
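
Applied to the Ref/Title layout from the question, the same idea might look like the sketch below. It rests on assumptions: the columns are named Ref and Title, a Ref is uppercase letters, a slash, and digits, and a small inline sample stands in for `Import-Csv myfile.csv`. The output property names `CitedRef` and `CitedTitle` are made up for illustration.

```powershell
# Stand-in for: $table = Import-Csv myfile.csv   (column names Ref and Title assumed)
$table = @'
Ref,Title
A/001,free_text
A/002,free_text A/001 free_text
B/001,free_text
B/002,free_text B/001 free_text
'@ | ConvertFrom-Csv

# Pass 1: index every row by its Ref (one step per row).
$byRef = @{}
foreach ($row in $table) { $byRef[$row.Ref] = $row }

# Pass 2: pull a Ref-like token out of each Title and look it up directly.
$refPattern = [regex]'\b[A-Z]+/\d+\b'   # assumed Ref format
$result = foreach ($row in $table) {
    $m = $refPattern.Match($row.Title)
    if ($m.Success -and $byRef.ContainsKey($m.Value)) {
        [pscustomobject]@{
            Ref        = $row.Ref
            CitedRef   = $m.Value
            CitedTitle = $byRef[$m.Value].Title
        }
    }
}
$result | Format-Table
```

Each row costs one regex match and one hashtable lookup, so the work grows with the number of rows rather than with its square.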

So that's the general idea for solving the time issue. To be honest, I don't believe you need to reinvent the wheel and code it yourself. What you need is a relational database. Since you have Excel, you should have MS Access too. Just import your data there, index the ref and title columns, and then all you need to do is a self join. MS Access isn't great, but I'm sure it will handle 350K rows just fine. Ideally you'd get a database on some corporate MSSQL server (open a ticket, talk to your manager, etc.). It will calculate all that in seconds, and then you can link the output to a spreadsheet as well.

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, cite this site's URL or the original source.

 