
Optimizing a script

Info

I've created a script which analyzes the debug logs from Windows DNS Server.

It does the following:

  1. Opens the debug log using the [System.IO.File] class
  2. Performs a regex match on each line
  3. Separates 16 capture groups into different properties of a custom object
  4. Fills dictionaries, appending to the value of each key to produce statistics
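In outline, the steps look like this. The pattern here is an illustrative three-group stand-in for the real 16-group regex, and the sample log is generated inline so the sketch is self-contained:

```powershell
# Write a small sample log so the sketch runs end-to-end (illustrative data)
$path = Join-Path ([IO.Path]::GetTempPath()) 'dns-sample.log'
@('21/03/2014 2:20:03 PM 0D0C PACKET 0000000005FCB280 UDP Rcv 202.90.34.177 ...',
  '21/03/2014 2:20:03 PM 0D0C PACKET 00000000042EB8B0 UDP Rcv 202.90.34.177 ...') |
    Set-Content $path

# Step 1: open the log
$reader = [System.IO.File]::OpenText($path)

# Step 2: regex with (here, three) named capture groups
$regex = '^(?<date>\d{2}/\d{2}/\d{4}).*UDP\s+(?<dir>\w+)\s+(?<ip>\S+)'

# Step 4: dictionary keyed on a capture, accumulating statistics
$stats = @{}

while (($line = $reader.ReadLine()) -ne $null) {
    if ($line -match $regex) {
        $record = [pscustomobject]@{          # Step 3: groups become properties
            Date = $matches['date']
            Ip   = $matches['ip']
        }
        $stats[$record.Ip] = 1 + $stats[$record.Ip]
    }
}
$reader.Close()

$stats['202.90.34.177']   # 2
```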

Steps 1 and 2 take the longest. In fact, they take a seemingly endless amount of time, because the file is growing as it is being read.

Problem

Due to the size of the debug log (80,000 KB, i.e. roughly 80 MB), it takes a very long time.

I believe that my code is fine for smaller text files, but it fails to deal with much larger files.

Code

Here is my code: https://github.com/cetanu/msDnsStats/blob/master/msdnsStats.ps1

Debug log preview

This is what the debug log looks like (including the blank lines):

Multiply this by about 100,000,000 and you have my debug log.

21/03/2014 2:20:03 PM 0D0C PACKET  0000000005FCB280 UDP Rcv 202.90.34.177   3709   Q [1001   D   NOERROR] A      (2)up(13)massrelevance(3)com(0)

21/03/2014 2:20:03 PM 0D0C PACKET  00000000042EB8B0 UDP Rcv 67.215.83.19    097f   Q [0000       NOERROR] CNAME  (15)manchesterunity(3)org(2)au(0)

21/03/2014 2:20:03 PM 0D0C PACKET  0000000003131170 UDP Rcv 62.36.4.166     a504   Q [0001   D   NOERROR] A      (3)ekt(4)user(7)net0319(3)com(0)

21/03/2014 2:20:03 PM 0D0C PACKET  00000000089F1FD0 UDP Rcv 80.10.201.71    3e08   Q [1000       NOERROR] A      (4)dns1(5)offis(3)com(2)au(0)

Request

I need ways or ideas on how to open and read each line of a file more quickly than what I am doing now.

I am open to suggestions of using a different language.

I would trade this:

$dnslog = [System.IO.File]::Open("c:\dns.log","Open","Read","ReadWrite")
$dnslog_content = New-Object System.IO.StreamReader($dnslog)

For ($i = 0; $i -lt $dnslog.Length; $i++)
{
    $line = $dnslog_content.ReadLine()
    if ($line -eq $null) { continue }

    # Regex match each line of the logfile
    $pattern = $line | Select-String -Pattern $regex

    # Ignore empty match
    if ($pattern -eq $null) {
        continue
    }

for this:

Get-Content 'c:\dns.log' -ReadCount 1000 |
  ForEach-Object {
    foreach ($line in $_)
    {
      if ($line -match $regex)
      {
        # Process matches
      }
    }
  }
That will reduce the number of file read operations by a factor of 1000.

Trading out the Select-String operation will require refactoring the rest of the code to work with $matches[n] instead of $pattern.matches[0].groups[$n].value, but it is much faster. Select-String returns MatchInfo objects which contain a lot of additional information about the match (line number, filename, etc.), which is great if you need it. If all you need is the strings from the captures, then it's wasted effort.
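A minimal side-by-side of the two approaches (the pattern is an illustrative two-group stand-in, not the one from the linked script):

```powershell
$line  = '21/03/2014 2:20:03 PM 0D0C PACKET 0000000005FCB280 UDP Rcv 202.90.34.177 ...'
$regex = '^(\d{2}/\d{2}/\d{4}).*UDP\s+\w+\s+(\S+)'

# Select-String: builds a MatchInfo object per line -- informative, but heavy in a tight loop
$pattern = $line | Select-String -Pattern $regex
$ipSlow  = $pattern.Matches[0].Groups[2].Value

# -match: fills the automatic $matches hashtable -- the same capture, far less overhead
if ($line -match $regex) {
    $ipFast = $matches[2]
}

$ipSlow -eq $ipFast   # True
```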

You're creating an object ($log), and then accumulating values into array properties:

$log.date                += @($pattern.matches[0].groups[$n].value); $n++

That array addition is going to kill your performance: PowerShell arrays are fixed-size, so each += copies the entire array into a new, larger one. Also, hash table operations are faster than object property updates.

I'd create $log as a hash table first, and the key values as array lists:

$log = @{}
$log.date = New-Object System.Collections.ArrayList

Then inside your loop:

$log.date.Add($matches[1]) > $null

Then create your object from $log after you've populated all of the array lists.
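Putting those pieces together, a sketch of the accumulate-then-convert pattern (trimmed to two fields and an illustrative two-group pattern):

```powershell
$log = @{}
$log.date = New-Object System.Collections.ArrayList
$log.ip   = New-Object System.Collections.ArrayList

$regex = '^(\d{2}/\d{2}/\d{4}).*UDP\s+\w+\s+(\S+)'   # illustrative, not the real 16-group regex

$lines = @('21/03/2014 2:20:03 PM 0D0C PACKET 0000000005FCB280 UDP Rcv 202.90.34.177 ...')
foreach ($line in $lines) {
    if ($line -match $regex) {
        $log.date.Add($matches[1]) > $null   # Add() returns the new index; discard it
        $log.ip.Add($matches[2])   > $null
    }
}

# Build the output object once, after all the lists are populated
$logObject = [pscustomobject]$log
$logObject.ip[0]   # 202.90.34.177
```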

As a general piece of advice, use Measure-Command to find out which script blocks take the longest.
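For instance, to see how badly array += compares with ArrayList.Add() (the iteration count is arbitrary, and absolute timings will vary by machine):

```powershell
$slow = Measure-Command {
    $a = @()
    foreach ($i in 1..10000) { $a += $i }           # copies the whole array on every pass
}
$fast = Measure-Command {
    $l = New-Object System.Collections.ArrayList
    foreach ($i in 1..10000) { $l.Add($i) > $null }  # amortized O(1) append
}
"+=  : {0:N0} ms" -f $slow.TotalMilliseconds
"Add : {0:N0} ms" -f $fast.TotalMilliseconds
```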

That being said, the sleep seems a bit odd. If I'm not mistaken, you sleep 20 ms after each row:

sleep -milliseconds 20

Multiply 20 ms by the log size, 100 million iterations, and you get over 2,000,000 seconds — roughly 23 days — of total sleep time alone.

Try sleeping only after a decent batch size instead; 10,000 rows might be a good place to start:

if ($i % 10000 -eq 0) {
    Write-Host -NoNewline "."
    Start-Sleep -Milliseconds 20
}
