
wget: Trawl sample space of 100000000, max 100 results returned

I'm not sure whether this belongs on Stack Overflow or Code Review, as I'm open to completely different approaches to the problem and, though I've started with PowerShell, am not wedded to a particular language or style.

I'm currently working with a web server on which we aren't authorised to access the back end.

It returns a list of generated certificates based on a left-justified filter, e.g. if you type 100 in the search box and click submit, it will search for all certificates beginning with 100 (100*), i.e. the range 10000000 - 10099999.
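The mapping from a search prefix to the range of certificate numbers it covers is simple arithmetic. Here is a minimal sketch in Python (chosen only because the question is language agnostic); the function name `prefix_range` and the `width` parameter are illustrative, not part of the original script.

```python
from typing import Tuple


def prefix_range(prefix: str, width: int = 8) -> Tuple[int, int]:
    """Return the inclusive (low, high) range of width-digit numbers that start with prefix."""
    if not prefix.isdigit() or len(prefix) > width:
        raise ValueError("prefix must be 1 to %d digits" % width)
    pad = width - len(prefix)          # digits left free by the prefix
    low = int(prefix) * 10 ** pad      # prefix followed by all zeros
    high = low + 10 ** pad - 1         # prefix followed by all nines
    return low, high


print(prefix_range("100"))
```

Each extra digit in the prefix shrinks the covered range by a factor of ten, which is what the recursive search later in the question relies on.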

All of our certs are eight-digit numbers, giving a sample space of 00000000-99999999. I'm attempting to find which certificates in this sample space actually exist, given that certificate names must be unique.

The major caveat is that the server will only return the first 100 results: if more than 100 certificates exist in the queried range, the extras are discarded.

My first approach was to just use wget (technically PowerShell's Invoke-WebRequest) and iterate through the range of queries 000000 to 999999 (100 certificates at a time), which was working, and I was on track for a mid-September finish.

Unfortunately there are people who want this data sooner, so I've had to write a recursive function that (with my default input) queries a ten-million-cert sample space at once and searches progressively smaller spaces until fewer than 99 certs are returned for each subspace, then moves on to the next ten million.

The data isn't evenly distributed or very predictable: 'most' (~90%?) certs cluster around 10000000-19999999 and 30000000-39999999, but I need them all.

Here's the function I'm currently using. It seems to be working (results are being written to file, and faster than before), but it is still running. Are there any:

  1. Glaring errors with the function
  2. Better choices of inputs (for better efficiency)
  3. Completely different approaches that would be better

The variable '$certsession' is established outside this snippet and represents the web server session (login information, cookies etc.)

function RecurseCerts ($min, $max, $step, $level) {
    for ($certSpace = $min; $certSpace -le $max; $certSpace += $step) {
        # Zero-pad the query to $level digits: with a level of 3, '5' becomes 005 and '50' becomes 050. Numbers with three or more digits are unchanged.
        $levelMultiplier = "0" * $level
        $query = ($certSpace).ToString($levelMultiplier)
        $resultsArray = New-Object System.Collections.ArrayList
        "Query is $query"
        # Get the web page, split the content by newline, and collect every line that contains a certificate common name
        $response = Invoke-WebRequest -Uri "https://webserver.com/app?service=direct%2F1%2FSearchPage%2F%24Form&sp=S0&Form0=%24TextField%2C%24Submit&%24TextField=$query&%24Submit=Search" -WebSession $certsession
        foreach ($line in ($response.Content -split "`n")) {
            # Add() keeps $resultsArray an ArrayList; using '+' would copy it into a new fixed-size array on every match
            if ($line -match "cn=(.*?),ou") { [void]$resultsArray.Add($matches[1]) }
        }
        # More than 98 results means the page may be truncated, so make the search more specific
        if ($resultsArray.Count -gt 98) {
            "Recursing at $certSpace"
            $subLevel = $level + 1
            $subSpace = $certSpace * 10
            RecurseCerts -min $subSpace -max ($subSpace + 9) -step 1 -level $subLevel
        }
        else {
            # This is the most specific query needed for this range (0-98 results), so write it out to the file
            "Completed range $certSpace"
            $resultsArray | Out-File c:\temp\certlist.txt -Encoding utf8 -Append
        }
    }
}

#Level 3 means the query is a three-digit prefix, e.g. search 101 for the range 10100000 - 10199999
#Level 4 would be the subspace 1010-1019 (so a search for 1015 returns 10150000 - 10159999)
RecurseCerts -min 0 -max 9 -step 1 -level 1

Since I've added 'language agnostic', feel free to ask for any needed PowerShell clarifications. I could also attempt to rewrite it in pseudo-code if desired.

I think the fact that each range is only iterated once should prevent duplication when the function finishes a subspace and jumps back to the higher level (it should not re-capture things it already captured at a lower level), but I'd be lying if I said I fully understood the program flow here.

If it turns out there is duplication, I can just filter the text file for duplicates. However, I'd still be interested in approaches that eliminate this problem if it exists.
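The duplication worry can be checked offline by running the same refinement logic against a simulated server. The sketch below is in Python rather than PowerShell, and everything in it is an assumption for illustration: `make_fake_server` is a hypothetical stand-in for the web search, `trawl` mirrors the recursion in RecurseCerts, and the certificate set is made up. Results are kept only when a page is not treated as truncated, and those accepted prefixes never overlap, so each certificate should be written exactly once.

```python
WIDTH = 8            # certificates are eight-digit numbers
PAGE_LIMIT = 100     # the server returns at most this many results
RECURSE_ABOVE = 98   # same threshold as the PowerShell function


def make_fake_server(certs):
    """Hypothetical stand-in for the web search: prefix in, first page of matches out."""
    ordered = sorted(certs)

    def search(prefix):
        matches = [c for c in ordered if c.startswith(prefix)]
        return matches[:PAGE_LIMIT]

    return search


def trawl(search, prefixes, stats):
    """Return every certificate under the given prefixes, refining truncated pages."""
    found = []
    for prefix in prefixes:
        stats["requests"] += 1
        page = search(prefix)
        if len(page) > RECURSE_ABOVE and len(prefix) < WIDTH:
            # The page may be truncated: discard it and query the ten longer prefixes.
            found += trawl(search, [prefix + d for d in "0123456789"], stats)
        else:
            # Complete page for this prefix: keep it. No other accepted prefix covers this range.
            found += page
    return found


# Made-up data: one dense cluster, one sparse cluster, and two stragglers.
certs = {f"{n:08d}" for n in range(10000000, 10000250)}
certs |= {f"{n:08d}" for n in range(30000000, 30000500, 7)}
certs |= {"00000042", "99999999"}

stats = {"requests": 0}
found = trawl(make_fake_server(certs), list("0123456789"), stats)

print(len(found), "found,", len(set(found)), "unique,", stats["requests"], "requests")
```

The request counter also gives a rough feel for how many discarded (truncated) pages a given distribution costs before committing to a long run against the real server.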

*I've updated the code to display an indicator of progress to the console and, based on suggestions, also changed the array type used to ArrayList. The server is pretty fragile, so I've avoided multi-threading for now, but it would normally be a useful feature for tasks like this - here's a summary of some ways to do this in PowerShell.

Here's an example of the current behaviour. Notably, the entire ten-million range 00000000 - 09999999 had fewer than 99 certificates and was thus processed without needing a recursion.

[Image: RecurseCerts behaviour]

Moving my comments to an answer:

  • First suggestion: become authorised to access the back end.

  • The biggest room for a performance increase is threading or splitting the work over multiple clients. Since it's just a big space of numbers, you could easily:

    • have two PowerShell processes running, searching 00000000-49999999 in one and 50000000-99999999 in the other (or as many processes as you want).
    • have two computers doing it, if you have others you can access.
    • use PowerShell multiprocessing (threading, jobs, workflows), although those are more complex to use.
    • Since the slowness is mostly going to be server/network bound, it's probably not worth the more difficult techniques, but a script that takes start/end numbers and is run twice would be quite easy.

  • The code $resultsArray = $resultsArray + $matches[1] is very slow; arrays are fixed size, so this causes PowerShell to make a new array and copy the old array into it. In a loop adding many thousands of things, that is a lot of overhead. Use $a = [System.Collections.ArrayList]@() and $a.Add($thing) instead.

  • How fast can the server respond (is it on the LAN or the Internet)? If it's over a WAN connection, latency limits how fast you can go, but if it's searching a big database and takes a while to return a page, that puts a bigger limit on what you can speed up from the client side.

  • How big is the response page? Invoke-WebRequest parses the HTML into a full DOM, which is very slow, and you're not using the DOM so you don't need that. You can use [System.Net.WebClient] to download the content as a string:

e.g.

$web = New-Object System.Net.WebClient
$web.DownloadString($url)
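Once the page is available as a plain string, the common names can be pulled out with a single regular expression and no DOM parsing. A minimal Python sketch of that step follows, reusing the question's pattern `cn=(.*?),ou`; the function name and the sample markup are made up for illustration, and the real page layout may differ.

```python
import re

# Same pattern as the PowerShell -match in the question.
CN_PATTERN = re.compile(r"cn=(.*?),ou")


def extract_common_names(page_text):
    """Return every certificate common name found in the raw page text."""
    return CN_PATTERN.findall(page_text)


# Made-up sample; the real response markup is not shown in the question.
sample = (
    "<td>cn=10000001,ou=certs,o=example</td>\n"
    "<td>cn=10000002,ou=certs,o=example</td>\n"
    "<td>no certificate on this line</td>"
)
names = extract_common_names(sample)
print(names)
```

Unlike the per-line -match in the original function, `findall` also picks up a second match on the same line, which only matters if the server ever puts two entries on one line.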
  • In terms of design, how many certificates are you expecting in the 100M search space? 10k? 50M? Your recursive function risks searching, pulling, and ignoring the same certificates over and over while trying to get below 100. Depending on the distribution, I'd be tempted to look for and block out the biggest chunks with 0 certificates. If you can rule out a range of 1M in one request, that's enormously useful. Searching 1M, finding too many certs, searching 500K, too many, [...] searching 10K, finding too many, seems wasteful and slow.
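The earlier suggestion to split the number space between several processes can be sketched as a small partitioning helper. This Python version deals the two-digit prefixes out round-robin rather than cutting the space into contiguous halves. That is a deliberate swap from the halves described above: the question says most certificates cluster under 1x and 3x, so contiguous halves would leave one worker with nearly all the work. The function name and parameters are hypothetical.

```python
def split_prefixes(workers, digits=2):
    """Deal all prefixes of the given length out to the workers round-robin."""
    prefixes = [f"{n:0{digits}d}" for n in range(10 ** digits)]
    # Worker i gets prefixes i, i + workers, i + 2 * workers, ...
    return [prefixes[i::workers] for i in range(workers)]


chunks = split_prefixes(2)
for number, chunk in enumerate(chunks):
    print("worker", number, "starts with", chunk[:5], "and has", len(chunk), "prefixes")
```

Each worker would then run the recursive search over its own list of starting prefixes and write to its own output file. Given how fragile the server is said to be, two workers is probably a sensible upper limit to try first.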
