grep string between two other strings as delimiters

I have to do a report on how many times a certain CSS class appears in the content of our pages (over 10k pages). The trouble is, the header and footer contain that class, so a grep returns every single page.

So, how do I grep for content?

EDIT: I am looking for whether a page has list-unstyled between <main> and </main>.

So do I use a regular expression for that grep? Or do I need to use PowerShell to have more functionality?

I have grep and PowerShell at my disposal, but I could use portable software if that is my only option.

Ideally, I would get a report (.txt or .csv) with pages and line numbers where the class shows up, but just a list of the pages themselves would suffice.

EDIT: Progress

I now have this in PowerShell:

$files = get-childitem -recurse -path w:\test\york\ -Filter *.html 
foreach ($file in $files)
{
$htmlfile=[System.IO.File]::ReadAllText($file.fullName)
$regex="(?m)<main([\w\W]*)</main>"
if ($htmlfile -match $regex) { 
    $middle=$matches[1] 
    [regex]::Matches($middle,"list-unstyled")
    Write-Host $file.fullName has matches in the middle:
}
}

Which I run with this command: .\FindStr.ps1 | Export-Csv C:\Tools\text.csv

It outputs the filename and path with the string in the console, but does not add anything to the CSV. How can I get that added in?

Don't use string matches for something like this. Analyze the DOM instead. That should allow you to exclude headers and footers by selecting the appropriate root element.

$ie = New-Object -COM 'InternetExplorer.Application'

$url = '...'
$classname = 'list-unstyled'

$ie.Navigate($url)
do { Start-Sleep -Milliseconds 100 } until ($ie.ReadyState -eq 4)

$root = $ie.Document.getElementById('content-element-id')
$hits = @($root.getElementsByTagName('*') | ? { ($_.ClassName -split '\s+') -contains $classname })

$hits.Count  # number of occurrences of $classname below content element

You can create a regexp that is suitable for a multiline match. The regexp "(?m)<!-- main content -->([\w\W]*)<!-- end content -->" matches multiline content delimited by your comments. The (?m) part enables the multiline option, although that option only changes how ^ and $ behave; it is the group ([\w\W]*) that spans line breaks, because it matches any character including newlines. That group captures everything between your comments, and also lets you query $matches[1], which will contain your "main text" without headers and footers.

$htmlfile=[System.IO.File]::ReadAllText($fileToGrep)
$regex="(?m)<!-- main content -->([\w\W]*)<!-- end content -->"
if ($htmlfile -match $regex) { 
    $middle=$matches[1] 
    [regex]::Matches($middle,"list-unstyled")
}

This is only an example of how you should parse the file. You populate $fileToGrep with the name of the file you want to parse, then run this snippet to receive the collection of all list-unstyled matches found in the middle of that file.
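To apply the same idea to a whole directory and get something Export-Csv can consume, the snippet can be wrapped in a loop that emits objects rather than console text (Write-Host output never reaches the pipeline). The following is only a sketch under a few assumptions: Find-ClassInMain is a made-up helper name, the delimiters are the <main> and </main> tags from the question, each page has a single <main> element, and PowerShell 3.0 or later is available for [pscustomobject].

```powershell
# Sketch: count occurrences of a class name inside <main>...</main> per file.
# Find-ClassInMain is a hypothetical helper name, not part of the original script.
function Find-ClassInMain {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [string]$ClassName = 'list-unstyled'
    )

    Get-ChildItem -Path $Path -Recurse -Filter *.html | ForEach-Object {
        $html = [System.IO.File]::ReadAllText($_.FullName)

        # (?s) lets '.' match newlines; '.*?' stops at the first </main>
        if ($html -match '(?s)<main\b.*?</main>') {
            $hits = [regex]::Matches($Matches[0], [regex]::Escape($ClassName)).Count
            if ($hits -gt 0) {
                # Emitting an object puts it on the pipeline, so Export-Csv receives it
                [pscustomobject]@{
                    File = $_.FullName
                    Hits = $hits
                }
            }
        }
    }
}
```

With the paths from the question it could be run as Find-ClassInMain -Path w:\test\york\ | Export-Csv C:\Tools\text.csv -NoTypeInformation. Files whose only matches sit in the header or footer produce no row.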

What Ansgar Wiechers' answer says is good advice. Don't string search html files. I don't have a problem with it, but it is worth noting that not all html files are the same and regex searches can produce flawed results. If tools exist that are aware of the file content structure, you should use them.

I would like to take a simple approach that reports every html file in a given directory that has enough occurrences of the text list-unstyled. You expect there to be 2? So if more than that show up, then there is enough. I would have done a more complicated regex solution, but since you want the line numbers as well I came up with this compromise.

$pattern = "list-unstyled"
Get-ChildItem C:\temp -Recurse -Filter *.html | 
    Select-String $pattern | 
    Group-Object Path | 
    Where-Object{$_.Count -gt 2} | 
    ForEach-Object{
        $props = @{
            File = $_.Group | Select-Object -First 1 -ExpandProperty Path
            PatternFound = ($_.Group | Select-Object -ExpandProperty LineNumber) -join ";"
        }

        New-Object -TypeName PSCustomObject -Property $props
    }

Select-String is a grep-like tool that can search files for strings. It reports the line number of each match in the file, which is why we are using it here.

You should get output that looks like this on your PowerShell console.

File                                                                           PatternFound                                                                  
----                                                                           ------------                                                                  
C:\temp\content.html                                                           4;11;54

Where 4, 11 and 54 are the lines where the text was found. The code filters out results where the count of matching lines is less than 3. So if you expect it once in the header and once in the footer, those results are excluded.
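If counting on exactly two header and footer hits feels fragile, a variation on the same Select-String idea is to keep only the hits whose line number falls between the <main> line and the </main> line. This is a sketch, not a tested production script: Get-MainSectionHit is a made-up name, it assumes one <main> element per file, and it works at line granularity, so a header hit that shares a line with the <main> tag would still be counted.

```powershell
# Sketch: keep only hits located between the <main> and </main> lines.
# Get-MainSectionHit is a hypothetical helper name; assumes one <main> per file.
function Get-MainSectionHit {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [string]$Pattern = 'list-unstyled'
    )

    Get-ChildItem -Path $Path -Recurse -Filter *.html | ForEach-Object {
        $file  = $_.FullName
        # First line that opens and first line that closes the main element
        $open  = Select-String -LiteralPath $file -Pattern '<main\b' | Select-Object -First 1
        $close = Select-String -LiteralPath $file -Pattern '</main>' | Select-Object -First 1

        if ($open -and $close) {
            $lines = @(Select-String -LiteralPath $file -Pattern $Pattern -SimpleMatch |
                Where-Object { $_.LineNumber -ge $open.LineNumber -and $_.LineNumber -le $close.LineNumber } |
                Select-Object -ExpandProperty LineNumber)

            if ($lines.Count -gt 0) {
                [pscustomobject]@{
                    File         = $file
                    PatternFound = $lines -join ';'
                }
            }
        }
    }
}
```

The output has the same File and PatternFound columns as above, so Get-MainSectionHit -Path C:\temp | Export-Csv C:\Tools\text.csv -NoTypeInformation would write the report the question asks for.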

