
Preprocessing HTML data with PowerShell

I have some HTML source code containing customer data that needs to be cleaned of HTML tags before it is deployed with a line-joining string split.

I want to be able to target specific types of information. For example, say a customer has a list of categories on his page. Each 'category' sits inside an easily distinguishable tag:

<span _ngcontent-jal-c67="" class="category-name">Cryptocurrency</span>

Would it be possible to remove everything that is not nested inside a similar HTML tag?

Let's say, for example, I want everything that occurs inside of <span *>*</span>, so that every non-<span></span> tag and its contents would be removed. The contents of all the <span ***>***</span> tags would stay, without the tag. Is that something I could do in PowerShell? Let's avoid paste.exe and Cygwin type of stuff. I'm looking for a standard native Windows approach (cmd or PowerShell).

Again, I want to remove all tags.

The contents that I don't remove should be limited to those found in a specific tag, like <span _ngcontent-jal-c68="" class="category-name">Shopping</span>: everything that fits the <span *>*</span> profile.

Leave only the contents, no tag.

From: <span _ngcontent-jal-c32="" class="category-name">Home and Graden</span>

To: Home and Graden

I'm really looking for an answer on how to do this in PowerShell without needing to install anything or make any interesting changes to the OS (Windows 10).

Please try to investigate the problem before asking on Stack Overflow. Did you know there is a -replace operator in PowerShell which allows you to use RegEx? Did you consider that RegEx might help you with your problem?

Anyway, here is one approach you could take.

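$html = '<span _ngcontent-jal-c32="" class="category-name">Home and Graden</span>'
# -match returns $true on success and fills the automatic $Matches variable
if ($html -match '(<span.*>)(?<Category>.+)(</span>)') {
    # The named group holds the text between the opening and closing span tags
    $Matches.Category
}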

Home and Graden

The -match operator tests a string against a RegEx. The RegEx (<span.*>)(?<Category>.+)(</span>) creates three groups, the second of which is named Category and captures the text in between the span tags. This only works if every category in your input sits inside a span tag. If -match returns true, the automatic variable $Matches is filled. Since we named the second group Category, we can easily access it as a property with $Matches.Category .
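Note that -match against a single string only reports the first match, so it will not list every category on a page. A minimal sketch of one way to collect all of them, assuming the span contents are plain text with no nested tags: the sample input and the variable names $pattern and $categories are made up for illustration, and the pattern is a tightened variant of the one above, not the same expression.

```powershell
# Hypothetical sample input: several tags, only some of which are <span> elements.
$html = @'
<div class="profile">
  <h1>Customer</h1>
  <span _ngcontent-jal-c67="" class="category-name">Cryptocurrency</span>
  <p>Some other text</p>
  <span _ngcontent-jal-c68="" class="category-name">Shopping</span>
</div>
'@

# [^>]* stops the opening tag at its own '>', and [^<]+ captures text up to the next tag,
# so two spans in the same input are never merged into one match.
$pattern = '<span\b[^>]*>(?<Category>[^<]+)</span>'

# [regex]::Matches returns every match, not just the first one.
$categories = [regex]::Matches($html, $pattern) |
    ForEach-Object { $_.Groups['Category'].Value }

$categories
```

Everything outside the span tags (the h1 and p elements here) is simply never matched, so only the span contents end up in $categories. To get one joined line, something like $categories -join ',' should do.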

Alternatively, and for more complex HTML files even preferably, you can parse HTML with PowerShell; see Powershell Tip : Parsing HTML from a local File or a String

Instead of using delicate regular expressions, you might just decode the string with the [System.Net.WebUtility]::HtmlDecode method and cast the result to [Xml]:

$Html = '<span _ngcontent-jal-c67="" class="category-name">Cryptocurrency</span>'
([Xml][System.Net.WebUtility]::HtmlDecode($Html)).GetElementsByTagName('span').'#text'

Result:

Cryptocurrency
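The [Xml] cast only accepts well-formed XML with a single root element, so a fragment containing several sibling tags would be rejected as it stands. A minimal sketch of one workaround, assuming the fragment is otherwise well-formed: the sample input, the wrapper element name root, and the variable names are made up for illustration.

```powershell
# Hypothetical fragment: two sibling spans plus an unrelated <p>, with no common root.
$Fragment = @'
<span _ngcontent-jal-c67="" class="category-name">Cryptocurrency</span>
<p>Some other text</p>
<span _ngcontent-jal-c68="" class="category-name">Shopping</span>
'@

# Wrap the fragment in a placeholder <root> element so the [Xml] cast sees one document element.
$Doc = [Xml]('<root>' + [System.Net.WebUtility]::HtmlDecode($Fragment) + '</root>')

# InnerText returns the text of each span without the tag itself.
$Categories = $Doc.GetElementsByTagName('span') | ForEach-Object { $_.InnerText }
$Categories
```

Real-world HTML is often not valid XML (unclosed tags, unquoted attributes, a bare & in the text), in which case the cast throws and a regex or a real HTML parser is the safer choice.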

