
Parse HTML, multiple classes

I want to create a PowerShell script to get information from a website. I am trying to find the first occurrence of the following HTML tag on the page:

<div class="dDoNo gsrt"><span data-dobid="hdw">Text I want to find</span></div>

I am using the following PowerShell code without success; it gives me no output:

$WebResponse = Invoke-WebRequest "https://www.google.co.in/search?hl=en&q=define+Text"
($WebResponse.ParsedHtml.GetElementsByTagName(‘div’) | Where {
    $_.ClassName -eq ‘dDoNo’
}).InnerText

To be more precise: I am trying to get the definition of a word by scraping the HTML from Google, and I am using this class as a base: googleDictionaryAPI class

For one thing, you need to call GetElementsByTagName() on the DocumentElement child node of ParsedHtml, otherwise you don't get any results at all. Also, the class string "dDoNo gsrt" does not equal "dDoNo", so the -eq comparison never succeeds; you need to test whether the value contains the class name "dDoNo".

Change

($WebResponse.ParsedHtml.GetElementsByTagName(‘div’) | Where {
    $_.ClassName -eq ‘dDoNo’
}).InnerText

to

($WebResponse.ParsedHtml.DocumentElement.GetElementsByTagName('div') | Where {
    $_.ClassName -match '\bdDoNo\b'
}).InnerText

and the code should do what you want.
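As a minimal sketch of why the comparison operator matters, the snippet below applies three tests to the class string from the question. The variable names are only illustrative, and the third test (splitting the attribute on whitespace) is an alternative that is not part of the answer above.

```powershell
# The value ClassName returns for <div class="dDoNo gsrt">.
$className = 'dDoNo gsrt'

# Exact string comparison: fails because the attribute lists two classes.
$exact = $className -eq 'dDoNo'

# Regex with word boundaries: succeeds when dDoNo appears as a whole word.
$regex = $className -match '\bdDoNo\b'

# Token comparison: split on whitespace and look for the exact class name.
$token = ($className -split '\s+') -contains 'dDoNo'

"$exact $regex $token"   # False True True
```

One difference to keep in mind: \b also treats a hyphen as a word boundary, so '\bdDoNo\b' would match a class such as "dDoNo-x" as well, whereas the token comparison would not.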

Note that using typographic quotes (‘ ’) in code is not recommended. While they work most of the time, I did encounter situations where they caused things to break in interesting ways. Use plain quotes (' ') instead.

Thanks to @Ansgar for pointing me to the correct solution.

The main problem was that the response I got from Invoke-WebRequest was different from the one I got from a browser. The solution was to specify a user agent (the -UserAgent parameter) when invoking the request:

$WebResponse = (Invoke-WebRequest -Uri "https://www.google.co.in/search?hl=en&q=define+Text" -UserAgent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36")

($WebResponse.ParsedHtml.DocumentElement.GetElementsByTagName('div') | Where {
    $_.ClassName -match '\bdDoNo\b'
}).InnerText
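The pipeline above returns the text of every matching div, while the question asks for the first occurrence only. A possible way to stop at the first match is sketched below. Get-FirstTextByClass is a hypothetical helper, not an existing cmdlet, and the sample objects are stand-ins with the same ClassName and InnerText properties as the parsed elements, so the snippet runs without a web request.

```powershell
# Hypothetical helper: returns the InnerText of the first element whose
# class list contains the requested class name.
function Get-FirstTextByClass {
    param(
        [Parameter(Mandatory)] [object[]] $Elements,
        [Parameter(Mandatory)] [string]   $ClassName
    )
    $Elements |
        Where-Object { ($_.ClassName -split '\s+') -contains $ClassName } |
        Select-Object -First 1 |
        ForEach-Object { $_.InnerText }
}

# Stand-in objects (assumed sample data, not real search results).
$divs = @(
    [pscustomobject]@{ ClassName = 'other';      InnerText = 'skip me' }
    [pscustomobject]@{ ClassName = 'dDoNo gsrt'; InnerText = 'Text I want to find' }
    [pscustomobject]@{ ClassName = 'dDoNo';      InnerText = 'second match' }
)

$first = Get-FirstTextByClass -Elements $divs -ClassName 'dDoNo'
$first   # Text I want to find
```

With a real response, the elements could presumably be passed as @($WebResponse.ParsedHtml.DocumentElement.GetElementsByTagName('div')). This assumes an environment where ParsedHtml is populated, as in the code above.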

