
VB.NET HTML Loops

Hello, I am trying to build a web scraper for Craigslist. The code below works well for what I am trying to do. The problem is that it relies on a WebBrowser control, and I want to pass in many more URLs to parse. I will have a list of, say, 100 URLs, and I am not sure I can do that with the WebBrowser control.

I looked into WebRequest, but it seems I would then have to parse the response as though it were a text file rather than HTML, so I could not read the HTML attributes the way I do below. Any help would be great.

Private Sub btnGetData_Click(sender As Object, e As EventArgs) Handles btnGetData.Click
    
     clsScrape.ScrapeHTML(WebBrowser1, dgvData, "http://newyork.craigslist.org")
End Sub



   Public Shared Sub ScrapeHTML(ByVal webBrows As WebBrowser, ByRef DataGridView1 As DataGridView, ByVal strCityLink As String)
    'Change list box to datagridview to add rows. Will be passing multiple cities 
    For Each element As HtmlElement In webBrows.Document.All

        Dim WebDate As String = ""

        If element.GetAttribute("className") = "result-info" Then

            'loop through the child elements
            For Each child As HtmlElement In element.Children

                'if the date is today, capture it; otherwise exit the loop
                If child.GetAttribute("className") = "result-date" Then
                    If child.InnerHtml = "Dec 30" Then

                        WebDate = child.InnerHtml

                    Else
                        Exit For
                    End If
                End If



                If child.GetAttribute("className") = "result-title hdrlnk" Then

                    Dim input As String = child.OuterHtml
                    Dim result As String() = input.Split("""")
                    Dim link As String = strCityLink & result(3)
                    Dim Title As String = child.InnerHtml

                    DataGridView1.Rows.Add(New String() {WebDate, Title, link})

                End If
            Next

        End If
    Next

End Sub

Definitely use HttpWebRequest; in your case you can get away with WebClient. Load the downloaded HTML into HtmlAgilityPack, and then you can pull out the information you're looking for.

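   ' Assumes Imports System.Net and a reference to the HtmlAgilityPack package.
   Dim oWebClient As New WebClient()
   oWebClient.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36")

   ' URL is the address of the page to scrape.
   Dim html = New HtmlAgilityPack.HtmlDocument()
   html.LoadHtml(oWebClient.DownloadString(URL))

   ' SelectNodes returns Nothing when no node matches, so check before looping.
   Dim nodes = html.DocumentNode.SelectNodes("//*[@class=""result-date""]")
   If nodes IsNot Nothing Then
       For Each node As HtmlAgilityPack.HtmlNode In nodes
           ' node.InnerText holds the text of each result-date element
       Next
   End If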

You'll find plenty of examples on how to use HtmlAgilityPack. This is just to get you started; spend a little time working with it and you can easily accomplish what you want.

Also keep in mind that WebRequest and WebClient each make a single request. A web browser builds the entire page, which might take many requests. A WebClient or WebRequest won't render a page as a browser would, because the browser also loads all the external content the page links to. Anything the page adds through scripts after loading will not be in the HTML you download.
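To tie this back to the list of URLs in the question, here is a minimal sketch of how the loop might look with WebClient and HtmlAgilityPack. The names ListingScraper, ParseListings and ScrapeAll are made up for illustration, the class names are taken from the question's code, and it assumes each title element carries a relative href, so treat it as a starting point rather than a finished scraper.

```vbnet
Imports System.Collections.Generic
Imports System.Net

' Hypothetical helper module; requires a reference to the HtmlAgilityPack package.
Public Module ListingScraper

    ' Parses one page of HTML and returns {date, title, link} rows.
    ' The class names are the ones used in the question; the live markup may differ.
    Public Function ParseListings(ByVal pageHtml As String, ByVal cityLink As String,
                                  ByVal wantedDate As String) As List(Of String())
        Dim rows As New List(Of String())
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(pageHtml)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim infos As HtmlAgilityPack.HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//*[@class='result-info']")
        If infos Is Nothing Then Return rows

        For Each info As HtmlAgilityPack.HtmlNode In infos
            Dim dateNode As HtmlAgilityPack.HtmlNode =
                info.SelectSingleNode(".//*[@class='result-date']")
            Dim titleNode As HtmlAgilityPack.HtmlNode =
                info.SelectSingleNode(".//*[@class='result-title hdrlnk']")

            ' Skip entries that lack either element or are not from the wanted date.
            If dateNode Is Nothing OrElse titleNode Is Nothing Then Continue For
            If dateNode.InnerText.Trim() <> wantedDate Then Continue For

            ' Read the href attribute directly instead of splitting OuterHtml on quotes.
            ' Assumption: the href is relative to the city link, as in the question.
            Dim href As String = titleNode.GetAttributeValue("href", "")
            rows.Add(New String() {dateNode.InnerText.Trim(),
                                   titleNode.InnerText.Trim(),
                                   cityLink & href})
        Next
        Return rows
    End Function

    ' Downloads every city page in turn and collects the matching rows.
    Public Function ScrapeAll(ByVal cityLinks As IEnumerable(Of String),
                              ByVal wantedDate As String) As List(Of String())
        Dim allRows As New List(Of String())
        Using client As New WebClient()
            For Each cityLink As String In cityLinks
                ' Set the header before each request in case it is not kept between calls.
                client.Headers.Set("User-Agent", "Mozilla/5.0")
                Dim pageHtml As String = client.DownloadString(cityLink)
                allRows.AddRange(ParseListings(pageHtml, cityLink, wantedDate))
            Next
        End Using
        Return allRows
    End Function

End Module
```

From the button handler you could then call ScrapeAll with your list of city URLs and the date string, and pass each returned array to dgvData.Rows.Add. Keeping the parsing in a function that takes a plain HTML string also lets you try it on a saved page without hitting the site.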


 