VB.NET HTML Loops

Hello, I am trying to build a web scraper for Craigslist. The code below works well for what I am trying to do. The problem is that I am using a WebBrowser control. I want to pass in many more URLs to parse for data, say a list of 100 URLs, and I am not sure the WebBrowser control lets me do that.

I looked into WebRequest, but it seems I would then have to parse the response as if it were a text file rather than HTML, so I could not read the HTML attributes the way I do below. Any help would be great.

Private Sub btnGetData_Click(sender As Object, e As EventArgs) Handles btnGetData.Click
    
     clsScrape.ScrapeHTML(WebBrowser1, dgvData, "http://newyork.craigslist.org")
End Sub



   Public Shared Sub ScrapeHTML(ByVal webBrows As WebBrowser, ByRef DataGridView1 As DataGridView, ByVal strCityLink As String)
    'Change list box to datagridview to add rows. Will be passing multiple cities 
    For Each element As HtmlElement In webBrows.Document.All

        Dim WebDate As String = ""

        If element.GetAttribute("className") = "result-info" Then

            'loop through the child elements
            For Each child As HtmlElement In element.Children

                'if the date is today, capture it; otherwise exit the loop
                If child.GetAttribute("className") = "result-date" Then
                    If child.InnerHtml = "Dec 30" Then

                        WebDate = child.InnerHtml

                    Else
                        Exit For
                    End If
                End If



                If child.GetAttribute("className") = "result-title hdrlnk" Then

                    Dim input As String = child.OuterHtml
                    Dim result As String() = input.Split("""")
                    Dim link As String = strCityLink & result(3)
                    Dim Title As String = child.InnerHtml

                    DataGridView1.Rows.Add(New String() {WebDate, Title, link})

                End If
            Next

        End If
    Next

End Sub

Definitely use HttpWebRequest, or in your case you can get away with WebClient. Then load the downloaded HTML into the HtmlAgilityPack, and you can pull out the information you're looking for.

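   ' Requires Imports System.Net and the HtmlAgilityPack NuGet package.
   ' URL is the address of the page to download.
   Dim oWebClient As New WebClient()
   oWebClient.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36")

   Dim html = New HtmlAgilityPack.HtmlDocument()
   html.LoadHtml(oWebClient.DownloadString(URL))

   ' SelectNodes returns Nothing when no node matches the XPath.
   Dim nodes = html.DocumentNode.SelectNodes("//*[@class=""result-date""]")
   If nodes IsNot Nothing Then
       For Each node As HtmlAgilityPack.HtmlNode In nodes
           ' Read node.InnerText, node.GetAttributeValue(...), etc. here.
       Next
   End If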

You'll find thousands of examples of how to use the HtmlAgilityPack. This is just to get you started; spend a little time working with it and you can easily accomplish what you want.
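To tie this back to the list of 100 URLs in the question, here is a rough sketch of how the loop might look once the WebBrowser control is out of the picture. The names `ScrapedRow`, `CraigslistScraper`, `ParseListings` and `ScrapeCities` are made up for illustration. The class names (`result-info`, `result-date`, `result-title hdrlnk`) are taken from the question's code and may not match the live site, and the sketch assumes, as the question's code does, that each listing's `href` is relative to the city link. The HtmlAgilityPack types are fully qualified so they don't clash with `System.Windows.Forms.HtmlDocument` in a WinForms project.

```vbnet
Imports System.Collections.Generic
Imports System.Net

' Hypothetical container for one scraped listing.
Public Class ScrapedRow
    Public Property WebDate As String
    Public Property Title As String
    Public Property Link As String
End Class

Public Module CraigslistScraper

    ' Parses one page of HTML and returns one row per listing found.
    ' Class names mirror the question's code; they are assumptions about the page.
    Public Function ParseListings(pageHtml As String, cityLink As String) As List(Of ScrapedRow)
        Dim rows As New List(Of ScrapedRow)()
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(pageHtml)

        ' SelectNodes returns Nothing when nothing matches.
        Dim infos = doc.DocumentNode.SelectNodes("//*[@class='result-info']")
        If infos Is Nothing Then Return rows

        For Each info As HtmlAgilityPack.HtmlNode In infos
            Dim dateNode = info.SelectSingleNode(".//*[@class='result-date']")
            Dim titleNode = info.SelectSingleNode(".//a[@class='result-title hdrlnk']")
            If dateNode Is Nothing OrElse titleNode Is Nothing Then Continue For

            rows.Add(New ScrapedRow With {
                .WebDate = dateNode.InnerText.Trim(),
                .Title = titleNode.InnerText.Trim(),
                .Link = cityLink & titleNode.GetAttributeValue("href", "")
            })
        Next

        Return rows
    End Function

    ' Downloads each city page in turn and collects the rows from all of them.
    Public Function ScrapeCities(cityLinks As IEnumerable(Of String)) As List(Of ScrapedRow)
        Dim allRows As New List(Of ScrapedRow)()
        Using client As New WebClient()
            For Each cityLink As String In cityLinks
                ' Set the header before every request so it is always present.
                client.Headers.Set("User-Agent", "Mozilla/5.0")
                allRows.AddRange(ParseListings(client.DownloadString(cityLink), cityLink))
            Next
        End Using
        Return allRows
    End Function

End Module
```

From the button handler you could then call `ScrapeCities` with your list of city URLs and add each result to the grid with `dgvData.Rows.Add(row.WebDate, row.Title, row.Link)`. Keeping the parsing in `ParseListings` separate from the download also means you can try it against a saved HTML string without hitting the site.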

Also keep in mind that WebRequest and WebClient each make a single request. A web browser goes out and builds the entire page, which may involve many requests. A WebClient or WebRequest won't render a page as a browser would, because the browser also loads all the external content the page links to.
