Hello, I am trying to build a web scraper for Craigslist. The code below works well for what I am trying to do. The problem is that it relies on a WebBrowser control, and I want to pass in many more URLs to parse. For example, I will have a list of, say, 100 URLs, and I am not sure I can do that with the WebBrowser control.
I looked into WebRequest, but with WebRequest it seems I would have to parse the response as plain text rather than as HTML, so I could not read the attributes of the HTML elements the way I do below. Any help would be great.
    Private Sub btnGetData_Click(sender As Object, e As EventArgs) Handles btnGetData.Click
        clsScrape.ScrapeHTML(WebBrowser1, dgvData, "http://newyork.craigslist.org")
    End Sub
    Public Shared Sub ScrapeHTML(ByVal webBrows As WebBrowser, ByRef DataGridView1 As DataGridView, ByVal strCityLink As String)
        'Changed the list box to a DataGridView to add rows. Will be passing multiple cities
        For Each element As HtmlElement In webBrows.Document.All
            Dim WebDate As String = ""
            If element.GetAttribute("className") = "result-info" Then
                'Loop through the child elements
                For Each child As HtmlElement In element.Children
                    'If the date is today, capture it; otherwise stop looping over the children
                    If child.GetAttribute("className") = "result-date" Then
                        If child.InnerHtml = "Dec 30" Then
                            WebDate = child.InnerHtml
                        Else
                            Exit For
                        End If
                    End If
                    If child.GetAttribute("className") = "result-title hdrlnk" Then
                        'The href value is the fourth piece when OuterHtml is split on double quotes
                        Dim input As String = child.OuterHtml
                        Dim result As String() = input.Split("""")
                        Dim link As String = strCityLink & result(3)
                        Dim Title As String = child.InnerHtml
                        DataGridView1.Rows.Add(New String() {WebDate, Title, link})
                    End If
                Next
            End If
        Next
    End Sub
Definitely use HttpWebRequest, or in your case you can get away with WebClient. Load the downloaded HTML into HtmlAgilityPack, and then you can pull out the information you're looking for.
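    ' Needs Imports System.Net and the HtmlAgilityPack package; URL is a String holding the page address.
    Dim oWebClient As New WebClient()
    oWebClient.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36")
    Dim html As New HtmlAgilityPack.HtmlDocument()
    html.LoadHtml(oWebClient.DownloadString(URL))
    ' SelectNodes returns Nothing when no node matches, so check before looping.
    Dim nodes As HtmlAgilityPack.HtmlNodeCollection = html.DocumentNode.SelectNodes("//*[@class=""result-date""]")
    If nodes IsNot Nothing Then
        For Each node As HtmlAgilityPack.HtmlNode In nodes
            ' node.InnerText holds the text of each result-date element
        Next
    End If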
You'll find thousands of examples of how to use HtmlAgilityPack. This is just to get you started; spend a little time working with it and you can easily accomplish what you want.
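To mirror the WebBrowser loop from the question, something along these lines might work. It is only a sketch: it assumes the class names used in the question (result-info, result-date, result-title hdrlnk) still match the page markup, and the names ListingParser, ExtractListings, pageHtml, and baseUrl are made up for illustration.

```vbnet
Imports System.Collections.Generic

Public Module ListingParser
    ' Hypothetical helper: returns one {date, title, link} array per listing.
    ' Assumes the markup uses the class names from the question.
    Public Function ExtractListings(ByVal pageHtml As String, ByVal baseUrl As String) As List(Of String())
        Dim rows As New List(Of String())
        ' Fully qualified so it does not clash with System.Windows.Forms.HtmlDocument.
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(pageHtml)

        ' SelectNodes returns Nothing when nothing matches.
        Dim infos As HtmlAgilityPack.HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//*[@class=""result-info""]")
        If infos Is Nothing Then Return rows

        For Each info As HtmlAgilityPack.HtmlNode In infos
            ' The leading dot keeps each search inside the current result-info node.
            Dim dateNode As HtmlAgilityPack.HtmlNode =
                info.SelectSingleNode(".//*[@class=""result-date""]")
            Dim titleNode As HtmlAgilityPack.HtmlNode =
                info.SelectSingleNode(".//a[@class=""result-title hdrlnk""]")
            If dateNode Is Nothing OrElse titleNode Is Nothing Then Continue For

            ' Read href directly instead of splitting OuterHtml on quotes.
            Dim href As String = titleNode.GetAttributeValue("href", "")
            rows.Add(New String() {dateNode.InnerText.Trim(), titleNode.InnerText.Trim(), baseUrl & href})
        Next
        Return rows
    End Function
End Module
```

Each returned array can be handed to DataGridView1.Rows.Add the same way the question does, and the "Dec 30" check can be applied to the first element of each array before adding the row.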
Also keep in mind that WebRequest and WebClient each make a single request. Web browsers go out and build an entire web page, which might consist of many requests. A WebClient or WebRequest won't render a page as a browser would, because only the browser loads and applies all the external content the page might be linking to.
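Since each WebClient call is one request, handling the list of URLs from the question comes down to a plain loop. The following is a rough sketch rather than a tuned implementation: PageFetcher, DownloadPages, and the one-second pause between requests are assumptions for illustration, and the user-agent string is the one from the snippet above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Net
Imports System.Threading

Public Module PageFetcher
    ' Hypothetical helper: downloads each URL once and returns a URL -> raw HTML map.
    Public Function DownloadPages(ByVal urls As IEnumerable(Of String)) As Dictionary(Of String, String)
        Dim pages As New Dictionary(Of String, String)
        Using client As New WebClient()
            For Each url As String In urls
                Try
                    ' Set the user agent before every request, in case WebClient
                    ' does not keep headers between requests.
                    client.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36")
                    pages(url) = client.DownloadString(url)
                Catch ex As WebException
                    ' Skip pages that fail to download and keep going.
                    Console.WriteLine("Failed: " & url & " (" & ex.Message & ")")
                End Try
                ' Assumed pause between requests; adjust or remove as needed.
                Thread.Sleep(1000)
            Next
        End Using
        Return pages
    End Function
End Module
```

The HTML string for each URL can then be loaded into HtmlAgilityPack with LoadHtml. This runs on the calling thread, so in a WinForms app it would freeze the UI for the duration unless it is moved to a background worker.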