
Web Scraping with C# and HtmlAgilityPack

[Screenshot of the code, the error message, and the variable values]

So, the goal is to take a word and get its part of speech from its Google definition.

I've tried a few different approaches, but I get a null reference error every time. Is my code failing to access the webpage? Is it a firewall issue, a logic issue, an {insert-issue-here} problem? I really wish I had even a vague idea of what is wrong.

Thanks for your time.

Addendum: I've also tried //*[@id="source-luna"]//div and //*[@id="source-luna"]/div[1] as XPath values.

    //attempt 1////////////////////////////////////////////////////////////////////////
    var term = "Hello";
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.urbandictionary.com/define.php?term=" + term);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    StreamReader stream = new StreamReader(response.GetResponseStream());
    string final_response = stream.ReadToEnd();
    MessageBox.Show(final_response); //doesn't execute

    //attempt 2////////////////////////////////////////////////////////////////////////
    var url = "https://www.google.co.za/search?q=define+position";
    var content = new System.Net.WebClient().DownloadString(url);
    var webGet = new HtmlWeb();
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(content); //doc is null at runtime
    HtmlNode ourNode = doc.DocumentNode.SelectSingleNode("//*[@id=\"uid_0\"]/div[1]/div/div[1]/div[2]/div[2]/div[1]/i/span");
    if (ourNode != null)
    {
        richTextBox1.AppendText(ourNode.InnerText);
    }
    else
        richTextBox1.AppendText("null");

    //attempt 3////////////////////////////////////////////////////////////////////////
    var webGet = new HtmlWeb();
    var doc = webGet.Load("https://www.google.co.za/search?q=define+position"); //doc is null at runtime
    HtmlNode ourNode = doc.DocumentNode.SelectSingleNode("//*[@id=\"uid_0\"]/div[1]/div/div[1]/div[2]/div[2]/div[1]/i/span");
    if (ourNode != null)
    {
        richTextBox1.AppendText(ourNode.InnerText);
    }
    else
        richTextBox1.AppendText("null");

    //attempt 4////////////////////////////////////////////////////////////////////////
    string Url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load(Url); //doc is null at runtime
    string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
    string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
    string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
    richTextBox1.AppendText(metascore + " " + userscore + " " + summary);

    //attempt 5////////////////////////////////////////////////////////////////////////
    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument html = web.Load("https://www.google.co.za/search?q=define+position"); //html is null
    var div = html.DocumentNode.SelectNodes("//*[@id=\"uid_0\"]/div[1]/div/div[1]/div[2]/div[2]/div[1]/i/span");
    richTextBox1.AppendText(Convert.ToString(div));

You are getting null because your XPath expressions are incorrect, or because no node in the downloaded HTML matches them. What are you trying to achieve here?
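To illustrate the point, here is a minimal sketch of how a non-matching XPath produces null in HtmlAgilityPack. It parses an inline HTML string rather than a live page, so the markup, the `XPathProbe` class, and the `FirstTextOrNull` helper are all made up for the example; the real Google or Metacritic markup will look different. A path copied from a browser's developer tools describes the page after scripts have run, while HtmlAgilityPack only sees the raw HTML it was given, so a long positional path may simply not exist there.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathProbe
{
    // Hypothetical helper: returns the trimmed text of the first node that
    // matches the XPath, or null when nothing matches.
    public static string FirstTextOrNull(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode does not throw on a miss; it returns null.
        // SelectNodes behaves the same way, so indexing its result with [0]
        // without a null check is what raises the NullReferenceException.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText.Trim();
    }

    public static void Main()
    {
        // Made-up markup standing in for a downloaded page (an assumption).
        const string html =
            "<div id=\"uid_0\"><div><i><span>noun</span></i></div></div>";

        // A short path anchored on the id matches the span.
        Console.WriteLine(
            FirstTextOrNull(html, "//*[@id=\"uid_0\"]//i/span") ?? "null");   // prints "noun"

        // A long positional path that does not fit this markup matches nothing.
        Console.WriteLine(
            FirstTextOrNull(html, "//*[@id=\"uid_0\"]/div[1]/div/div[1]/i/span") ?? "null");   // prints "null"
    }
}
```

A practical way to debug the original attempts along these lines is to save the downloaded HTML to a file, search it for the id used in the XPath, and shorten the path step by step until a node is found.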
