I am writing a program in C#. I need to know whether there is a way to open the URL of a site and look for keywords in its text. For example, if my program gets the URL http://www.google.com and the keyword "gmail", it should return true. In short, I need to know if there is a way to go to a URL, download the HTML file, and convert it to text so I can look for my keyword.
It sounds like you want to remove all the HTML tags and then search the resulting text.
My first reaction was to use a Regular Expression:
String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);
Shamelessly stolen from: Using C# regular expressions to remove HTML tags
That question also suggests the HTML Agility Pack, which sounds like exactly what you're looking for.
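Putting the two steps together (strip the tags, then search what is left), a minimal sketch might look like the following. The class and method names (`KeywordSearch`, `ContainsKeyword`) are made up for illustration, and the regex is the same rough one shown above, so it will misbehave on comments, scripts, and other HTML the pattern does not anticipate.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class KeywordSearch
{
    // Hypothetical helper: returns true if the keyword appears in the
    // visible text of the HTML, ignoring case.
    public static bool ContainsKeyword(string html, string keyword)
    {
        // Replace each tag with a space so that words from adjacent
        // elements do not run together.
        string text = Regex.Replace(html, @"<[^>]*>", " ");

        // Turn entities such as &amp; back into plain characters.
        text = WebUtility.HtmlDecode(text);

        return text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Because the tags are removed before searching, a keyword that only occurs inside an attribute (an `href`, say) does not count as a match.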
In Visual Basic this works:
Imports System
Imports System.IO
Imports System.Net
Function MakeRequest(ByVal url As String) As String
    Dim request As WebRequest = WebRequest.Create(url)
    ' If required by the server, set the credentials.
    request.Credentials = CredentialCache.DefaultCredentials
    ' Get the response; Using disposes it when the block ends.
    Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        ' Open the response stream with a StreamReader for easy access.
        Using reader As New StreamReader(response.GetResponseStream())
            ' Read the whole body and return it as one string.
            Return reader.ReadToEnd()
        End Using
    End Using
End Function
Edit: For future reference, for others who find this page: you pass in a URL, and this function fetches the page, reads all the HTML, and returns it as a string. Then all you have to do is parse it (search for your text in it), or you could use a StreamWriter to save it to a text or HTML file if you wanted to.
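Since the question asks for C#, here is a rough C# counterpart of the same idea. It is a sketch, not a translation: it uses `HttpClient` rather than `WebRequest`, and the names `PageSearch`, `PageContainsAsync`, and `ContainsIgnoreCase` are invented for the example. Note that it searches the raw HTML, so it will also match text inside tags and attributes.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageSearch
{
    // One shared HttpClient for the whole application.
    private static readonly HttpClient Client = new HttpClient();

    // Case-insensitive substring check on already-downloaded text.
    public static bool ContainsIgnoreCase(string text, string keyword)
    {
        return text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Downloads the page at the given URL and reports whether the
    // raw HTML contains the keyword.
    public static async Task<bool> PageContainsAsync(string url, string keyword)
    {
        string html = await Client.GetStringAsync(url);
        return ContainsIgnoreCase(html, keyword);
    }
}
```

From an async method you would call it as `bool found = await PageSearch.PageContainsAsync("http://www.google.com", "gmail");`.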
You should be able to open the HTML file as-is. HTML files are plain text, so FileStream and StreamReader should be sufficient to read the file.
If you really want the file to be a .txt, you can simply save it as filename.txt instead of filename.html when you download it.
using (WebClient client = new WebClient())
{
client.DownloadFile("http://example.com", @"D:\filename.txt");
}
Do not use regular expressions to parse HTML; HTML is too complex for regular expressions. See the long discussion on SO about this:
RegEx match open tags except XHTML self-contained tags
Use an existing HTML parser for this purpose instead.
Here is another discussion on SO where you can find the links you need.
You can also search the internet yourself.
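As an illustration of the parser route, here is a minimal sketch using the HTML Agility Pack mentioned earlier in this thread. It assumes the HtmlAgilityPack NuGet package is installed; the class and method names (`ParsedSearch`, `ContainsKeyword`) are made up for the example. `InnerText` on the document node also includes the contents of script and style elements, so treat this as a starting point rather than a finished solution.

```csharp
using System;
using System.Net;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class ParsedSearch
{
    // Hypothetical helper: parses the HTML and searches only its text
    // nodes, ignoring case.
    public static bool ContainsKeyword(string html, string keyword)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text of all nodes; decode entities
        // such as &amp; before searching.
        string text = WebUtility.HtmlDecode(doc.DocumentNode.InnerText);

        return text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```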