Download HTML file and convert it to TXT

Question

I am writing a program in C#. I need to know whether there is a way to open a URL and look for keywords in the page text. For example, if my program gets the URL http://www.google.com and the keyword "gmail", it should return true. In short, I need to know whether there is a way to go to a URL, download the HTML file, and convert it to text so I can search for my keyword.

Answer 1

It sounds like you want to remove all the HTML tags and then search the resulting text.

My first reaction was to use a Regular Expression:

String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);

I shamelessly stole this from: Using C# regular expressions to remove HTML tags

That question suggests the HTML Agility Pack, which sounds like exactly what you're looking for.
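
As a rough sketch of how the regular expression above could be turned into a keyword check, the helper below strips anything that looks like a tag, decodes HTML entities, and searches case-insensitively. The names `HtmlKeywordSearch`, `StripTags`, and `ContainsKeyword` are made up for illustration. Tags are replaced with a space rather than `String.Empty` so that neighbouring words do not run together, and the text inside script and style elements is still treated as page text, which is one of the limits of the regex shortcut.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlKeywordSearch
{
    // Replace anything that looks like a tag with a space,
    // then decode entities such as &amp; into plain characters.
    public static string StripTags(string html)
    {
        string withoutTags = Regex.Replace(html, @"<[^>]*>", " ");
        return WebUtility.HtmlDecode(withoutTags);
    }

    // True if the visible text (as approximated by StripTags)
    // contains the keyword, ignoring case.
    public static bool ContainsKeyword(string html, string keyword)
    {
        string text = StripTags(html);
        return text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Because the tags are removed before searching, a keyword that appears only inside an attribute value (for example in an href) does not count as a match.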

Answer 2

In Visual Basic this works:

Imports System
Imports System.IO
Imports System.Net

Function MakeRequest(ByVal url As String) As String
    Dim request As WebRequest = WebRequest.Create(url)
    ' If required by the server, set the credentials.
    request.Credentials = CredentialCache.DefaultCredentials
    ' Get the response; the Using blocks release it when done.
    Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        ' Get the stream containing content returned by the server.
        Using dataStream As Stream = response.GetResponseStream()
            ' Open the stream using a StreamReader for easy access.
            Using reader As New StreamReader(dataStream)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Using
End Function

Edit: For the benefit of others who find this page: you pass in a URL, and this function goes to the page, reads all of the HTML, and returns it as a string. After that, all you have to do is search the string for your text, or you could use a StreamWriter to save it to a text or HTML file if you wanted to.
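
Since the question asks about C#, here is a rough C# equivalent of the function above, offered as a sketch rather than a drop-in implementation. The class name `PageFetcher` is made up for illustration, and the response is kept as a plain `WebResponse` (no cast to `HttpWebResponse`), which is an assumption on my part so that the same code also accepts non-HTTP URIs.

```csharp
using System;
using System.IO;
using System.Net;

public static class PageFetcher
{
    // Fetch the resource at the given URL and return its body as a string.
    public static string MakeRequest(string url)
    {
        WebRequest request = WebRequest.Create(url);
        // If required by the server, set the credentials.
        request.Credentials = CredentialCache.DefaultCredentials;

        // The using blocks dispose the response, stream, and reader when done.
        using (WebResponse response = request.GetResponse())
        using (Stream dataStream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(dataStream))
        {
            return reader.ReadToEnd();
        }
    }
}
```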

Answer 3

You should be able to open the HTML file as-is. HTML files are plaintext, meaning that FileStream and StreamReader should be sufficient to read the file.

If you really want the file to be a .txt, you can simply save the file as filename.txt instead of filename.html when you download it.

Answer 4

using (WebClient client = new WebClient()) 
{
   client.DownloadFile("http://example.com", @"D:\filename.txt");
}
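
If the goal is only to search the page, writing it to disk is optional. As a hedged sketch, `WebClient.DownloadString` returns the body as a string instead of saving it. `PageSearch` and `PageContains` are hypothetical names, and the check runs against the raw HTML, so a keyword that appears only inside a tag or attribute also counts as a match.

```csharp
using System;
using System.Net;

public static class PageSearch
{
    // Download the page at url and report whether the raw HTML
    // contains keyword, ignoring case.
    public static bool PageContains(string url, string keyword)
    {
        using (WebClient client = new WebClient())
        {
            string html = client.DownloadString(url);
            return html.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
        }
    }
}
```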

Answer 5

Do not use regular expressions to parse HTML; HTML is too complex for regular expressions. See the long discussion on SO about this:

RegEx match open tags except XHTML self-contained tags

Instead, use an existing HTML parser for this purpose.

Here is another discussion on SO where you can find the links you need:

Looking for C# HTML parser

You can also search the internet yourself.

5 answers

solution1
2 2011-08-18 17:33:12

solution2
1 2011-08-18 17:34:31

solution3
1 ACCEPTED 2011-08-18 17:35:03

solution4
0 2017-01-24 08:31:38

solution5
0 2011-08-18 17:43:00
