Download HTML file and convert it to TXT

I am writing a program in C#. I need to know whether there is a way to open a URL and look for keywords in the page text. For example, if my program gets the URL http://www.google.com and the keyword "gmail", it should return true. In short, I need a way to go to a URL, download the HTML file, and convert it to text so I can search it for my keyword.

It sounds like you want to remove all the HTML tags and then search the resulting text.

My first reaction was to use a Regular Expression:

String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);

I shamelessly stole this from the question "Using C# regular expressions to remove HTML tags", whose answers suggest the HTML Agility Pack, which sounds like exactly what you're looking for.
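To make the strip-then-search idea concrete, here is a minimal sketch built around the same regular expression. The class and method names (KeywordSearch, StripTags, ContainsKeyword) are hypothetical, and the case-insensitive comparison is an assumption about what "look for a keyword" should mean. Note that this pattern does not remove the contents of script or style elements, so it is only a rough approximation of the visible text.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class KeywordSearch
{
    // Replace every <...> tag with a space so adjacent words do not merge,
    // then decode entities such as &amp; into plain characters.
    public static string StripTags(string html)
    {
        string withoutTags = Regex.Replace(html, @"<[^>]*>", " ");
        return WebUtility.HtmlDecode(withoutTags);
    }

    // Case-insensitive search in the stripped text (an assumption about
    // the desired matching rule).
    public static bool ContainsKeyword(string html, string keyword)
    {
        string text = StripTags(html);
        return text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Because the tags are removed before searching, a keyword that only appears inside an attribute (for example in an href) is not counted as a match.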

In Visual Basic, this works:

Imports System
Imports System.IO
Imports System.Net

Function MakeRequest(ByVal url As String) As String
    Dim request As WebRequest = WebRequest.Create(url)
    ' If required by the server, set the credentials.
    request.Credentials = CredentialCache.DefaultCredentials
    ' Get the response. The Using blocks close the response, stream, and reader when done.
    Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        ' Get the stream containing content returned by the server.
        Using dataStream As Stream = response.GetResponseStream()
            ' Open the stream using a StreamReader for easy access.
            Using reader As New StreamReader(dataStream)
                ' Read the whole response body and return it as a string.
                Return reader.ReadToEnd()
            End Using
        End Using
    End Using
End Function

Edit: For future reference for anyone who finds this page: you pass in a URL, and this function requests the page, reads all of the HTML, and returns it as a string. Then all you have to do is parse it (search for text in it), or you could use a StreamWriter to save it to a text or HTML file if you wanted to.

You should be able to open the HTML file as-is. HTML files are plaintext, meaning that FileStream and StreamReader should be sufficient to read the file.

If you really want the file to be a .txt, you can simply save the file as filename.txt instead of filename.html when you download it.

using (WebClient client = new WebClient()) 
{
   client.DownloadFile("http://example.com", @"D:\filename.txt");
}
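If the goal is only to check for a keyword, the page does not need to be saved to disk first. The following is a minimal sketch that uses WebClient.DownloadString to keep the HTML in memory. The class and method names (PageSearch, ContainsIgnoreCase, PageContains) are hypothetical, and the case-insensitive comparison is an assumption. It searches the raw HTML, so a match may come from markup rather than visible text.

```csharp
using System;
using System.Net;

// Hypothetical helper class, not part of any library.
public static class PageSearch
{
    // Case-insensitive substring test (an assumed matching rule).
    public static bool ContainsIgnoreCase(string text, string keyword)
    {
        return text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Download the page at the given URL into a string and search it.
    // Network failures surface as WebException and are not handled here.
    public static bool PageContains(string url, string keyword)
    {
        using (WebClient client = new WebClient())
        {
            string html = client.DownloadString(url);
            return ContainsIgnoreCase(html, keyword);
        }
    }
}
```

A call would look like PageSearch.PageContains("http://www.google.com", "gmail"); the result depends on what the server returns at the time.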

Do not use regular expressions to parse HTML; HTML is too complex for regular expressions to handle reliably. See this long discussion on SO:

RegEx match open tags except XHTML self-contained tags

Instead, use an existing HTML parser.

Here is another discussion on SO where you can find the links you need:

Looking for C# HTML parser

You can also search the web for other options.
