
Removing <div class> code blocks from a text file

I have extracted some information from an .html file and put it into a text file. The information is there, but once in a while something like this comes up:

info, info...
info, info...

    <div class="ratings-link"> <img alt="arrows" class="icon" src= bla bla...</a></div>"

info, info...
info, info...

What I want to do is basically remove everything that is not info, i.e. get rid of

<div class="ratings-link" ....bla bla... </a></div> 

altogether.

What is the best way/tool to achieve this? I wrote a C program using scanf, but it didn't work because not all of these divs have the same end string. They do all share the pattern shown above, though.

If it were me, I would write a quick script in either PHP or Python to do this.

PHP has the strip_tags function: http://www.php.net//manual/en/function.strip-tags.php

Python has a library called Beautiful Soup, which is very mature and great for this kind of thing: http://www.crummy.com/software/BeautifulSoup/
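To illustrate the parser-based approach without installing anything, here is a minimal sketch that uses Python's standard-library html.parser instead of Beautiful Soup (a swapped-in technique, not the Beautiful Soup API). The names TextExtractor and strip_tags and the sample string are made up for this example, and the sample is well-formed HTML, unlike the snippet in the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text that appears outside of tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of plain text found between tags.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of `markup` with all tags dropped."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Hypothetical sample, loosely modelled on the question's text file.
sample = (
    "info, info...\n"
    '<div class="ratings-link"><a href="#"><img alt="arrows" class="icon" src="x.png"></a></div>\n'
    "info, info..."
)
print(strip_tags(sample))
```

Because a parser tracks where tags begin and end, it copes with attribute values that contain a ">" character, which a simple pattern match would not.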

Or how about any language that has regex support, removing everything that matches <[^>]*>?
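As a rough sketch of that idea in Python, the snippet below applies the pattern with re.sub and then drops any line that is left empty. The function name remove_tags is an assumption for illustration, and the sample reuses the placeholder text from the question as-is.

```python
import re

# The pattern from the answer: "<", then anything except ">", then ">".
TAG_RE = re.compile(r"<[^>]*>")


def remove_tags(text):
    """Delete every tag-like run, then drop lines left with only whitespace."""
    stripped = TAG_RE.sub("", text)
    # Note: this also removes lines that were already blank in the input.
    kept = [line for line in stripped.splitlines() if line.strip()]
    return "\n".join(kept)


sample = "\n".join([
    "info, info...",
    '<div class="ratings-link"> <img alt="arrows" class="icon" src= bla bla...</a></div>',
    "info, info...",
])
print(remove_tags(sample))
```

The pattern only removes the tags themselves. Any text sitting between an opening and a closing tag is kept, and a ">" inside an attribute value would end a match early, so this suits simple input like the sample rather than arbitrary HTML.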

You can use a regex in almost any programming language to filter your text and remove unneeded information.

In your case, the relevant regex would be: <[^>]*>

Here is an example in C#:

using System;
using System.Text.RegularExpressions;
public class Program
{
    // Sample input: lines of info with one leftover <div> block in between.
    static string myString =  "info, info..." + 
    Environment.NewLine + "info, info..." + 
    Environment.NewLine + "<div class='ratings-link'> <img alt='arrows' class='icon' src= bla bla...</a></div>" + 
    Environment.NewLine + "info, info..." + 
    Environment.NewLine + "info, info...";
    public static void Main()
    {
        // Replace every match of <[^>]*> (a "<", any non-">" characters, a ">") with nothing.
        string result = Regex.Replace(myString, @"<[^>]*>", string.Empty);
        Console.WriteLine(result);
    }
}


What you are actually trying to do is strip all HTML tags from your text. For occasional copy-and-paste use, the simplest way is an online tool like http://www.striphtml.com/ or, perhaps more conveniently, http://www.zubrag.com/tools/html-tags-stripper.php, which can also take a URL to strip from (so you don't have to extract the text first) and lets you choose tags you want to exclude.

If I misunderstood you and you intend to write an HTML stripper program yourself, every language/platform that I know of has functions that do exactly that. PHP, for example, has the strip_tags() function.
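For the text-file case in the question, where whole blocks should disappear rather than just the tags, a small program could instead filter the file line by line. The sketch below assumes that each unwanted block sits on a single line and starts with <div class= (as in the question's sample). The function name drop_div_lines and the file names are hypothetical.

```python
import os
import tempfile


def drop_div_lines(src_path, dst_path, marker="<div class="):
    """Copy src to dst, skipping lines whose first non-blank text is `marker`.

    Returns the number of lines that were skipped.
    """
    removed = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.lstrip().startswith(marker):
                removed += 1
                continue
            dst.write(line)
    return removed


# Small demonstration using temporary files (names are made up).
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "extracted.txt")
    dst_file = os.path.join(tmp, "cleaned.txt")
    with open(src_file, "w", encoding="utf-8") as f:
        f.write('info, info...\n    <div class="ratings-link"> bla</div>\ninfo, info...\n')
    skipped = drop_div_lines(src_file, dst_file)
    with open(dst_file, encoding="utf-8") as f:
        cleaned = f.read()
    print(skipped, repr(cleaned))
```

Unlike tag stripping, this discards the entire line, including any text between the tags, which is closer to "remove everything which is not info". It would miss blocks that span several lines.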
