
Stripping out HTML tags from a string

How do I remove HTML tags from a string so that I can output clean text?

let str = string.stringByReplacingOccurrencesOfString("<[^>]+>", withString: "", options: .RegularExpressionSearch, range: nil)
print(str)

Hmm, I tried your function and it worked on a small example:

var string = "<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html>"
let str = string.stringByReplacingOccurrencesOfString("<[^>]+>", withString: "", options: .RegularExpressionSearch, range: nil)
print(str)

// output: "   My First Heading My first paragraph.  "

Can you give an example of a problem?

Swift 4 and 5 version:

var string = "<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html>"
let str = string.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression, range: nil)
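
Stripping the tags leaves the whitespace that sat between them. If the goal is clean text, a second pass can tidy that up. This is only a sketch: the names html, stripped and tidy are illustrative, and collapsing every whitespace run to one space is an assumption about what "clean" means here.

```swift
import Foundation

let html = "<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html>"

// Step 1: remove anything that looks like a tag.
let stripped = html.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)

// Step 2: collapse runs of whitespace and trim both ends.
let tidy = stripped
    .replacingOccurrences(of: "\\s+", with: " ", options: .regularExpression)
    .trimmingCharacters(in: .whitespacesAndNewlines)

print(tidy) // My First Heading My first paragraph.
```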

HTML is not a regular language (it is at best a context-free one), so a regular expression cannot parse it reliably; it only works on simple, well-behaved input. See: Using regular expressions to parse HTML: why not?
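
To make the problem concrete, here are two inputs where the tag-stripping pattern from this thread produces unclean text. The helper name stripTags is hypothetical and only wraps that pattern; the inputs are made up for illustration.

```swift
import Foundation

// Hypothetical wrapper around the pattern used by the regex-based answers.
func stripTags(_ html: String) -> String {
    return html.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)
}

// A ">" inside an attribute value ends the match too early,
// so part of the attribute survives next to the link text.
let attribute = stripTags("<a title=\"a > b\">link</a>")
print(attribute)

// Character entities are not tags, so they pass through undecoded.
let entity = stripTags("<p>Fish &amp; chips</p>")
print(entity)
```

The first call leaves the tail of the title attribute in the result, and the second still contains &amp; instead of &.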

I would consider using NSAttributedString instead.

let htmlString = "LCD Soundsystem was the musical project of producer <a href='http://www.last.fm/music/James+Murphy' class='bbcode_artist'>James Murphy</a>, co-founder of <a href='http://www.last.fm/tag/dance-punk' class='bbcode_tag' rel='tag'>dance-punk</a> label <a href='http://www.last.fm/label/DFA' class='bbcode_label'>DFA</a> Records. Formed in 2001 in New York City, New York, United States, the music of LCD Soundsystem can also be described as a mix of <a href='http://www.last.fm/tag/alternative%20dance' class='bbcode_tag' rel='tag'>alternative dance</a> and <a href='http://www.last.fm/tag/post%20punk' class='bbcode_tag' rel='tag'>post punk</a>, along with elements of <a href='http://www.last.fm/tag/disco' class='bbcode_tag' rel='tag'>disco</a> and other styles. <br />"    
let htmlStringData = htmlString.dataUsingEncoding(NSUTF8StringEncoding)!
let options: [String: AnyObject] = [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType, NSCharacterEncodingDocumentAttribute: NSUTF8StringEncoding]
let attributedHTMLString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
let string = attributedHTMLString.string

Or, as Irshad Mohamed in the comments would do it:

let attributed = try NSAttributedString(data: htmlString.data(using: .unicode)!, options: [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType], documentAttributes: nil)
print(attributed.string)

Mohamed's solution, but as a String extension in Swift 4:

extension String {

    func stripOutHtml() -> String? {
        // Encode as UTF-8 so the bytes match the .characterEncoding option below.
        guard let data = self.data(using: .utf8) else {
            return nil
        }
        do {
            let attributed = try NSAttributedString(data: data, options: [.documentType: NSAttributedString.DocumentType.html, .characterEncoding: String.Encoding.utf8.rawValue], documentAttributes: nil)
            return attributed.string
        } catch {
            // The HTML could not be imported.
            return nil
        }
    }
}

I'm using the following extension to remove specific HTML elements:

extension String {
    func deleteHTMLTag(tag:String) -> String {
        return self.stringByReplacingOccurrencesOfString("(?i)</?\(tag)\\b[^<]*>", withString: "", options: .RegularExpressionSearch, range: nil)
    }

    func deleteHTMLTags(tags:[String]) -> String {
        var mutableString = self
        for tag in tags {
            mutableString = mutableString.deleteHTMLTag(tag)
        }
        return mutableString
    }
}

This makes it possible to remove only the <a> tags from a string, e.g.:

let string = "my html <a href=\"\">link text</a>"
let withoutHTMLString = string.deleteHTMLTag("a") // Will be "my html link text"

extension String {
    var htmlStripped: String {
        return self.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression, range: nil)
    }
}

Happy Coding

I prefer a regular expression over the NSAttributedString HTML conversion. Be advised that the conversion is pretty time-consuming and needs to run on the main thread too. More information here: https://developer.apple.com/documentation/foundation/nsattributedstring/1524613-initwithdata

For me this did the trick: first I remove any <style> blocks along with their CSS, and then all the remaining HTML tags. It is probably not as robust as the NSAttributedString option, but it is way faster for my case.

extension String {
    func withoutHtmlTags() -> String {
        // Drop <style> blocks together with their CSS before stripping the remaining tags.
        let str = self.replacingOccurrences(of: "(?i)<style[^>]*>[\\s\\S]*?</style>", with: "", options: .regularExpression, range: nil)
        return str.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression, range: nil)
    }
}
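
A small sketch of why the order matters: stripping tags alone leaves the CSS text behind, while removing the <style> block first does not. The sample input and variable names are made up. The style pattern is an assumption rather than a quote from the answer: it is written to accept attributes on the tag and a > inside the CSS.

```swift
import Foundation

let html = "<style>p > a { color: red; }</style><p>Hello</p>"

// Tags only: the CSS between <style> and </style> is kept as text.
let tagsOnly = html.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)

// Style block first (lazy match up to the first </style>), then the remaining tags.
let noStyle = html.replacingOccurrences(of: "(?i)<style[^>]*>[\\s\\S]*?</style>", with: "", options: .regularExpression)
let clean = noStyle.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)

print(tagsOnly) // p > a { color: red; }Hello
print(clean)    // Hello
```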

Swift 5

extension String {
    public func trimHTMLTags() -> String? {
        guard let htmlStringData = self.data(using: String.Encoding.utf8) else {
            return nil
        }
    
        let options: [NSAttributedString.DocumentReadingOptionKey : Any] = [
            .documentType: NSAttributedString.DocumentType.html,
            .characterEncoding: String.Encoding.utf8.rawValue
        ]
    
        let attributedString = try? NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
        return attributedString?.string
    }
}

Use:

let str = "my html <a href='https://www.google.com'>link text</a>"

print(str.trimHTMLTags() ?? "--") //"my html link text"

Swift 4:

extension String {
    func deleteHTMLTag(tag:String) -> String {
        return self.replacingOccurrences(of: "(?i)</?\(tag)\\b[^<]*>", with: "", options: .regularExpression, range: nil)
    }

    func deleteHTMLTags(tags:[String]) -> String {
        var mutableString = self
        for tag in tags {
            mutableString = mutableString.deleteHTMLTag(tag: tag)
        }
        return mutableString
    }
}
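
For reference, a usage sketch of the same idea as a free function. The name removingTags is hypothetical, and escaping the tag name with NSRegularExpression.escapedPattern(for:) is an addition that is not part of the extension above; it only guards against tag names that contain regex metacharacters.

```swift
import Foundation

// Hypothetical helper: same pattern as deleteHTMLTag(tag:), with the tag name escaped first.
func removingTags(_ tags: [String], from html: String) -> String {
    var result = html
    for tag in tags {
        let name = NSRegularExpression.escapedPattern(for: tag)
        result = result.replacingOccurrences(
            of: "(?i)</?\(name)\\b[^<]*>",
            with: "",
            options: .regularExpression)
    }
    return result
}

let sample = "my <b>html</b> <a href=\"\">link text</a>"
let noLinks = removingTags(["a"], from: sample)
let plain = removingTags(["a", "b"], from: sample)

print(noLinks) // my <b>html</b> link text
print(plain)   // my html link text
```

The \b after the tag name keeps a request for "a" from also matching tags such as <abbr>.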

Updated for Swift 4:

guard let htmlStringData = htmlString.data(using: .unicode) else { fatalError() }

let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
    .documentType: NSAttributedString.DocumentType.html,
    .characterEncoding: String.Encoding.unicode.rawValue
]

let attributedHTMLString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
let string = attributedHTMLString.string

I had mild success using tree-based XML processing with the XMLDocument API. Note that XMLDocument is part of Foundation on macOS only; it is not available on iOS or watchOS.

I made this extension. You might have to tweak it a bit according to your needs, for example:

  • The text of the elements found is joined with \n, which puts a new line between, for example, different <p>...</p> elements. This worked for my case; yours might differ. If you don't need a new line between elements, replace \n with an empty string in .joined(separator:).

Extension

extension String {
    var removingHTML: Self? {
        let wrapped = "<xml>" + self + "</xml>"
        guard let xml = try? XMLDocument(data: .init(wrapped.utf8), options: .documentTidyXML) else { return nil }
        var children = xml.rootDocument?.children?.first?.children
        while children?.first?.name?.lowercased() == "html"
                || children?.first?.name?.lowercased() == "body" {
            children = children?
                .first {
                    $0.children?.isEmpty == false
                }?
                .children
        }
        return children
            .flatMap {
                $0.compactMap(\.stringValue)
            }?
            .joined(separator: "\n")
    }
}

Results

Using some of the examples already given in this thread:

// Each example uses its own constant so the three can live in the same scope.
let page = "<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html>"
print(page.removingHTML) // "My First Heading\nMy first paragraph."
let bio = "LCD Soundsystem was the musical project of producer <a href='http://www.last.fm/music/James+Murphy' class='bbcode_artist'>James Murphy</a>, co-founder of <a href='http://www.last.fm/tag/dance-punk' class='bbcode_tag' rel='tag'>dance-punk</a> label <a href='http://www.last.fm/label/DFA' class='bbcode_label'>DFA</a> Records. Formed in 2001 in New York City, New York, United States, the music of LCD Soundsystem can also be described as a mix of <a href='http://www.last.fm/tag/alternative%20dance' class='bbcode_tag' rel='tag'>alternative dance</a> and <a href='http://www.last.fm/tag/post%20punk' class='bbcode_tag' rel='tag'>post punk</a>, along with elements of <a href='http://www.last.fm/tag/disco' class='bbcode_tag' rel='tag'>disco</a> and other styles. <br />"
print(bio.removingHTML) // LCD Soundsystem was the musical project of producer \nJames Murphy\n, co-founder of \ndance-punk\nlabel \nDFA\nRecords. Formed in 2001 in New York City, New York, United States, the music of LCD Soundsystem can also be described as a mix of \nalternative dance\nand \npost punk\n, along with elements of \ndisco\nand other styles. \n
let link = "my html <a href=\"\">link text</a>"
print(link.removingHTML) // my html \nlink text

Note that in the last example the newline separator splits what should be a single sentence. Adjust the extension and replace the new line with another separator according to your needs.
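
The effect of the separator can be seen in isolation. The fragments array below is hypothetical; it only mimics the pieces that the last example's output suggests the extension joins.

```swift
// Hypothetical fragments, shaped like the pieces joined for the last example.
let fragments = ["my html ", "link text"]

// Newline separator, as in the extension above.
let withNewlines = fragments.joined(separator: "\n")

// Empty separator keeps inline elements on one line.
let inline = fragments.joined(separator: "")

print(inline) // my html link text
```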
