
Using Swift, how do you re-encode then decode a String like this short script in Python?

The XKCD API has some odd encoding issues.

Minor encoding issue with xkcd alt texts in chat

The solution (in Python) is to encode it as latin1 then decode as utf8, but how do I do this in Swift?

Test string:

"Be careful\u00e2\u0080\u0094it's breeding season"

Expected output:

Be careful—it's breeding season

Python (from above link):

import json
a = '''"Be careful\u00e2\u0080\u0094it's breeding season"'''
print(json.loads(a).encode('latin1').decode('utf8'))

How is this done in Swift?

let strdata = "Be careful\\u00e2\\u0080\\u0094it's breeding season".data(using: .isoLatin1)!
let str = String(data: strdata, encoding: .utf8)

That doesn't work!

You have to decode the JSON data first, then extract the string, and finally “fix” the string. Here is a self-contained example with the JSON from https://xkcd.com/1814/info.0.json :

import Foundation

let data = """
    {"month": "3", "num": 1814, "link": "", "year": "2017", "news": "",
    "safe_title": "Color Pattern", "transcript": "",
    "alt": "\\u00e2\\u0099\\u00ab When the spacing is tight / And the difference is slight / That's a moir\\u00c3\\u00a9 \\u00e2\\u0099\\u00ab",
    "img": "https://imgs.xkcd.com/comics/color_pattern.png",
    "title": "Color Pattern", "day": "22"}
""".data(using: .utf8)!

// Alternatively:
// let url = URL(string: "https://xkcd.com/1814/info.0.json")!
// let data = try! Data(contentsOf: url)

do {
    if let dict = (try JSONSerialization.jsonObject(with: data, options: [])) as? [String: Any],
        var alt = dict["alt"] as? String {

        // Now try to fix the "alt" string
        if let isoData = alt.data(using: .isoLatin1),
            let altFixed = String(data: isoData, encoding: .utf8) {
            alt = altFixed
        }

        print(alt)
        // ♫ When the spacing is tight / And the difference is slight / That's a moiré ♫
    }
} catch {
    print(error)
}
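The "fix" step can be pulled out into a small helper, which also shows why the attempt in the question has no effect. This is only a sketch: `reinterpretingLatin1AsUTF8` is a hypothetical name, not a Foundation API. The escaped literal from the question contains only ASCII characters (a backslash, `u`, and hex digits), so the Latin-1 to UTF-8 round trip returns it unchanged. The repair only does something once the escapes have been decoded into real characters.

```swift
import Foundation

extension String {
    /// Hypothetical helper (not part of Foundation): re-encodes the string as
    /// ISO Latin-1 and reinterprets the resulting bytes as UTF-8.
    /// Returns nil if either step fails.
    func reinterpretingLatin1AsUTF8() -> String? {
        guard let latin1Bytes = self.data(using: .isoLatin1) else { return nil }
        return String(data: latin1Bytes, encoding: .utf8)
    }
}

// Decoded text: the scalars U+00E2 U+0080 U+0094 are the UTF-8 bytes of U+2014.
let mojibake = "Be careful\u{00e2}\u{0080}\u{0094}it's breeding season"
let repaired = mojibake.reinterpretingLatin1AsUTF8()
print(repaired ?? "conversion failed")

// Still-escaped text: ASCII only, so the round trip changes nothing.
let escaped = "Be careful\\u00e2\\u0080\\u0094it's breeding season"
let unchanged = escaped.reinterpretingLatin1AsUTF8()
print(unchanged ?? "conversion failed")
```

Returning an optional lets the caller keep the original text when the string turns out not to be mis-encoded, which is what the `if let` in the example above does.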

If you have just a string of the form

Be careful\u00e2\u0080\u0094it's breeding season

then you can still use JSONSerialization to decode the \uNNNN escape sequences, and then continue as above.

A simple example (error checking omitted for brevity):

let strbad = "Be careful\\u00e2\\u0080\\u0094it's breeding season"
let decoded = try! JSONSerialization.jsonObject(with: Data("\"\(strbad)\"".utf8), options: .allowFragments) as! String
let strgood = String(data: decoded.data(using: .isoLatin1)!, encoding: .utf8)!
print(strgood)
// Be careful—it's breeding season
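If the force unwraps are a concern, the same steps can be written with optional binding. This is a minimal sketch: `repairEscapedString` is a hypothetical name, and it assumes the input contains no unescaped double quotes or control characters, because the text is wrapped in quotes and parsed as a JSON string fragment.

```swift
import Foundation

/// Hypothetical helper: decodes \uNNNN escapes via JSONSerialization, then
/// applies the Latin-1 -> UTF-8 repair. Returns nil if any step fails.
/// Assumes `escaped` has no unescaped double quotes or control characters.
func repairEscapedString(_ escaped: String) -> String? {
    // Wrap the text in quotes so it parses as a JSON string fragment.
    let json = Data("\"\(escaped)\"".utf8)
    guard
        let object = try? JSONSerialization.jsonObject(with: json, options: .allowFragments),
        let decoded = object as? String,
        let latin1Bytes = decoded.data(using: .isoLatin1)
    else { return nil }
    // Reinterpret the Latin-1 bytes as UTF-8; nil if they are not valid UTF-8.
    return String(data: latin1Bytes, encoding: .utf8)
}

let fixed = repairEscapedString("Be careful\\u00e2\\u0080\\u0094it's breeding season")
print(fixed ?? "could not repair")
```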

I couldn't find anything built in, but I did manage to write this for you.

import Foundation

extension String {
    func range(nsRange: NSRange) -> Range<Index> {
        return Range(nsRange, in: self)!
    }

    func nsRange(range: Range<Index>) -> NSRange {
        return NSRange(range, in: self)
    }

    var fullRange: Range<Index> {
        return startIndex..<endIndex
    }

    var fullNSRange: NSRange {
        return nsRange(range: fullRange)
    }

    subscript(nsRange: NSRange) -> Substring {
        return self[range(nsRange: nsRange)]
    }

    func convertingUnicodeCharacters() -> String {
        var string = self
        // Characters need to be replaced in groups in case of clusters
        let groupedRegex = try! NSRegularExpression(pattern: "(\\\\u[0-9a-fA-F]{1,8})+")
        for match in groupedRegex.matches(in: string, range: string.fullNSRange).reversed() {
            let groupedHexValues = String(string[match.range])
            var characters = [Character]()
            let regex = try! NSRegularExpression(pattern: "\\\\u([0-9a-fA-F]{1,8})")
            for hexMatch in regex.matches(in: groupedHexValues, range: groupedHexValues.fullNSRange) {
                // hexMatch ranges are relative to groupedHexValues, so convert them against that string
                let hexString = groupedHexValues[Range(hexMatch.range(at: 1), in: groupedHexValues)!]
                if let hexValue = UInt32(hexString, radix: 16),
                    let scalar = UnicodeScalar(hexValue) {
                    characters.append(Character(scalar))
                }
            }
            string.replaceSubrange(Range(match.range, in: string)!, with: characters)
        }
        return string
    }
}

It basically looks for any \u<1-8 digit hex> values and converts them into scalars. Should be fairly straightforward... 🧐 I've tried to test it a fair bit, but I'm not sure it catches every edge case.

My playground testing code was simply:

let string = "Be careful\\u00e2\\u0080\\u0094-\\u1F496\\u65\\u301it's breeding season"
let expected = "Be careful\u{00e2}\u{0080}\u{0094}-\u{1f496}\u{65}\u{301}it's breeding season"
string.convertingUnicodeCharacters() == expected // true 🎉
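One edge case worth knowing about: the extension accepts 1 to 8 hex digits after \u, which is looser than JSON, where \u is followed by exactly four hex digits and code points above U+FFFF are written as a UTF-16 surrogate pair. So in JSON, U+1F496 appears as \ud83d\udc96 rather than \u1F496. The sketch below shows how JSONSerialization treats that form, for comparison. The variable names are illustrative only.

```swift
import Foundation

// In JSON, U+1F496 is written as the UTF-16 surrogate pair \ud83d\udc96.
let surrogateJSON = Data("\"\\ud83d\\udc96\"".utf8)

// JSONSerialization combines the pair into a single Unicode scalar.
let heart = (try? JSONSerialization.jsonObject(with: surrogateJSON, options: .allowFragments)) as? String
print(heart ?? "could not decode")
```

Note that neither approach does the Latin-1 to UTF-8 repair on its own. Both only turn the escapes into characters, and the re-encoding step from the first answer still has to follow.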
