Using Swift, how do you re-encode then decode a String like this short script in Python?

XKCD's API has some weird encoding issues.

Minor encoding issue with xkcd alt texts in chat

The solution (in Python) is to encode it as latin1 then decode as utf8, but how do I do this in Swift?

Test string:

"Be careful\u00e2\u0080\u0094it's breeding season"

Expected output:

Be careful—it's breeding season

Python (from above link):

import json
a = '''"Be careful\u00e2\u0080\u0094it's breeding season"'''
print(json.loads(a).encode('latin1').decode('utf8'))

How is this done in Swift?

let strdata = "Be careful\\u00e2\\u0080\\u0094it's breeding season".data(using: .isoLatin1)!
let str = String(data: strdata, encoding: .utf8)

That doesn't work!

You have to decode the JSON data first, then extract the string, and finally “fix” the string. Here is a self-contained example with the JSON from https://xkcd.com/1814/info.0.json:

let data = """
    {"month": "3", "num": 1814, "link": "", "year": "2017", "news": "",
    "safe_title": "Color Pattern", "transcript": "",
    "alt": "\\u00e2\\u0099\\u00ab When the spacing is tight / And the difference is slight / That's a moir\\u00c3\\u00a9 \\u00e2\\u0099\\u00ab",
    "img": "https://imgs.xkcd.com/comics/color_pattern.png",
    "title": "Color Pattern", "day": "22"}
""".data(using: .utf8)!

// Alternatively:
// let url = URL(string: "https://xkcd.com/1814/info.0.json")!
// let data = try! Data(contentsOf: url)

do {
    if let dict = (try JSONSerialization.jsonObject(with: data, options: [])) as? [String: Any],
        var alt = dict["alt"] as? String {

        // Now try to fix the "alt" string: re-encode as Latin-1, then decode as UTF-8
        if let isoData = alt.data(using: .isoLatin1),
            let altFixed = String(data: isoData, encoding: .utf8) {
            alt = altFixed
        }

        print(alt)
        // ♫ When the spacing is tight / And the difference is slight / That's a moiré ♫
    }
} catch {
    print(error)
}
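The same three steps (decode the JSON, extract the string, repair it) can also be sketched with JSONDecoder instead of JSONSerialization. This is only an illustration: the Comic struct and the decodeAlt(from:) function are hypothetical names, not part of the answer above, and the sample JSON is trimmed to two fields.

```swift
import Foundation

// Hypothetical model: only the fields this sketch needs.
struct Comic: Decodable {
    let num: Int
    let alt: String
}

// Decodes the comic, then tries to repair "alt".
// Falls back to the raw value when the repair is not possible.
func decodeAlt(from json: Data) throws -> String {
    let comic = try JSONDecoder().decode(Comic.self, from: json)
    guard let latin1 = comic.alt.data(using: .isoLatin1),
          let repaired = String(data: latin1, encoding: .utf8) else {
        return comic.alt
    }
    return repaired
}

// Raw string literal, so \u00c3 reaches the JSON parser as an escape sequence.
let sample = Data(#"{"num": 1814, "alt": "That's a moir\u00c3\u00a9"}"#.utf8)
do {
    print(try decodeAlt(from: sample)) // That's a moiré
} catch {
    print(error)
}
```

Falling back to the raw value means a string that was already correct is left unchanged: it either does not fit into Latin-1 or does not form valid UTF-8, so one of the two conversions returns nil.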

If you have just a string of the form

Be careful\u00e2\u0080\u0094it's breeding season

then you can still use JSONSerialization to decode the \uNNNN escape sequences, and then continue as above.

A simple example (error checking omitted for brevity):

let strbad = "Be careful\\u00e2\\u0080\\u0094it's breeding season"
let decoded = try! JSONSerialization.jsonObject(with: Data("\"\(strbad)\"".utf8), options: .allowFragments) as! String
let strgood = String(data: decoded.data(using: .isoLatin1)!, encoding: .utf8)!
print(strgood)
// Be careful—it's breeding season
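For completeness, here is a sketch of the same idea with the error checking put back in. The function name repairEscapedString is made up for this example; it returns nil instead of crashing when any step fails.

```swift
import Foundation

// Hypothetical helper: decode \uNNNN escapes, then undo the Latin-1/UTF-8 mix-up.
// Returns nil when the input is not a valid JSON string body or cannot be repaired.
func repairEscapedString(_ escaped: String) -> String? {
    // Wrap the input in quotes so it parses as a JSON string fragment.
    let jsonData = Data("\"\(escaped)\"".utf8)
    guard let object = try? JSONSerialization.jsonObject(with: jsonData, options: .allowFragments),
          let decoded = object as? String,
          let latin1 = decoded.data(using: .isoLatin1),
          let repaired = String(data: latin1, encoding: .utf8) else {
        return nil
    }
    return repaired
}

print(repairEscapedString("Be careful\\u00e2\\u0080\\u0094it's breeding season") ?? "nil")
// Be careful—it's breeding season
```

Note that an unescaped double quote in the input makes the wrapped text invalid JSON, so the function returns nil in that case.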

I couldn't find anything built in, but I did manage to write this for you.

import Foundation

extension String {
    func range(nsRange: NSRange) -> Range<Index> {
        return Range(nsRange, in: self)!
    }

    func nsRange(range: Range<Index>) -> NSRange {
        return NSRange(range, in: self)
    }

    var fullRange: Range<Index> {
        return startIndex..<endIndex
    }

    var fullNSRange: NSRange {
        return nsRange(range: fullRange)
    }

    subscript(nsRange: NSRange) -> Substring {
        return self[range(nsRange: nsRange)]
    }

    func convertingUnicodeCharacters() -> String {
        var string = self
        // Characters need to be replaced in groups in case of clusters
        let groupedRegex = try! NSRegularExpression(pattern: "(\\\\u[0-9a-fA-F]{1,8})+")
        for match in groupedRegex.matches(in: string, range: string.fullNSRange).reversed() {
            let groupedHexValues = String(string[match.range])
            var characters = [Character]()
            let regex = try! NSRegularExpression(pattern: "\\\\u([0-9a-fA-F]{1,8})")
            for hexMatch in regex.matches(in: groupedHexValues, range: groupedHexValues.fullNSRange) {
                // The match ranges refer to groupedHexValues, so convert them against that string
                let hexString = groupedHexValues[Range(hexMatch.range(at: 1), in: groupedHexValues)!]
                if let hexValue = UInt32(hexString, radix: 16),
                    let scalar = UnicodeScalar(hexValue) {
                    characters.append(Character(scalar))
                }
            }
            string.replaceSubrange(Range(match.range, in: string)!, with: characters)
        }
        return string
    }
}

It basically looks for any \u<1-8 digit hex> values and converts them into scalars. Should be fairly straightforward... 🧐 I've tried to test it a fair bit, but I'm not sure it catches every edge case.

My playground testing code was simply:

let string = "Be careful\\u00e2\\u0080\\u0094-\\u1F496\\u65\\u301it's breeding season"
let expected = "Be careful\u{00e2}\u{0080}\u{0094}-\u{1f496}\u{65}\u{301}it's breeding season"
string.convertingUnicodeCharacters() == expected // true 🎉
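Converting the escapes to scalars still leaves the text in its mis-decoded form (â followed by two control characters), so the Latin-1/UTF-8 step from the question is needed afterwards. Below is a self-contained sketch of both steps without regular expressions. It assumes strict JSON-style escapes of exactly four hex digits, which is narrower than the 1-8 digits accepted by the extension above, and decodingUnicodeEscapes is a hypothetical name.

```swift
import Foundation

// Hypothetical helper: replace each \uXXXX (exactly four hex digits) with its scalar.
// Anything that is not a valid escape (including surrogates) is left as-is.
func decodingUnicodeEscapes(_ input: String) -> String {
    var output = ""
    var rest = Substring(input)
    while let range = rest.range(of: "\\u") {
        output.append(contentsOf: rest[..<range.lowerBound])
        let afterMarker = rest[range.upperBound...]
        let hex = afterMarker.prefix(4)
        if hex.count == 4, hex.allSatisfy({ $0.isHexDigit }),
           let value = UInt32(hex, radix: 16),
           let scalar = Unicode.Scalar(value) {
            output.unicodeScalars.append(scalar)
            rest = afterMarker.dropFirst(4)
        } else {
            output.append("\\u")
            rest = afterMarker
        }
    }
    output.append(contentsOf: rest)
    return output
}

let escaped = "Be careful\\u00e2\\u0080\\u0094it's breeding season"
let scalars = decodingUnicodeEscapes(escaped) // escapes resolved, text still mis-decoded
let repaired = scalars.data(using: .isoLatin1).flatMap { String(data: $0, encoding: .utf8) }
print(repaired ?? scalars)
// Be careful—it's breeding season
```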
