
What does normalize in unicodedata mean?

I am very new to the encoding/decoding part and would like to know why... I have a dictionary, and I wonder why normalization needs to be applied in this case when the key is added. Does it have anything to do with the previous key and the new key? What happens if I don't normalize?

import csv
import unicodedata

pre_keys = ['One sample', 'Two samples', 'Three samples']
new_keys = ['one_sample', 'two_Samples', 'three_samples']
my_dict = {'index': {}}

with open('file.csv') as input_file:
    reader = csv.DictReader(input_file)

    for row in reader:
        # pair each original column name with its new key
        for pre_key, new_key in zip(pre_keys, new_keys):
            # default to '' so normalize() never receives None
            my_dict['index'][new_key] = unicodedata.normalize(
                "NFKD", row.get(pre_key, ''))

Normalization is not about encoding and decoding, but about choosing a "normal" (expected) form to represent a character.

The classic example is a character with an accent. Such characters often have two representations: one with the base character codepoint followed by a combining codepoint describing the accent, and one with just a single codepoint (describing both the character and the accent).
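A minimal sketch with Python's unicodedata (the module the question already uses), showing the two representations of "é":

    import unicodedata

    composed = "\u00e9"      # é as one codepoint: LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

    print(composed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True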

Additionally, sometimes you have two or more accents (or other marks: dots, cedillas, etc.). In this case, you may want them in a specific order, as sketched below.
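A small sketch of that reordering (U+0307 is COMBINING DOT ABOVE, U+0323 is COMBINING DOT BELOW; normalization sorts combining marks by their canonical combining class):

    import unicodedata

    a = "q\u0307\u0323"   # q + dot above + dot below
    b = "q\u0323\u0307"   # q + dot below + dot above

    print(a == b)                                                              # False
    print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))  # True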

Unicode also adds new characters and codepoints over time. You may have some old typographic way of writing a letter (or a kanji). In some contexts (display) it is important to keep the distinction (English, too, once had two forms of the letter s), but for reading or analysis one wants the semantic letter (so, the normalized one).

And there are a few cases where you may end up with unnecessary characters (e.g. if you type on a "Unicode keyboard").

So why do we need normalization?

  • the simple case: comparing strings: strings that are visually and semantically the same could be represented in different forms, so we choose one normalization form so that we can compare them (see the sketch after this list).

  • collation (sorting) algorithms work a lot better (fewer special cases) if we only have to handle one form; the same goes for changing case (lower case, upper case): it is better to have a single form to handle.

  • handling strings can be easier: if you need to remove accents, the easy way is to use a decomposed form and then drop the combining characters (also shown in the sketch after this list).

  • to encode into another character set, it is better to have a composed form (or both): if the target charset has the composed character, transcode it; otherwise there are many ways to handle it.
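A minimal sketch of the first and third points, again with Python's unicodedata (unicodedata.combining() returns 0 for non-combining characters):

    import unicodedata

    # comparing: bring both strings into one form first
    s1 = "caf\u00e9"      # café, precomposed
    s2 = "cafe\u0301"     # café, decomposed
    print(unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2))  # True

    # removing accents: decompose, then drop the combining characters
    def strip_accents(s):
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_accents("caf\u00e9"))   # cafe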

So "normalize" means to transform the same string into an unique Unicode representation. The canonical transformation uses a strict definition of same ; instead the compatibility normalization interpret the previous same into something like *it should have been the same, if we follow the Unicode philosophy, but practice we had to make some codepoints different to the preferred one*. So in compatibility normalization we could lose some semantics, and a pure/ideal Unicode string should never have a "compatibility" character.

In your case: the csv file could have been edited by different editors, with different conventions for how to represent accented characters. With normalization, you are sure that the same key will be stored as the same entry in the dictionary.
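Without normalization, two visually identical keys can silently become two different dictionary entries; a sketch with a hypothetical key "résumé":

    import unicodedata

    key1 = "r\u00e9sum\u00e9"     # "résumé" from one editor (precomposed)
    key2 = "re\u0301sume\u0301"   # "résumé" from another editor (decomposed)

    d = {}
    d[key1] = 1
    d[key2] = 2
    print(len(d))   # 2 -- two entries for the "same" key

    d = {}
    d[unicodedata.normalize("NFKD", key1)] = 1
    d[unicodedata.normalize("NFKD", key2)] = 2
    print(len(d))   # 1 -- the keys collapse into a single entry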
