How is the built-in function str.lower() implemented?

Question

I wonder how str.lower() is implemented in Python, so I cloned the cpython repository and searched it with grep. After a few jumps starting from unicode_lower in Objects/unicodeobject.c, I came across this inside Objects/unicodetype.c:

int _PyUnicode_ToLowerFull(Py_UCS4 ch, Py_UCS4 *res)
{
    const _PyUnicode_TypeRecord *ctype = gettyperecord(ch);

    if (ctype->flags & EXTENDED_CASE_MASK) {
        int index = ctype->lower & 0xFFFF;
        int n = ctype->lower >> 24;
        int i;
        for (i = 0; i < n; i++)
            res[i] = _PyUnicode_ExtendedCase[index + i];
        return n;
    }
    res[0] = ch + ctype->lower;
    return 1;
}

I am familiar with C, but pretty unfamiliar with how Python is implemented (but want to change that!). I don't really understand what is going on, so I'm seeking help here for a clear explanation.

Answer 1

There are two branches in the function you show. Which branch runs depends on the flags field of the _PyUnicode_TypeRecord for the character in question. If it has the EXTENDED_CASE_MASK bit set, the more complicated code runs; otherwise the simpler version is used.

Let's look at the simple part first:

res[0] = ch + ctype->lower;
return 1;

This simply adds the value of the lower field as an offset to the input code point, stores the result in the first slot of the res output argument, and returns 1 (since one character was written).
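The offset idea can be checked from Python itself. The sketch below is only an illustration: simple_lower is a hypothetical helper, and the offset is computed from the code points rather than read from CPython's internal tables, which are not shown here.

```python
def simple_lower(ch, delta):
    """Mimic the simple branch: add a signed offset to the code point."""
    return chr(ord(ch) + delta)

# Distance between 'A' and 'a', derived from the code points themselves.
# This is an assumption standing in for the record's lower field.
delta = ord("a") - ord("A")

lowered_a = simple_lower("A", delta)
lowered_omega = simple_lower("\u03a9", ord("\u03c9") - ord("\u03a9"))  # Greek capital omega
print(delta, lowered_a, lowered_omega)
```

For characters handled by the simple branch, the result should match what str.lower() returns for the same single character.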

Now for the more complicated version:

int index = ctype->lower & 0xFFFF;
int n = ctype->lower >> 24;
int i;
for (i = 0; i < n; i++)
    res[i] = _PyUnicode_ExtendedCase[index + i];
return n;

In this version, the lower field is interpreted as two different numbers. The lowest 16 bits are index, while the bits from bit 24 upward become n (the number of characters to be output). The code then loops over the n characters in the _PyUnicode_ExtendedCase array starting at index, copying them into the res array. Finally it returns the number of characters written.
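To make the bit unpacking concrete, here is a minimal Python sketch of the same logic. EXTENDED_CASE, pack, and to_lower_full are hypothetical names, and the table contents are invented for the illustration; only the masking and shifting mirror the C code above.

```python
# Hypothetical stand-in for _PyUnicode_ExtendedCase: a flat list of code points.
EXTENDED_CASE = [0x0041, 0x0069, 0x0307, 0x0042]

def pack(index, n):
    """Build a lower field with n in the bits from 24 upward and index in the low 16 bits."""
    return (n << 24) | index

def to_lower_full(lower_field):
    """Mirror the extended branch: unpack index and n, then copy n code points."""
    index = lower_field & 0xFFFF
    n = lower_field >> 24
    return EXTENDED_CASE[index:index + n]

field = pack(index=1, n=2)
result = "".join(chr(cp) for cp in to_lower_full(field))
print(hex(field), [hex(ord(c)) for c in result])
```

With this made-up table, the entry at index 1 with length 2 yields 'i' followed by U+0307 (combining dot above), the same two code points that '\u0130'.lower() produces in Python 3.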

This more complicated code is needed to handle case changes for Unicode code points whose case mapping is more than one character long, such as those that represent a ligature of two characters (which generally exist for obscure historical reasons, for example because they would have been on a single type block in old movable-type printing). Such a ligature may exist in only one case if the characters in the other case don't overlap as much. As an example, the character 'ﬂ' is a ligature of the lowercase characters 'f' and 'l'. No uppercase version of the ligature exists, so 'ﬂ'.upper() needs to return a two-character string ('FL').
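The effect is easy to observe from the interpreter. The snippet below checks the ligature example from the text; the 'ß' and 'İ' cases are additional examples not taken from the answer, included on the assumption of a standard Python 3 build.

```python
ligature = "\ufb02"            # 'ﬂ', LATIN SMALL LIGATURE FL
upper_fl = ligature.upper()
print(upper_fl, len(upper_fl))  # FL 2

sharp_s = "\u00df"             # 'ß' has no single-character uppercase in str.upper()
upper_ss = sharp_s.upper()
print(upper_ss, len(upper_ss))  # SS 2

dotted_i = "\u0130"            # 'İ' lowercases to 'i' plus a combining dot above
lower_i = dotted_i.lower()
print(len(lower_i))             # 2
```

In each case one input code point maps to two output code points, which is exactly the situation a single added offset cannot express.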

Answer 1 (2, accepted, 2017-02-01 09:07:17)
