How is the built-in function str.lower() implemented?

I wonder how str.lower() is implemented in Python, so I cloned the cpython repository and searched it with grep. After a few jumps starting from unicode_lower in Objects/unicodeobject.c, I came across this in Objects/unicodetype.c:

int _PyUnicode_ToLowerFull(Py_UCS4 ch, Py_UCS4 *res)
{
    const _PyUnicode_TypeRecord *ctype = gettyperecord(ch);

    if (ctype->flags & EXTENDED_CASE_MASK) {
        int index = ctype->lower & 0xFFFF;
        int n = ctype->lower >> 24;
        int i;
        for (i = 0; i < n; i++)
            res[i] = _PyUnicode_ExtendedCase[index + i];
        return n;
    }
    res[0] = ch + ctype->lower;
    return 1;
}

I am familiar with C, but pretty unfamiliar with how Python is implemented (though I want to change that!). I don't really understand what is going on, so I am asking here for a clear explanation.

There are two branches in the function you show. Which branch runs depends on the flags field of the _PyUnicode_TypeRecord for the character in question. If the EXTENDED_CASE_MASK bit is set, the more complicated code runs; otherwise the simpler version is used.

Let's look at the simple part first:

res[0] = ch + ctype->lower;
return 1;

This simply adds the value of the lower field as an offset to the input code point, stores the result in the first slot of the res output argument, and returns 1 (since one character was written).
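The offset idea can be observed from Python itself. The following is a minimal sketch, assuming that ordinary ASCII and Latin-1 letters go through this simple branch (the C source shown here does not say which characters do). The helper name lower_delta is made up for illustration and is not part of CPython.

```python
def lower_delta(ch: str) -> int:
    """Return the code point distance between ch and its lowercase form.

    Hypothetical helper: it only makes sense when lower() yields exactly
    one character, which corresponds to the simple branch returning 1.
    """
    lowered = ch.lower()
    if len(lowered) != 1:
        raise ValueError("lowercase form is not a single code point")
    return ord(lowered) - ord(ch)


for ch in ("A", "Z", "\u00c9", "a"):
    delta = lower_delta(ch)
    # Adding the delta back to the input reproduces str.lower().
    assert chr(ord(ch) + delta) == ch.lower()
    print(f"U+{ord(ch):04X}: delta {delta}")
```

A character that is already lowercase has a delta of 0, so the same addition handles it without a special case.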

Now for the more complicated version:

int index = ctype->lower & 0xFFFF;
int n = ctype->lower >> 24;
int i;
for (i = 0; i < n; i++)
    res[i] = _PyUnicode_ExtendedCase[index + i];
return n;

In this version, the lower field is interpreted as two different numbers. The lowest 16 bits are index, while the bits from bit 24 upward become n (the number of characters to be output). The code then loops over the n characters in the _PyUnicode_ExtendedCase array starting at index, copying them into the res array. Finally it returns the number of characters written.
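To make the bit manipulation concrete, here is a small Python simulation of the same unpacking. It is only a sketch: EXTENDED_CASE is a made-up four-entry stand-in for _PyUnicode_ExtendedCase (the real table is generated from the Unicode database and looks different), and pack and unpack_and_copy are hypothetical helper names.

```python
# Hypothetical stand-in for _PyUnicode_ExtendedCase; the real table's
# contents and layout are different.
EXTENDED_CASE = [0x0069, 0x0307, 0x0046, 0x004C]


def pack(index: int, n: int) -> int:
    """Pack an index (low 16 bits) and a length (bit 24 upward) into one int."""
    assert 0 <= index <= 0xFFFF and 0 <= n <= 0x7F
    return (n << 24) | index


def unpack_and_copy(packed: int) -> list:
    """Mirror the extended branch: split the field, then copy n entries."""
    index = packed & 0xFFFF   # same mask as the C code
    n = packed >> 24          # same shift as the C code
    return [EXTENDED_CASE[index + i] for i in range(n)]


field = pack(index=0, n=2)
print(hex(field))                                # 0x2000000
print([hex(cp) for cp in unpack_and_copy(field)])  # ['0x69', '0x307']
```

Packing both numbers into one field lets the same record layout serve both branches: the field is a plain delta when the flag is clear, and an (index, length) pair when it is set.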

This more complicated code is needed to handle case changes for Unicode code points whose case mapping is more than one character long, such as a ligature of two characters (these generally exist for obscure historical reasons, for example because they would have been on a single type block in old movable-type printing). These ligatures may only exist in a single case if the characters in the other case don't overlap as much. As an example, the character 'ﬂ' (U+FB02) is a ligature of the lowercase characters 'f' and 'l'. No uppercase version of the ligature exists, so 'ﬂ'.upper() needs to return a two-character string ('FL').
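The effect is easy to check from the interpreter. The snippet below is a short sketch assuming CPython 3 with its bundled Unicode data. The ligature example comes from the text above; the other two characters are additional illustrations of one-to-many mappings, including one that affects lower() directly.

```python
ligature = "\ufb02"  # LATIN SMALL LIGATURE FL
# One code point in, two code points out.
print(len(ligature), len(ligature.upper()))  # 1 2
assert ligature.upper() == "FL"

# LATIN CAPITAL LETTER I WITH DOT ABOVE lowercases to 'i' followed by
# COMBINING DOT ABOVE, so lower() can also lengthen a string.
dotted_i = "\u0130"
assert dotted_i.lower() == "i\u0307"

# LATIN SMALL LETTER SHARP S uppercases to two letters.
sharp_s = "\u00df"
assert sharp_s.upper() == "SS"
```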
