
How to count French characters?

I have a function that splits a text into words and checks whether the result exceeds 245 characters. What I want to do is check whether there are any French characters and count each of them as 2. For instance, the letter 'a' is counted as 1, but 'à' should be counted as 2.

def count_chars(content: str) -> str:
    chars = ['à', 'â', 'æ', 'ç', 'é', 'è', 'ê', 'ë', 'î', 'ï', 'ô', 'œ', 'ù', 'û', 'ü', 'ÿ',
             'À', 'Â', 'Æ', 'Ç', 'É', 'È', 'Ê', 'Ë', 'Î', 'Ï', 'Ô', 'Œ', 'Ù', 'Û', 'Ü', 'Ÿ']
    words = content.split()
    new_content = ''
    for word in words:
        if len(new_content + word) <= 245:
            new_content += ' ' + word
        else:
            break
    return new_content.strip()

It's probably best to keep a length variable:

def count_chars(content: str) -> str:
    chars = {'à', 'â', 'æ', 'ç', 'é', 'è', 'ê', 'ë', 'î', 'ï', 'ô', 'œ', 'ù', 'û', 'ü', 'ÿ',
             'À', 'Â', 'Æ', 'Ç', 'É', 'È', 'Ê', 'Ë', 'Î', 'Ï', 'Ô', 'Œ', 'Ù', 'Û', 'Ü', 'Ÿ'}
    words = content.split()
    new_content = ''
    length = 0
    for word in words:
        word_length = sum(2 if char in chars else 1 for char in word)
        if length + word_length <= 245:
            new_content += ' ' + word
            length += word_length
        else:
            break
    return new_content.strip()

Note: I also turned chars into a set, which makes each lookup (theoretically) O(1) instead of O(n) for a list.
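As a quick check of the counting rule, here is a minimal sketch that isolates the per-character weighting. The names `FRENCH_CHARS` and `weighted_len` are hypothetical and not part of the answer; the character set is copied from the code above.

```python
# Hypothetical helper names; the characters are the same 32 listed in the answer.
FRENCH_CHARS = frozenset("àâæçéèêëîïôœùûüÿÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ")


def weighted_len(text: str) -> int:
    """Count each character in FRENCH_CHARS as 2 and every other character as 1."""
    return sum(2 if char in FRENCH_CHARS else 1 for char in text)


print(weighted_len("a"))     # 1
print(weighted_len("à"))     # 2
print(weighted_len("déjà"))  # 6: d=1, é=2, j=1, à=2
```

Note that, like the function above, this weighs only the characters of each word; the spaces used to join the words are not added to the running length.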

After thorough analysis and research I found a much better way to do it. The function below counts bytes in the UTF-8 encoding defined by the Unicode standard, so there is no need to provide a list of characters. A link to documentation where everything is explained is here:

def utf8len(s):
    return len(s.encode('utf-8'))


def trunc_string(content: str) -> str:
    # split words to count number of characters
    words = content.split()
    new_content = ''
    length = 0
    for word in words:
        # in UTF-8, non-ASCII characters such as 'à' take more than 1 byte
        word_length = utf8len(word)
        # keep the word only if the running total stays within 245
        if length + word_length <= 245:
            new_content += ' ' + word
            length += word_length
        else:
            break
    return new_content.strip()
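To see what the byte count actually measures, the sketch below prints the UTF-8 length of a few strings. It repeats `utf8len` from above so that it runs on its own. The NFC normalization step at the end is an assumption on my part, not something the answer does: it may be worth adding if the input can contain accents written as separate combining marks.

```python
import unicodedata


def utf8len(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s (same as in the answer above)."""
    return len(s.encode("utf-8"))


composed = "\u00e0"     # 'à' as a single code point
decomposed = "a\u0300"  # 'a' followed by a combining grave accent

print(utf8len("a"))         # 1 byte
print(utf8len(composed))    # 2 bytes
print(utf8len(decomposed))  # 3 bytes: 1 for 'a' plus 2 for U+0300
print(utf8len("\u20ac"))    # 3 bytes: the euro sign needs more than 2 bytes

# Assumption: normalizing to NFC first makes equivalent spellings count the same.
print(utf8len(unicodedata.normalize("NFC", decomposed)))  # 2
```

So the accented letters from the question each take 2 bytes in their single-code-point form, but this is a byte count, not a "French characters count double" rule: other non-ASCII characters can take 3 or 4 bytes.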
