Split a string into a list by a set of strings

I am working with words written in the Uzbek language, which has the following letters:

alphabet = ["a", "b", "c", "d", "e", "f", "g", "g'", "h", "i", 
    "j", "k", "l", "m", "n", "ng", "o", "o'", "p", "q", "r", 
    "s", "sh", "t", "u", "v", "x", "y", "z"]

As you can see, some letters consist of multiple characters, such as o', g' and sh. How can I split a word in this language into a list of Uzbek letters? For example, the word "o'zbek" should be split into ["o'", "z", "b", "e", "k"].

If I do the following:

word = "o'zbek"
letters = list(word)

It results in:

['o', "'", 'z', 'b', 'e', 'k']

which is incorrect, because o and ' are not kept together.

I also tried using a regex like this:

import re
expression = "|".join(alphabet)
re.split(expression, word)

But it results in:

['', "'", '', '', '', '']

To give priority to the multi-character letters, first sort the alphabet by length, longest first. Then join it with "|".join as you did, compile the result into a regex, and re.findall returns the list of letters:

import re

sorted_alphabet = sorted(alphabet, key=len, reverse=True)
regex = re.compile("|".join(sorted_alphabet))

def split_word(word):
    return re.findall(regex, word)

Usage:

>>> split_word("o'zbek")
["o'", 'z', 'b', 'e', 'k']

>>> split_word("asha")
['a', 'sh', 'a']
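One caveat with re.findall is that it silently skips any character the pattern cannot match. The following is a minimal sketch, not part of the answer above, that adds a check for this; the names ALPHABET and split_word_strict are hypothetical, and re.escape is used only as a precaution in case a letter ever contains a regex metacharacter.

```python
import re

ALPHABET = ["a", "b", "c", "d", "e", "f", "g", "g'", "h", "i",
            "j", "k", "l", "m", "n", "ng", "o", "o'", "p", "q", "r",
            "s", "sh", "t", "u", "v", "x", "y", "z"]

# Longest letters first, so "o'" is tried before "o" in the alternation.
_LETTER_RE = re.compile(
    "|".join(re.escape(letter)
             for letter in sorted(ALPHABET, key=len, reverse=True))
)


def split_word_strict(word):
    """Split word into letters; raise ValueError if any character is skipped."""
    letters = _LETTER_RE.findall(word)
    # findall drops unmatched characters, so the pieces only rebuild the
    # original word when every character belonged to some letter.
    if "".join(letters) != word:
        raise ValueError(f"cannot split {word!r} using the given alphabet")
    return letters


print(split_word_strict("o'zbek"))
print(split_word_strict("asha"))
```

Whether to raise, skip, or keep unknown characters is a design choice; raising is assumed here only because it makes unexpected input visible.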

Something like this works.

double = {"o'", "ng", "g'", "sh"}

string = "o'zbek"
letters = []
while string:
    if string[:2] in double:
        letters.append(string[:2])
        string = string[2:]
    else:
        letters.append(string[0])
        string = string[1:]

If there are no letters of three or more characters, you can put all the double letters in a set (testing membership in a set is faster than in a list).

Then you go through the string and check whether it starts with a double letter. If it does, you store those two characters in the list of letters and drop them from the string; otherwise you store the single first character.
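The same scan can be wrapped in a function that walks the word by index instead of re-slicing it, and that tries the longest candidate first so it would also handle letters longer than two characters. This is a sketch under those assumptions; split_letters and UZBEK are hypothetical names, not from the answer above.

```python
UZBEK = ["a", "b", "c", "d", "e", "f", "g", "g'", "h", "i",
         "j", "k", "l", "m", "n", "ng", "o", "o'", "p", "q", "r",
         "s", "sh", "t", "u", "v", "x", "y", "z"]


def split_letters(word, alphabet=UZBEK):
    """Greedily split word into the longest matching letters, left to right."""
    letters = set(alphabet)                      # set gives fast membership tests
    max_len = max(len(letter) for letter in letters)
    result = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(max_len, 0, -1):
            chunk = word[i:i + size]
            if chunk in letters:
                result.append(chunk)
                i += len(chunk)
                break
        else:
            # No candidate of any length matched at position i.
            raise ValueError(f"no letter matches at position {i}: {word[i:]!r}")
    return result


print(split_letters("o'zbek"))
print(split_letters("tong"))
```

Unlike the loop above, this version raises on characters outside the alphabet instead of appending them as single letters; that behaviour is an assumption and can be relaxed if stray characters should be kept.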

import re
letters = re.findall("(o'|g'|ng|sh|[a-z])", string)

works too.

If you are looking specifically for a regex, you could use re.findall with a pattern like this:

[a-fh-mp-rt-z]|[go]'?|ng?|sh?
  • [a-fh-mp-rt-z] - A character class holding all the ordinary single letters.
  • | - Or:
  • [go]'? - Either "g" or "o", followed by an optional apostrophe.
  • | - Or:
  • ng? - A literal "n" followed by an optional "g".
  • | - Or:
  • sh? - A literal "s" followed by an optional "h".

See the online demo.

import re
word = "o'zbek"
letters = re.findall("[a-fh-mp-rt-z]|[go]'?|ng?|sh?", word)
print(letters)

Prints:

["o'", 'z', 'b', 'e', 'k']

Note that you could also give priority to those "double" letters like so: [go]'|ng|sh|[a-z], much as @MustafaAydin explained in his answer.
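As a quick sanity check, the two patterns mentioned in this answer can be run side by side on a few words. This is only an illustrative sketch: the sample words and the names PATTERNS, SAMPLES and split_all are assumptions, and agreement on these samples does not prove the patterns are equivalent for every input.

```python
import re

PATTERNS = {
    "optional suffix": r"[a-fh-mp-rt-z]|[go]'?|ng?|sh?",
    "doubles first": r"[go]'|ng|sh|[a-z]",
}

# Hypothetical sample words chosen to exercise o', g', ng and sh.
SAMPLES = ["o'zbek", "asha", "tong", "g'isht"]


def split_all(word):
    """Return the split produced by each pattern, keyed by pattern name."""
    return {name: re.findall(pattern, word) for name, pattern in PATTERNS.items()}


for sample in SAMPLES:
    results = split_all(sample)
    print(sample, results)
    # Both patterns are expected to agree on these samples.
    assert results["optional suffix"] == results["doubles first"]
```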
