Python - remove elements (foreign characters) from list

I have a Python list containing foreign characters, which are represented by Unicode escape values:

python_list = ['to', 'shrink', u'\u7e2e\u3080', u'\u3061\u3062\u3080', 'chijimu', 'tizimu', 'tidimu', 'to', 'continue', u'\u7d9a\u304f', u'\u3064\u3065\u304f', 'tsuzuku', 'tuzuku', 'tuduku', u'\u30ed\u30fc\u30de\u5b57\uff08\u30ed\u30fc\u30de\u3058\uff09\u3068\u306f\u3001\u4eee\u540d\u6587\u5b57\u3092\u30e9\u30c6\u30f3\u6587\u5b57\u306b\u8ee2\u5199\u3059\u308b\u969b\u306e\u898f\u5247\u5168\u822c\uff08\u30ed\u30fc\u30de\u5b57\u8868\u8a18\u6cd5\uff09\u3001\u307e\u305f\u306f\u30e9\u30c6\u30f3\u6587\u5b57\u3067\u8868\u8a18\u3055\u308c\u305f\u65e5\u672c\u8a9e\uff08\u30ed\u30fc\u30de\u5b57\u3064\u3065\u308a\u306e\u65e5\u672c\u8a9e\uff09\u3092\u8868\u3059\u3002']  

I need to remove all the items containing '\u7e2e' or other similar escapes. If an item in the list contains even one ASCII letter or word, it shouldn't be excluded; for example, 'China\u3062' should be kept. I referred to this question and realized it has something to do with values greater than 128. I tried different approaches like this one:

new_list = [item for item in python_list if ord(item) < 128]  

but this returns an error:

TypeError: ord() expected a character, but string of length 2 found

Expected output:

new_list = ['to', 'shrink','chijimu', 'tizimu', 'tidimu', 'to', 'continue','tsuzuku', 'tuzuku', 'tuduku']

How should I go about this?

If you wish to keep all words that contain at least one ASCII letter, the code below will do this:

from string import ascii_letters

python_list = ['to', 'shrink', u'\u7e2e\u3080', u'\u3061\u3062\u3080', 
               'chijimu','china,', 'tizimu', 'tidimu', 'to', 'continue', 
               u'\u7d9a\u304f', u'\u3064\u3065\u304f', 'tsuzuku', 'tuzuku', 'tuduku', u'china\u3061']

# Set of a-z and A-Z for fast membership tests
allowed = set(ascii_letters)

# Keep a word if at least one of its characters is an ASCII letter
output = [word for word in python_list if any(letter in allowed for letter in word)]
print(output)
# ['to',
#  'shrink',
#  'chijimu',
#  'china,',
#  'tizimu',
#  'tidimu',
#  'to',
#  'continue',
#  'tsuzuku',
#  'tuzuku',
#  'tuduku',
#  'chinaち']

This iterates through each letter of each word; if any single letter is also contained in allowed, the word is added to your output list.
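The same per-character idea can be sketched with ord(), which also shows why the attempt in the question failed: ord() accepts exactly one character, so it has to be applied to each character of an item rather than to the whole item. The helper name has_ascii_letter and the sample list are made up for illustration.

```python
def has_ascii_letter(word):
    # ord() takes a single character, so test the characters one at a time.
    # A character counts if it is in the ASCII range and is a letter.
    return any(ord(ch) < 128 and ch.isalpha() for ch in word)

# Hypothetical sample data, not the full list from the question
words = ['to', u'\u7e2e\u3080', u'china\u3061', '123']
kept = [w for w in words if has_ascii_letter(w)]
print(kept)
```

Here '123' is dropped because it contains no letter, while u'china\u3061' is kept because of its ASCII letters.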

You can handle it like this, since you want to keep the plain strings and remove the unicode ones:

new_list = [item for item in python_list if isinstance(item, str)]
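Note that the isinstance check only separates the items in Python 2, where 'to' is a str and u'\u7e2e\u3080' is a unicode object. In Python 3 every item is a str, so that filter would keep everything. A rough Python 3 counterpart, assuming Python 3.7 or later for str.isascii(), might look like the sketch below; the shortened sample list is an assumption for illustration.

```python
# Hypothetical shortened sample list
python_list = ['to', 'shrink', u'\u7e2e\u3080', 'chijimu', u'china\u3061']

# str.isascii() is True only when every character is ASCII,
# so mixed items such as 'china\u3061' are dropped as well.
ascii_only = [item for item in python_list if item.isascii()]
print(ascii_only)
```

Like the isinstance version, this drops mixed items, so it does not meet the requirement that 'China\u3062' be kept.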

Here's one way:

import string
python_list = ['to', 'shrink', u'\u7e2e\u3080', u'\u3061\u3062\u3080', 'chijimu', 'tizimu', 'tidimu', 'to', 'continue', u'\u7d9a\u304f', u'\u3064\u3065\u304f', 'tsuzuku', 'tuzuku', 'tuduku', u'\u30ed\u30fc\u30de\u5b57\uff08\u30ed\u30fc\u30de\u3058\uff09\u3068\u306f\u3001\u4eee\u540d\u6587\u5b57\u3092\u30e9\u30c6\u30f3\u6587\u5b57\u306b\u8ee2\u5199\u3059\u308b\u969b\u306e\u898f\u5247\u5168\u822c\uff08\u30ed\u30fc\u30de\u5b57\u8868\u8a18\u6cd5\uff09\u3001\u307e\u305f\u306f\u30e9\u30c6\u30f3\u6587\u5b57\u3067\u8868\u8a18\u3055\u308c\u305f\u65e5\u672c\u8a9e\uff08\u30ed\u30fc\u30de\u5b57\u3064\u3065\u308a\u306e\u65e5\u672c\u8a9e\uff09\u3092\u8868\u3059\u3002']
filtered = [s for s in python_list if all(c in string.ascii_letters for c in s)]
print(filtered)

Output:

['to', 'shrink', 'chijimu', 'tizimu', 'tidimu', 'to', 'continue', 'tsuzuku', 'tuzuku', 'tuduku']

Yet another way:

new_list=[]
for word in python_list:
    if word.encode('utf-8').decode('ascii','ignore') !='':
        new_list.append(word)
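To see what the encode/decode round trip does to individual items, here is a small sketch. The helper name ascii_part and the sample values are made up for illustration. Decoding with errors='ignore' silently drops every non-ASCII byte, so an item survives the filter if anything at all is left over.

```python
def ascii_part(word):
    # Encode to UTF-8 bytes, then decode as ASCII while discarding
    # any byte that is not valid ASCII.
    return word.encode('utf-8').decode('ascii', 'ignore')

print(ascii_part(u'china\u3061'))   # the ASCII letters remain
print(ascii_part(u'\u7e2e\u3080'))  # nothing remains
print(ascii_part(u'\u7e2e!'))       # only the punctuation remains
```

One consequence is that an item made of foreign characters plus ASCII punctuation or digits also passes this filter, since the leftover string is not empty.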
