Filter strings with non-English characters in Python
I have this list, for example:
items = ['www1bmobạnk.com', 'наш-переславль.рф', 'вольттек.рф', '別邸福の花浜松町店.com', 'благовест.рус', 'թարմլուր.հայ', '피시방.com', 'ått.com', '沃華科技.com', 'впамяти.рф', 'андрейбабкин.рф', '꽃셰프가.com', 'фортуна36.рф', 'わかば.com', 'тесты-на-коронавирус.рф', '第一個夏天.com', 'bëstchange.net', 'normaldomain.com']
I'm using re.match to filter them out:
for item in items:
    if re.match('[a-z0-9]', item):
        print("It's English! " + item)
    else:
        print("It's not! " + item)
The issue is that bëstchange.net and www1bmobạnk.com didn't get filtered. I added [^ë] and it worked for bëstchange.net, but [^ạ] didn't work for www1bmobạnk.com. I also tried Unicode [\a-\z], but it's pretty much the same thing.
Appreciate any suggestions!
You can check if a string contains a letter that is not an ASCII letter:
import re
items = ['www1bmobạnk.com', 'наш-переславль.рф', 'вольттек.рф', '別邸福の花浜松町店.com', 'благовест.рус', 'թարմլուր.հայ', '피시방.com', 'ått.com', '沃華科技.com', 'впамяти.рф', 'андрейбабкин.рф', '꽃셰프가.com', 'фортуна36.рф', 'わかば.com', 'тесты-на-коронавирус.рф', '第一個夏天.com', 'bëstchange.net', 'normaldomain.com']
for item in items:
    if not re.search(r'(?![a-zA-Z])[^\W\d_]', item):
        print(f"It's English! {item}")
    else:
        print(f"It's not! {item}")
See the Python demo. Only normaldomain.com passes the test now.
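As for why the original attempt let bëstchange.net and www1bmobạnk.com through: re.match only tries the pattern at the start of the string, and [a-z0-9] consumes a single character, so any item that merely begins with an ASCII lowercase letter or digit is accepted. A minimal sketch contrasting the two checks (the helper names first_char_check and has_only_ascii_letters are made up for illustration):

```python
import re

# Pattern from the answer: a Unicode letter that is not an ASCII letter.
NON_ASCII_LETTER = re.compile(r'(?![a-zA-Z])[^\W\d_]')


def first_char_check(item):
    """The question's test: re.match only inspects the first character."""
    return re.match('[a-z0-9]', item) is not None


def has_only_ascii_letters(item):
    """Hypothetical helper: True if no non-ASCII letter occurs anywhere."""
    return NON_ASCII_LETTER.search(item) is None


for item in ['bëstchange.net', 'www1bmobạnk.com', 'ått.com', 'normaldomain.com']:
    # Print both results side by side for comparison.
    print(item, first_char_check(item), has_only_ascii_letters(item))
```

The first check says "English" for anything starting with b, w or n, while the second scans the whole string with re.search.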
The (?![a-zA-Z])[^\W\d_] pattern matches any Unicode letter (with [^\W\d_]), but the (?![a-zA-Z]) negative lookahead "tempers" (restricts) this pattern so that it cannot match an ASCII letter.
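To see how the two parts of the pattern interact, here is a small sketch that classifies single characters. The classify helper and its labels are hypothetical, just for illustration:

```python
import re

# [^\W\d_] : a word character that is neither a digit nor an underscore.
LETTER = re.compile(r'[^\W\d_]')
# Same class, but the lookahead rejects positions holding an ASCII letter.
NON_ASCII_LETTER = re.compile(r'(?![a-zA-Z])[^\W\d_]')


def classify(ch):
    """Label one character according to which pattern matches it."""
    if NON_ASCII_LETTER.fullmatch(ch):
        return 'non-ASCII letter'
    if LETTER.fullmatch(ch):
        return 'ASCII letter'
    return 'not a letter'


for ch in ['a', 'Z', 'ë', 'я', '1', '_', '-', '.']:
    print(repr(ch), classify(ch))
```

Digits, underscores, hyphens and dots never match, so they cannot cause a domain to be rejected; only letters outside a-z and A-Z do.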