
Extracting words with non-ASCII characters by Python regular expressions

I want to extract some text that contains non-ASCII characters. The problem is that the program treats the non-ASCII characters as delimiters! I tried this:

import re

regex_fmla = '(?:title=[\'"])([:/.A-z?<_&\s=>0-9;-]+)'
c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2 = '<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'
c_list = [c1, c2]
for c in c_list:
    print(re.findall(regex_fmla, c))

The result is:

['Climate data: C']
['Climate data: Cameroon']

Notice that the first country is not correct: the match breaks at the ô. It should be:

['Climate data: Côte d\'Ivoire']

I searched on Stack Overflow and found an answer that suggests using the re.UNICODE flag, but it returns the same wrong answer!

How can I fix this?

I would suggest using BeautifulSoup for parsing HTML:

from bs4 import BeautifulSoup as bs

c1='<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2='<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'


for c in [c1, c2]:
    soup = bs(c, 'html.parser')
    print(soup.find('a')['title'])

For more links ( <a ...> ), use the .findAll() method:

for c in [bightml]:  # bightml: an HTML string containing several <a> tags
    soup = bs(c, 'html.parser')
    for a in soup.findAll('a'):
        print(a['title'])

If you need anything that has a title attribute:

for a in soup.findAll(title=True):
    print(a['title'])
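If bs4 is not available, the same title-attribute scan can be sketched with the standard library's html.parser; TitleCollector is a name invented for this example:

```python
from html.parser import HTMLParser

# Stdlib-only sketch, assuming bs4 is unavailable: collect the title
# attribute from every tag that carries one, mirroring
# soup.findAll(title=True).
class TitleCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'title' in attrs:
            self.titles.append(attrs['title'])

parser = TitleCollector()
parser.feed('<a href="/climate/cote-d-ivoire.html" '
            'title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>')
print(parser.titles)
```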

I would also suggest BeautifulSoup, but since you seem to want to know how to include those special characters, you can change your regular expression to this:

ex = 'title="(.+?)"'

and then:

c1='<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'

for x in re.findall(ex, c1):
    print(x)

Output:

Climate data: Côte d'Ivoire
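For context, the lazy quantifier +? is what keeps each match from running past the first closing quote; on a string with several links it yields one title per tag:

```python
import re

html = ('<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
        '<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>')

# .+? matches as little as possible, stopping at the first closing quote,
# so each title is captured separately instead of spanning both tags.
titles = re.findall(r'title="(.+?)"', html)
print(titles)
```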

I suggest using Beautiful Soup, but if you would prefer sticking to re:

import re

regex_fmla = '(?:title=[\'"])([\w :\':/.]+)'

c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2 = '<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'
c_list = [c1, c2]

for c in c_list:
    print(re.findall(regex_fmla, c, flags=re.UNICODE))

I believe the reason re.UNICODE did not work is that the expression explicitly defines the alphabet as [A-z0-9], which only covers ASCII (and, because the A-z range spans the code points between 'Z' and 'a', also picks up the characters [ \ ] ^ _ `). If we change that to simply [\w], the flag works correctly.
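The difference can be demonstrated directly: [A-z] covers only ASCII code points, so the match stops at the accented ô, while \w (Unicode-aware by default in Python 3; Python 2 needed the re.UNICODE flag) includes it:

```python
import re

s = 'title="Climate data: Côte d\'Ivoire"'

# [A-z] covers only ASCII code points, so matching stops at the
# accented ô, reproducing the truncated result from the question.
ascii_match = re.findall(r'(?:title=")([A-z\s:\']+)', s)

# \w is Unicode-aware by default in Python 3, so the accented
# character is included in the match.
unicode_match = re.findall(r'(?:title=")([\w\s:\']+)', s)

print(ascii_match)
print(unicode_match)
```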
