
Extracting words with non-ASCII characters by Python regular expressions

I want to extract some text that contains non-ASCII characters. The problem is that the program treats the non-ASCII characters as delimiters! I tried this:

import re

regex_fmla = '(?:title=[\'"])([:/.A-z?<_&\s=>0-9;-]+)'
c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2 = '<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'
c_list = [c1, c2]
for c in c_list:
    print(re.findall(regex_fmla, c))

The result is:

['Climate data: C']
['Climate data: Cameroon']

Notice that the first country is not correct: the match breaks at the ô. It should be:

['Climate data: Côte d\'Ivoire']

I searched on Stack Overflow and found an answer that suggests using the re.UNICODE flag, but it returns the same wrong answer!

How can I fix this?

I would suggest using BeautifulSoup for parsing HTML:

from bs4 import BeautifulSoup as bs

c1='<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2='<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'


for c in [c1, c2]:
    soup = bs(c, 'html.parser')
    print(soup.find('a')['title'])

For more links ( <a ...> ), use the .findAll() method:

for c in [bightml]:  # bightml: an HTML string containing several <a> tags
    soup = bs(c, 'html.parser')
    for a in soup.findAll('a'):
        print(a['title'])

If you need anything that has a title attribute:

for a in soup.findAll(title=True):
    print(a['title'])
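If bs4 is not available, the same title-attribute scan can be sketched with the standard library's html.parser; TitleCollector is a name invented for this example:

```python
from html.parser import HTMLParser

# Stdlib-only sketch, assuming bs4 is unavailable: collect the title
# attribute from every tag that carries one, mirroring
# soup.findAll(title=True).
class TitleCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'title' in attrs:
            self.titles.append(attrs['title'])

parser = TitleCollector()
parser.feed('<a href="/climate/cote-d-ivoire.html" '
            'title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>')
print(parser.titles)
```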

I would also suggest BeautifulSoup, but since you seem to want to know how to include those special characters, you can change your regular expression to this:

ex = 'title="(.+?)"'

and then:

c1='<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'

for x in re.findall(ex, c1):
    print(x)

Output:

Climate data: Côte d'Ivoire
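For context, the lazy quantifier +? is what keeps each match from running past the first closing quote; on a string with several links it yields one title per tag:

```python
import re

html = ('<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
        '<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>')

# .+? matches as little as possible, stopping at the first closing quote,
# so each title is captured separately instead of spanning both tags.
titles = re.findall(r'title="(.+?)"', html)
print(titles)
```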

I suggest using Beautiful Soup, but if you would prefer sticking to re:

import re

regex_fmla = '(?:title=[\'"])([\w :\':/.]+)'

c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2 = '<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'
c_list = [c1, c2]

for c in c_list:
    print(re.findall(regex_fmla, c, flags=re.UNICODE))

I believe the reason re.UNICODE did not work is that the expression explicitly defines the alphabet as [A-z0-9], which only covers ASCII (and, because the A-z range spans the code points between 'Z' and 'a', also picks up the characters [ \ ] ^ _ `). If we change that to simply [\w], the flag works correctly.
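The difference can be demonstrated directly: [A-z] covers only ASCII code points, so the match stops at the accented ô, while \w (Unicode-aware by default in Python 3; Python 2 needed the re.UNICODE flag) includes it:

```python
import re

s = 'title="Climate data: Côte d\'Ivoire"'

# [A-z] covers only ASCII code points, so matching stops at the
# accented ô, reproducing the truncated result from the question.
ascii_match = re.findall(r'(?:title=")([A-z\s:\']+)', s)

# \w is Unicode-aware by default in Python 3, so the accented
# character is included in the match.
unicode_match = re.findall(r'(?:title=")([\w\s:\']+)', s)

print(ascii_match)
print(unicode_match)
```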
