Extracting a string using a regular expression

Question

I have the following input string:

input= """href="http://www.sciencedirect.com/science/article/pii/S0167923609002097" onmousedown="return scife_clk(this.href,'','res','2')">Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast</a></h3><div class="gs_a">N Li, <a href="/citations?
    href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'ggp','res','1')">How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience</a></h3><div class="gs_a"><a href="/citations?
    href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'gga','gga','1')"><span class="gs_ggsL"><span class=gs_ctg2>[HTML]</span> from nih.gov</span><span class="gs_ggsS">nih.gov <span """

I want to extract the following output from it:

>> Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast
>> How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience

I am trying to use the re package in Python, but I am not sure which regular expression to use, because there are several patterns, such as:

(this.href,'','res','2')"> or (this.href,'ggp','res','2')"> or (this.href,'gga','gga','2')">

I am using this regular expression:

=re.search(r"(this.href,'ggp.?','res','.?/D')"

But this does not work for me. Can anyone tell me which regex to use?

Answer 1

This works for your example:

input= """\
href="http://www.sciencedirect.com/science/article/pii/S0167923609002097" onmousedown="return scife_clk(this.href,'','res','2')">Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast</a></h3><div class="gs_a">N Li, <a href="/citations?
href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'ggp','res','1')">How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience</a></h3><div class="gs_a"><a href="/citations?
href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'gga','gga','1')"><span class="gs_ggsL"><span class=gs_ctg2>[HTML]</span> from nih.gov</span><span class="gs_ggsS">nih.gov <span """

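import re

# Each title sits on its own line, so scan the input line by line.
for line in input.splitlines():
    # Skip lazily past the onmousedown attribute to its closing ">, then
    # capture everything up to the last </a> on the line.
    m = re.search(r'onmousedown=.*?">(.*)</a>', line)
    if m:
        print(m.group(1))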

Prints:

Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast
How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience

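Keep in mind that using regular expressions on HTML can be a minefield (or a mind field!), and using a parser is generally recommended. For fragments like these, though, you can make it work.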
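If the input is not guaranteed to keep one link per line, the same idea can be applied to the whole string at once. The following is a minimal sketch, not part of the original answer: it assumes the wanted titles always follow a scife_clk(...) call whose third argument is 'res', and it uses a shortened stand-in for the question's input with placeholder URLs. The names sample and TITLE_RE are illustrative.

```python
import re

# Shortened stand-in for the question's input; the URLs are placeholders.
sample = """href="http://example.com/1" onmousedown="return scife_clk(this.href,'','res','2')">Using <b>text mining </b>and sentiment analysis</a></h3><div class="gs_a">N Li, <a href="/citations?
href="http://example.com/2" onmousedown="return scife_clk(this.href,'ggp','res','1')">How to link ontologies: <b>text</b>-<b>mining </b>approaches</a></h3><div class="gs_a"><a href="/citations?
href="http://example.com/2" onmousedown="return scife_clk(this.href,'gga','gga','1')"><span class="gs_ggsL">[HTML] from nih.gov</span><span """

# Match the scife_clk(...) call whose third argument is 'res' (assumption:
# only those links carry titles), then lazily capture up to the first </a>.
TITLE_RE = re.compile(
    r"""scife_clk\(this\.href,'[^']*','res','\d+'\)">(.*?)</a>""",
    re.DOTALL,
)

titles = TITLE_RE.findall(sample)
for title in titles:
    print(title)
```

The lazy (.*?) stops at the first </a>, so the match does not run on into a later link, and the 'gga' entry is skipped because its third argument is not 'res'.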

Answer 2

It would be much better to use a decent HTML parser. Take BeautifulSoup, for example:

from bs4 import BeautifulSoup

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(input, "html.parser")

for link in soup.find_all('a', onmousedown=True):
    print(link.text)

It finds all <a> elements that have an onmousedown attribute.
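Note that link.text returns only the text, so nested tags such as <b> are dropped, while the output requested in the question keeps them. As a rough sketch of the same parser-based approach that preserves the inner markup, here is a version built on the standard library's html.parser instead of BeautifulSoup. The class name OnMouseDownLinks and the sample markup are hypothetical, and the sketch assumes well-formed HTML with the opening <a> tag present.

```python
from html.parser import HTMLParser


class OnMouseDownLinks(HTMLParser):
    """Collect the inner HTML of every <a> that has an onmousedown attribute."""

    def __init__(self):
        super().__init__()
        self.results = []    # finished inner-HTML strings
        self._parts = None   # pieces of the current link, or None when outside one

    def handle_starttag(self, tag, attrs):
        if tag == "a" and "onmousedown" in dict(attrs):
            self._parts = []
        elif self._parts is not None:
            # Keep nested markup such as <b> exactly as it appeared.
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._parts is None:
            return
        if tag == "a":
            self.results.append("".join(self._parts))
            self._parts = None
        else:
            self._parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


# Hypothetical, well-formed sample; the URL is a placeholder.
page = (
    '<h3><a href="http://example.com/1" onmousedown="return scife_clk(this.href)">'
    'Using <b>text mining </b>and sentiment analysis</a></h3>'
    '<div class="gs_a">N Li, <a href="/citations?user=x">profile</a></div>'
)

parser = OnMouseDownLinks()
parser.feed(page)
parser.close()
print(parser.results)
```

The second link has no onmousedown attribute, so only the first link's content is collected.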
