Extracting a string using a regular expression
I have the following input string:
input= """href="http://www.sciencedirect.com/science/article/pii/S0167923609002097" onmousedown="return scife_clk(this.href,'','res','2')">Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast</a></h3><div class="gs_a">N Li, <a href="/citations?
href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'ggp','res','1')">How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience</a></h3><div class="gs_a"><a href="/citations?
href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'gga','gga','1')"><span class="gs_ggsL"><span class=gs_ctg2>[HTML]</span> from nih.gov</span><span class="gs_ggsS">nih.gov <span """
I want to extract the following output from it:
>> Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast
>> How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience
I tried to use the re package in Python, but I am not sure which regular expression to use, because there are several patterns, such as:
(this.href,'','res','2')"> or (this.href,'ggp','res','2')"> or (this.href,'gga','gga','2')">
I am using this regular expression:
=re.search(r"(this.href,'ggp.?','res','.?/D')"
But this does not work for me. Can anyone tell me which regular expression to use?
This works for your example:
input= """\
href="http://www.sciencedirect.com/science/article/pii/S0167923609002097" onmousedown="return scife_clk(this.href,'','res','2')">Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast</a></h3><div class="gs_a">N Li, <a href="/citations?
href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'ggp','res','1')">How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience</a></h3><div class="gs_a"><a href="/citations?
href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'gga','gga','1')"><span class="gs_ggsL"><span class=gs_ctg2>[HTML]</span> from nih.gov</span><span class="gs_ggsS">nih.gov <span """
import re
for line in input.splitlines():
    # Skip lazily past the onmousedown attribute, then capture up to the closing </a>
    m = re.search(r'onmousedown=.*?">(.*)</a>', line)
    if m:
        print(m.group(1))
Prints:
Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast
How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience
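Bear in mind that using regular expressions on HTML can be a minefield (or a mind field!), and using a parser is usually recommended. But with a small snippet like this, you can make it work...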
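As a rough sketch of a stricter variant, the pattern below matches the `scife_clk(...)` call itself. It also shows why the pattern in the question fails: the parentheses and the dot need escaping to match literally, and `/D` matches the two characters `/D` where `\d` was presumably intended. The sample string is a shortened, hypothetical stand-in for the real input, and requiring `'res'` as the third argument is an assumption based on the two wanted outputs in the question.

```python
import re

# Hypothetical, shortened sample with the same shape as the input in the question.
sample = """\
href="http://example.com/a" onmousedown="return scife_clk(this.href,'','res','2')">Using <b>text mining </b>and sentiment analysis</a></h3><div class="gs_a">N Li, <a href="/citations?
href="http://example.com/b" onmousedown="return scife_clk(this.href,'ggp','res','1')">How to link <b>text</b>-<b>mining </b>approaches</a></h3><div class="gs_a"><a href="/citations?
href="http://example.com/b" onmousedown="return scife_clk(this.href,'gga','gga','1')"><span class="gs_ggsL">[HTML] from example.com</span></a>
"""

# \( \) and \. match literally; '[^']*' accepts '' as well as 'ggp';
# the third argument must be 'res' (an assumption), which skips the 'gga' link.
TITLE_RE = re.compile(r'''scife_clk\(this\.href,'[^']*','res','\d+'\)">(.*?)</a>''')

# findall scans the whole string, so no line splitting is needed.
titles = TITLE_RE.findall(sample)
for title in titles:
    print(title)
```

The lazy `(.*?)` stops at the first `</a>`, so each title ends where its own link closes.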
It would be much better to use a decent HTML parser. Take BeautifulSoup, for example:
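from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(input, "html.parser")
for link in soup.find_all('a', onmousedown=True):
    print(link.text)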
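It finds all <a> elements that have an onmousedown attribute. Note that link.text returns only the text, without the inner <b> tags.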
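If installing a third-party package is not an option, the same idea can be sketched with the standard library's `html.parser`. This is only an illustration on a hypothetical, well-formed sample; the class name `LinkTextCollector` and the sample markup are made up for the example, and like `link.text` it keeps the text but drops the inner `<b>` tags.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text of every <a> element that has an onmousedown attribute."""

    def __init__(self):
        super().__init__()
        self.titles = []    # finished link texts
        self._parts = None  # text pieces of the link being read, or None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a" and "onmousedown" in dict(attrs):
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._parts is not None:
            self.titles.append("".join(self._parts))
            self._parts = None


# Hypothetical, well-formed sample (not the truncated snippet from the question).
html = (
    '<h3><a href="http://example.com/a" onmousedown="return f(this.href)">'
    'Using <b>text mining </b>and sentiment analysis</a></h3>'
    '<div><a href="/citations">Cited by</a></div>'
)

collector = LinkTextCollector()
collector.feed(html)
collector.close()
print(collector.titles)
```

The second link has no `onmousedown` attribute, so its text is not collected.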