
Ignore regular HTML tags in regex

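I need to find patterns in the text of an ugly HTML file. It is ugly because every character is wrapped in an absolutely positioned <span>, and each <span> sits on its own line, like this: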

<span style="position:absolute; color:black; left:422px; top:3497px; font-size:21.6px;">M</span>
<span style="position:absolute; color:black; left:440px; top:3497px; font-size:21.6px;">T</span>
<span style="position:absolute; color:black; left:452px; top:3497px; font-size:21.6px;">V</span>
<span style="position:absolute; color:black; left:464px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:470px; top:3497px; font-size:21.6px;">N</span>
<span style="position:absolute; color:black; left:484px; top:3497px; font-size:21.6px;">e</span>
<span style="position:absolute; color:black; left:493px; top:3497px; font-size:21.6px;">t</span>
<span style="position:absolute; color:black; left:499px; top:3497px; font-size:21.6px;">w</span>
<span style="position:absolute; color:black; left:513px; top:3497px; font-size:21.6px;">o</span>
<span style="position:absolute; color:black; left:523px; top:3497px; font-size:21.6px;">r</span>
<span style="position:absolute; color:black; left:531px; top:3497px; font-size:21.6px;">k</span>
<span style="position:absolute; color:black; left:541px; top:3497px; font-size:21.6px;">s</span>
<span style="position:absolute; color:black; left:549px; top:3497px; font-size:21.6px;">,</span>
<span style="position:absolute; color:black; left:554px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:559px; top:3497px; font-size:21.6px;">I</span>
<span style="position:absolute; color:black; left:566px; top:3497px; font-size:21.6px;">n</span>
<span style="position:absolute; color:black; left:577px; top:3497px; font-size:21.6px;">c</span>
<span style="position:absolute; color:black; left:586px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:592px; top:3497px; font-size:21.6px;">,</span>
<span style="position:absolute; color:black; left:597px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:602px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:613px; top:3497px; font-size:21.6px;">5</span>
<span style="position:absolute; color:black; left:623px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:634px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:639px; top:3497px; font-size:21.6px;">F</span>
<span style="position:absolute; color:black; left:650px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:656px; top:3497px; font-size:21.6px;">3</span>
<span style="position:absolute; color:black; left:666px; top:3497px; font-size:21.6px;">d</span>
<span style="position:absolute; color:black; left:677px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:682px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:693px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:703px; top:3497px; font-size:21.6px;">0</span>
<span style="position:absolute; color:black; left:714px; top:3497px; font-size:21.6px;">9</span>
<span style="position:absolute; color:black; left:724px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:729px; top:3497px; font-size:21.6px;">(</span>
<span style="position:absolute; color:black; left:736px; top:3497px; font-size:21.6px;">9</span>
<span style="position:absolute; color:black; left:747px; top:3496px; font-size:13.6px;">t</span>
<span style="position:absolute; color:black; left:751px; top:3496px; font-size:13.6px;">h</span>
<span style="position:absolute; color:black; left:757px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:763px; top:3497px; font-size:21.6px;">C</span>
<span style="position:absolute; color:black; left:777px; top:3497px; font-size:21.6px;">i</span>
<span style="position:absolute; color:black; left:782px; top:3497px; font-size:21.6px;">r</span>
<span style="position:absolute; color:black; left:789px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:795px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:800px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:810px; top:3497px; font-size:21.6px;">9</span>
<span style="position:absolute; color:black; left:821px; top:3497px; font-size:21.6px;">9</span>
<span style="position:absolute; color:black; left:831px; top:3497px; font-size:21.6px;">8</span>
<span style="position:absolute; color:black; left:842px; top:3497px; font-size:21.6px;">)</span>

The regex I want to match, in Vim syntax, is [0-9]\+ F\.3d [0-9]\+ so in this example it should match 152 F.3d 1209. I want to wrap the match in an <a>, ending up with this:

<a href="http://www.stackoverflow.com/">
<span style="position:absolute; color:black; left:602px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:613px; top:3497px; font-size:21.6px;">5</span>
<span style="position:absolute; color:black; left:623px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:634px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:639px; top:3497px; font-size:21.6px;">F</span>
<span style="position:absolute; color:black; left:650px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:656px; top:3497px; font-size:21.6px;">3</span>
<span style="position:absolute; color:black; left:666px; top:3497px; font-size:21.6px;">d</span>
<span style="position:absolute; color:black; left:677px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:682px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:693px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:703px; top:3497px; font-size:21.6px;">0</span>
<span style="position:absolute; color:black; left:714px; top:3497px; font-size:21.6px;">9</span>
</a>

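I could write a verbose regex that skips over every HTML tag, but that quickly becomes unworkable (for example, [0-9]\+ is hard to match when every digit has an HTML tag before and after it).

I could strip the HTML with something like %s/<.*>\(.*\)<.*>/\1/g but that does not work either, because I need to preserve the formatting.

I know I can't parse HTML with regex, but I don't need to parse arbitrary HTML; I only need to deal with a known set of tags. Is there an elegant way to do this? Or should I give up on regex and use something like an XPath parser?

I can use any language, but I would prefer Python, JavaScript, or Vim.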

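Well, I would extract the text nodes into a plain string, run the match on that, and then go back to the DOM tree to retrieve the original HTML. Something like this: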

import re

import lxml.etree
import lxml.html

with open('foo.html') as f:
    source = lxml.html.parse(f)

# One <span> per character, so index i in `string` is letters[i]
letters = source.findall('.//span')
string = ''.join(s.text for s in letters)

match = re.search(r'[0-9]+ F\.3d [0-9]+', string)
assert match is not None

start, end = match.span()
# with_tail=False drops the newline that follows each span in the source
html = '\n'.join(lxml.etree.tostring(el, with_tail=False).decode('utf8')
                 for el in letters[start:end])

print('<a href="foo">{}</a>'.format(html))

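Note that calling tostring() in a loop is probably not optimal for performance. You should instead build the a element, append the letters to it, and call tostring() once on the a element.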
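The "build the a element once" idea could look roughly like the sketch below. To stay dependency-free it swaps in the standard library's xml.etree.ElementTree instead of lxml, which assumes the fragment is well-formed XML once wrapped in a dummy root. The function name wrap_match and its parameters are hypothetical, chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET


def wrap_match(fragment, pattern, href):
    """Return the first match of `pattern` as an <a> element string, or None.

    Assumes `fragment` is a run of single-character <span> elements that
    parses as XML when wrapped in a dummy root (an assumption of this sketch).
    """
    root = ET.fromstring('<root>{}</root>'.format(fragment))
    spans = root.findall('span')
    text = ''.join(s.text or '' for s in spans)
    # The index mapping only holds if every span carries exactly one character.
    if len(text) != len(spans):
        raise ValueError('expected exactly one character per <span>')

    match = re.search(pattern, text)
    if match is None:
        return None

    start, end = match.span()
    anchor = ET.Element('a', {'href': href})
    anchor.text = '\n'
    for span in spans[start:end]:
        anchor.append(span)
    # Serialize once, on the finished <a> element.
    return ET.tostring(anchor, encoding='unicode')
```

With lxml the shape would be the same, with the difference that appending an lxml element to a new parent moves it out of its old one.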

This code is missing a lot of error handling and relies on a strict input format, but consider:

import re
import os

html = '''<span style="position:absolute; color:black; left:422px; top:3497px; font-size:21.6px;">M</span>
<span style="position:absolute; color:black; left:440px; top:3497px; font-size:21.6px;">T</span>
<span style="position:absolute; color:black; left:452px; top:3497px; font-size:21.6px;">V</span>
... (Lines omitted)
<span style="position:absolute; color:black; left:842px; top:3497px; font-size:21.6px;">)</span>
'''

# This is sloppy, but it should work as long as your input format stays the same.
chars = ''.join([line[line.find('>') + 1] for line in html.splitlines()])
# chars => "MTV Networks, Inc., 152 F.3d 1209 (9th Cir. 1998)"

# Use regex to search chars
mat = re.search(r'\d+ F\.3d \d+', chars)

# Extract lines from html based on the start and end positions of the regex match
block = html.splitlines()[mat.start():mat.end()]

# Wrap the lines with your anchor tag
block = ['<a href="http://www.stackoverflow.com/">'] + block + ['</a>']

# Print the list
print(os.linesep.join(block))

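It first extracts the single character inside each <span> tag and joins them into a string. It then searches that string with your regex (adapted for Python's re module).

Because a character's position in the chars string corresponds exactly to the index of its line in html, we can use the start and end positions of the match in chars to select the html lines to wrap.

We then insert the opening and closing anchor tags at the beginning and end of the block list and print it.

As long as your input is exactly as specified, there is no need to bring in a DOM parser or anything very complicated, though you may well find that you end up needing something like that.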
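If the strict-format assumption is a worry, the same line-index trick can be made to fail loudly instead of silently misaligning. The following is only a sketch: wrap_matches, SPAN_LINE, and the href parameter are hypothetical names, and it assumes every line is one <span> holding exactly one literal character (an entity such as &amp; would be rejected). It also wraps every match rather than just the first.

```python
import re

# One line = one <span> with exactly one character of text (assumed format).
SPAN_LINE = re.compile(r'<span\b[^>]*>(.)</span>\s*$')


def wrap_matches(html, pattern, href):
    """Wrap every run of lines whose characters match `pattern` in an <a>."""
    lines = html.splitlines()
    chars = []
    for number, line in enumerate(lines, start=1):
        found = SPAN_LINE.match(line)
        if found is None:
            # Refuse to continue: a stray line would shift every index after it.
            raise ValueError(
                'line {} is not a single-character <span>'.format(number))
        chars.append(found.group(1))
    text = ''.join(chars)

    out = []
    position = 0
    for match in re.finditer(pattern, text):
        out.extend(lines[position:match.start()])   # lines before the match
        out.append('<a href="{}">'.format(href))
        out.extend(lines[match.start():match.end()])
        out.append('</a>')
        position = match.end()
    out.extend(lines[position:])                    # lines after the last match
    return '\n'.join(out)
```

Calling wrap_matches(html, r'\d+ F\.3d \d+', 'http://www.stackoverflow.com/') on the sample input would return the whole document with the citation lines enclosed in the anchor.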

Here is a solution using awk:

$ cat mornin.awk
NR == FNR {
    gsub("</?span[^<]*>","",$0)
    s = s $0
    next
}

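FNR == 1 {
    # A regex literal avoids the string-escape pitfalls of "\."
    i = match(s, /[0-9]+ F\.3d [0-9]+/)
    len = RLENGTH
    print "<a href=\"http://www.stackoverflow.com/\">"
}

# Print only the lines whose numbers fall inside the match
FNR == i, FNR == (i + len - 1)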

END {
    print "</a>"
}

This solution needs two passes over the text, so you give the file twice on the command line:

$ awk -f mornin.awk mornin.txt mornin.txt
<a href="http://www.stackoverflow.com/">
<span style="position:absolute; color:black; left:602px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:613px; top:3497px; font-size:21.6px;">5</span>
<span style="position:absolute; color:black; left:623px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:634px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:639px; top:3497px; font-size:21.6px;">F</span>
<span style="position:absolute; color:black; left:650px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:656px; top:3497px; font-size:21.6px;">3</span>
<span style="position:absolute; color:black; left:666px; top:3497px; font-size:21.6px;">d</span>
<span style="position:absolute; color:black; left:677px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:682px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:693px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:703px; top:3497px; font-size:21.6px;">0</span>
<span style="position:absolute; color:black; left:714px; top:3497px; font-size:21.6px;">9</span>
</a>
