Ignore regular HTML tags in regex

I need to find patterns in the text of an ugly HTML file. It's ugly because each character is wrapped in an absolutely positioned <span>, and each <span> is on its own line, like this:

<span style="position:absolute; color:black; left:422px; top:3497px; font-size:21.6px;">M</span>
<span style="position:absolute; color:black; left:440px; top:3497px; font-size:21.6px;">T</span>
<span style="position:absolute; color:black; left:452px; top:3497px; font-size:21.6px;">V</span>
<span style="position:absolute; color:black; left:464px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:470px; top:3497px; font-size:21.6px;">N</span>
<span style="position:absolute; color:black; left:484px; top:3497px; font-size:21.6px;">e</span>
<span style="position:absolute; color:black; left:493px; top:3497px; font-size:21.6px;">t</span>
<span style="position:absolute; color:black; left:499px; top:3497px; font-size:21.6px;">w</span>
<span style="position:absolute; color:black; left:513px; top:3497px; font-size:21.6px;">o</span>
<span style="position:absolute; color:black; left:523px; top:3497px; font-size:21.6px;">r</span>
<span style="position:absolute; color:black; left:531px; top:3497px; font-size:21.6px;">k</span>
<span style="position:absolute; color:black; left:541px; top:3497px; font-size:21.6px;">s</span>
<span style="position:absolute; color:black; left:549px; top:3497px; font-size:21.6px;">,</span>
<span style="position:absolute; color:black; left:554px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:559px; top:3497px; font-size:21.6px;">I</span>
<span style="position:absolute; color:black; left:566px; top:3497px; font-size:21.6px;">n</span>
<span style="position:absolute; color:black; left:577px; top:3497px; font-size:21.6px;">c</span>
<span style="position:absolute; color:black; left:586px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:592px; top:3497px; font-size:21.6px;">,</span>
<span style="position:absolute; color:black; left:597px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:602px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:613px; top:3497px; font-size:21.6px;">5</span>
<span style="position:absolute; color:black; left:623px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:634px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:639px; top:3497px; font-size:21.6px;">F</span>
<span style="position:absolute; color:black; left:650px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:656px; top:3497px; font-size:21.6px;">3</span>
<span style="position:absolute; color:black; left:666px; top:3497px; font-size:21.6px;">d</span>
<span style="position:absolute; color:black; left:677px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:682px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:693px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:703px; top:3497px; font-size:21.6px;">0</span>
<span style="position:absolute; color:black; left:714px; top:3497px; font-size:21.6px;">9</span>
<span style="position:absolute; color:black; left:724px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:729px; top:3497px; font-size:21.6px;">(</span>
<span style="position:absolute; color:black; left:736px; top:3497px; font-size:21.6px;">9</span>
<span style="position:absolute; color:black; left:747px; top:3496px; font-size:13.6px;">t</span>
<span style="position:absolute; color:black; left:751px; top:3496px; font-size:13.6px;">h</span>
<span style="position:absolute; color:black; left:757px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:763px; top:3497px; font-size:21.6px;">C</span>
<span style="position:absolute; color:black; left:777px; top:3497px; font-size:21.6px;">i</span>
<span style="position:absolute; color:black; left:782px; top:3497px; font-size:21.6px;">r</span>
<span style="position:absolute; color:black; left:789px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:795px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:800px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:810px; top:3497px; font-size:21.6px;">9</span>
<span style="position:absolute; color:black; left:821px; top:3497px; font-size:21.6px;">9</span>
<span style="position:absolute; color:black; left:831px; top:3497px; font-size:21.6px;">8</span>
<span style="position:absolute; color:black; left:842px; top:3497px; font-size:21.6px;">)</span>

This is the regex I would like to match (in Vim syntax): [0-9]\+ F\.3d [0-9]\+. So, in this example, I want to match 152 F.3d 1209. I want to wrap that in an <a> to end up with this:

<a href="http://www.stackoverflow.com/">
<span style="position:absolute; color:black; left:602px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:613px; top:3497px; font-size:21.6px;">5</span>
<span style="position:absolute; color:black; left:623px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:634px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:639px; top:3497px; font-size:21.6px;">F</span>
<span style="position:absolute; color:black; left:650px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:656px; top:3497px; font-size:21.6px;">3</span>
<span style="position:absolute; color:black; left:666px; top:3497px; font-size:21.6px;">d</span>
<span style="position:absolute; color:black; left:677px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:682px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:693px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:703px; top:3497px; font-size:21.6px;">0</span>
<span style="position:absolute; color:black; left:714px; top:3497px; font-size:21.6px;">9</span>
</a>

I could write a verbose regex to ignore every HTML tag, but that quickly becomes unworkable (for instance, it would be hard to match [0-9]\+ if there is an HTML tag before and after each digit).

I could strip out the HTML using something like %s/<.*>\(.*\)<.*>/\1/g, but that doesn't work either, because I need to preserve the formatting.

I get that I can't parse HTML with a regex. But I don't need to parse arbitrary HTML; I just need to work around a known set of tags. Is there an elegant way to do this? Or should I abandon regexes and use something like an XPath parser?

I'm open to any language, but I'd prefer to work with Python, JavaScript, or Vim.

Well, I would extract the text nodes into a plain string, match against it, then go back to the DOM tree to retrieve the original HTML. Something like this:

import lxml.html, lxml.etree
import re

with open('foo.html') as f:
    source = lxml.html.parse(f)

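# One <span> per character, in document order.
letters = source.findall('.//span')
# Index i of this string corresponds to letters[i].
string = ''.join(s.text for s in letters)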

match = re.search(r'[0-9]+ F\.3d [0-9]+', string)
assert match is not None

start, end = match.span()
html = '\n'.join(lxml.etree.tostring(el).decode('utf8')
                 for el in letters[start:end])

print('<a href="foo">{}</a>'.format(html))

Please note that calling tostring() in a loop may not be the best choice for performance. You could instead build the a element, append the letters to it, and call tostring() on the a element once.
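
A minimal sketch of that suggestion follows. It uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra packages (that swap is my assumption, not part of the answer above). The name wrap_match and its parameters are hypothetical, and the sketch assumes the fragment is well-formed XML in which every <span> holds exactly one character.

```python
import re
import xml.etree.ElementTree as ET


def wrap_match(fragment, pattern, href):
    """Return the first matched run of spans serialized inside one <a> element."""
    # Wrap the fragment in a dummy root so it parses as a single document.
    root = ET.fromstring('<div>{}</div>'.format(fragment))
    letters = root.findall('.//span')
    # Assumption: one character per span, so string index i maps to letters[i].
    if any(len(s.text or '') != 1 for s in letters):
        raise ValueError('expected exactly one character per <span>')
    string = ''.join(s.text for s in letters)

    match = re.search(pattern, string)
    if match is None:
        return None

    start, end = match.span()
    anchor = ET.Element('a', {'href': href})
    anchor.text = '\n'
    anchor.extend(letters[start:end])
    # Serialize once, on the <a> element, instead of once per span.
    return ET.tostring(anchor, encoding='unicode')
```

With ElementTree, extend() does not detach the spans from their original parent, so this only produces the wrapped snippet. With lxml, appending an element moves it, which would matter if you wanted to write the modified tree back out.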

This code is missing a lot of error handling and relies on a strict input format, but consider:

import re
import os

html = '''<span style="position:absolute; color:black; left:422px; top:3497px; font-size:21.6px;">M</span>
<span style="position:absolute; color:black; left:440px; top:3497px; font-size:21.6px;">T</span>
<span style="position:absolute; color:black; left:452px; top:3497px; font-size:21.6px;">V</span>
... (Lines omitted)
<span style="position:absolute; color:black; left:842px; top:3497px; font-size:21.6px;">)</span>
'''

# This is sloppy, but if your input format remains the same should work...
chars = ''.join([line[line.find('>') + 1] for line in html.splitlines()])
# chars => "MTV Networks, Inc., 152 F.3d 1209 (9th Cir. 1998)"

# Use regex to search chars
mat = re.search(r'\d+ F\.3d \d+', chars)

# Extract lines from html based on the start and end positions of the regex match
block = html.splitlines()[mat.start():mat.end()]

# Wrap the lines with your anchor tag
block = ['<a href="http://www.stackoverflow.com/">'] + block + ['</a>']

# Print the list
print(os.linesep.join(block))

It first extracts the single characters inside the <span> tags and puts them in a string. It then searches that string for your regex (modified for Python's re module).

Since the position of each character in the chars string corresponds exactly to the line number of the corresponding line in html, we can use the start and end positions of the match in chars to select the lines of html we want to wrap.

We insert elements at the beginning and end of the block list, corresponding to your anchor tags, and print it.

As long as your input remains exactly as you specify, there's no need to invoke a DOM parser or anything very complex, although it may turn out that something like that is needed.
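
The line-index trick above can also be wrapped in a function that validates the input format and handles every match rather than only the first. This is only a sketch under stated assumptions: wrap_matches and its parameters are hypothetical names, and each line is assumed to be exactly one <span> holding one literal character (no entities such as &amp;).

```python
import re


def wrap_matches(html, pattern, href):
    """Wrap every run of lines whose characters match `pattern` in an <a> tag."""
    lines = html.splitlines()
    chars = []
    for line in lines:
        # Assumption: one <span> with a single character per line.
        m = re.fullmatch(r'\s*<span\b[^>]*>(.)</span>\s*', line)
        if m is None:
            raise ValueError('unexpected line format: {!r}'.format(line))
        chars.append(m.group(1))
    text = ''.join(chars)  # text[i] is the character found on lines[i]

    out = []
    pos = 0
    for m in re.finditer(pattern, text):
        out.extend(lines[pos:m.start()])          # untouched lines before the match
        out.append('<a href="{}">'.format(href))
        out.extend(lines[m.start():m.end()])      # the lines that spell the match
        out.append('</a>')
        pos = m.end()
    out.extend(lines[pos:])                       # untouched lines after the last match
    return '\n'.join(out)
```

Raising on an unexpected line keeps the character index and the line index from silently drifting apart, which is the one thing this approach cannot tolerate.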

Here's a solution using awk:

$ cat mornin.awk
NR == FNR {
    gsub("</?span[^<]*>","",$0)
    s = s $0
    next
}

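FNR == 1 {
    # Regex literal, so that \. matches a literal dot.
    i = match(s, /[0-9]+ F\.3d [0-9]+/)
    len = RLENGTH
    print "<a href=\"http://www.stackoverflow.com/\">"
}

# Print the lines whose numbers fall inside the match.
FNR == i, FNR == (i + len - 1)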

END {
    print "</a>"
}

This solution requires two passes over the text, so you put the file on the command line twice:

$ awk -f mornin.awk mornin.txt mornin.txt
<a href="http://www.stackoverflow.com/">
<span style="position:absolute; color:black; left:602px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:613px; top:3497px; font-size:21.6px;">5</span>
<span style="position:absolute; color:black; left:623px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:634px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:639px; top:3497px; font-size:21.6px;">F</span>
<span style="position:absolute; color:black; left:650px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:656px; top:3497px; font-size:21.6px;">3</span>
<span style="position:absolute; color:black; left:666px; top:3497px; font-size:21.6px;">d</span>
<span style="position:absolute; color:black; left:677px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:682px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:693px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:703px; top:3497px; font-size:21.6px;">0</span>
<span style="position:absolute; color:black; left:714px; top:3497px; font-size:21.6px;">9</span>
</a>
