简体   繁体   中英

Python — Regex — How to find a string between two sets of strings

Consider the following:

<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>

How would you go about taking out the sitemap line with regex in python ?

<a href="/sitemap">Sitemap</a>

The following can be used to pull out the anchor tags.

'/<a(.*?)a>/i'

However, there are multiple anchor tags. Also there are multiple hotlink(s) so we can't really use them either?

Don't use a regex. Use BeautfulSoup , an HTML parser.

from BeautifulSoup import BeautifulSoup

html = \
"""
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>"""

soup = BeautifulSoup(html)
soup.findAll("div",id="hotlink")[2].a

# <a href="/sitemap">Sitemap</a>

Parsing HTML with regular expression is a bad idea!

Think about the following piece of html

<a></a > <!-- legal html, but won't pass your regex -->

<a href="/sitemap">Sitemap<!-- proof that a>b iff ab>1 --></a>

There are many more such examples. Regular expressions are good for many things, but not for parsing HTML.

You should consider using Beautiful Soup python HTML parser.

Anyhow, a ad-hoc solution using regex is

import re

data = """
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>
"""

e = re.compile('<a *[^>]*>.*</a *>')

print e.findall(data)

Output:

>>> e.findall(data)
['<a href="foo1.com">Foo1</a>', '<a href="/">Home</a>', '<a href="/extract">Extract</a>', '<a href="/sitemap">Sitemap</a>']

In order to extract the contents of the tagline:

    <a href="/sitemap">Sitemap</a>

... I would use:

    >>> import re
    >>> s = '''
    <div id=hotlinklist>
    <a href="foo1.com">Foo1</a>
      <div id=hotlink>
        <a href="/">Home</a>
      </div>
      <div id=hotlink>
        <a href="/extract">Extract</a>
      </div>
      <div id=hotlink>
        <a href="/sitemap">Sitemap</a>
      </div>
    </div>'''
    >>> m = re.compile(r'<a href="/sitemap">(.*?)</a>').search(s)
    >>> m.group(1)
    'Sitemap'

Use BeautifulSoup or lxml if you need to parse HTML.

Also, what is it that you really need to do? Find the last link? Find the third link? Find the link that points to /sitemap? It's unclear from you question. What do you need to do with the data?

If you really have to use regular expressions, have a look at findall .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM