
Python Regex - Identifying the first and last items in a list

I need to transform some text files into HTML code. I'm stuck on transforming a list into an HTML unordered list. Example source:

some text in the document
* item 1
* item 2
* item 3
some other text

The output should be:

some text in the document
<ul>
    <li>item 1</li>
    <li>item 2</li>
    <li>item 3</li>
</ul>
some other text

Currently, I have this:

import re

r = re.compile(r'\*(.*)\n')
r.sub(r'<li>\1</li>', the_text_document)

which creates an HTML list without <ul> tags.
How can I identify the first and last items and surround them with <ul> tags?

You could just process your data line by line. The quick and dirty solution below could probably be tidied up, but for your data it does the trick.

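with open('data.txt') as inf:
    in_list = False  # True while a <ul> is open
    for line in inf:
        line = line.strip()

        if not line.startswith('*'):
            if in_list:
                print('</ul>')
                in_list = False  # reset so a later list gets its own <ul>
            print(line)
        else:
            if not in_list:
                print('<ul>')
                in_list = True
            # drop the leading '*' and any surrounding whitespace
            print('  <li>%s</li>' % line[1:].strip())

    if in_list:
        print('</ul>')  # close a list that runs to the end of the file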

yields:

some text in the document
<ul>
  <li>item 1</li>
  <li>item 2</li>
  <li>item 3</li>
</ul>
some other text

Depending on how complex your data is (nested lists, items spanning several lines, and so on), this may require modification. You may want to look for a more general solution or adapt this starter code to your needs; only you can decide.
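
One possible generalisation, sketched here as an assumption rather than as part of the original answer, is to group consecutive lines by whether they start with * and wrap each such group in its own <ul>. The function name lists_to_html and the sample text are made up for illustration.

```python
from itertools import groupby


def lists_to_html(lines):
    """Yield HTML lines, wrapping each run of '*' lines in <ul>...</ul>."""
    stripped = (line.strip() for line in lines)
    # groupby yields one group per run of consecutive lines with the same key
    for is_item, group in groupby(stripped, key=lambda s: s.startswith('*')):
        if is_item:
            yield '<ul>'
            for item in group:
                # drop the leading '*' and surrounding whitespace
                yield '  <li>%s</li>' % item[1:].strip()
            yield '</ul>'
        else:
            yield from group


sample = """some text in the document
* item 1
* item 2
some other text
* item A
""".splitlines()

print('\n'.join(lists_to_html(sample)))
```

Because every run of * lines is handled as a unit, several separate lists in one document, or a list at the very end of the file, each get their own opening and closing tags.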

Update :

Edited the <li> .. </li> print line to strip the * that was previously left in.

Or use BeautifulSoup

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

edit

Apparently I have to give you some hints on how to read the documentation.

  • Open the link
  • On the left there is a big menu (teal color)
  • If you look carefully, you will notice that the documentation is divided into multiple sections
    • Stuff
    • Navigation in the tree
    • Searching the tree
    • Modifying the tree (got it)
    • Output (got it!)

And many more things

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Don't stop reading after the first sentence... The last one is pretty important, and so is what's in the middle.

In other words, you can create an empty document. Let's say:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div></div>", "html.parser")
document = soup.div

Then you read each line of your text, and whenever the line is plain text you do this:

document.append(line)

If the line starts with a *:

ul = soup.new_tag('ul')  # new_tag is a method of the BeautifulSoup object, not of a Tag
document.append(ul)
document = ul

Then push all the li elements onto the document. Once you stop reading lines that start with *, just pop back to the parent so that document points at the div again. Keep doing that for the rest of the text. You can even do it recursively to insert ul elements into other ul elements.

Once you have parsed everything, you can do

str(document)

or

document.prettify()

Edit

I just realized that you weren't editing HTML but unformatted text. You could try using Markdown then.

http://daringfireball.net/projects/markdown/

After playing with some ideas, I've decided to go with a second regex. So basically, after running the first regex (from my original post, that creates the <li> tags), I run:

r = re.compile(r'(<li>.*?</li>\n(?!\s*<li>))', re.DOTALL)
r.sub('<ul>\\1</ul>', string_with_li_tags)

This finds the first <li> tag and the last </li>\n combination that is not followed by another <li> tag (which essentially means the entire list) and wraps the match in <ul> tags.

EDIT: I modified the regex a bit so it won't be greedy. This way it can handle multiple lists in the same document. Only requirement is that there are no spaces between list items, as @Aprillion mentioned below

EDIT 2: Modified the negative lookahead to treat spaces between list items as well, so all cases are covered
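
For reference, the two passes can be put together in one small function. This is a sketch under a few assumptions, not the exact code from the post: the function name bullets_to_ul is made up, the first pass uses re.MULTILINE so that each newline is kept after </li> (the second regex relies on it), and the replacement puts <ul> and </ul> on their own lines.

```python
import re


def bullets_to_ul(text):
    # Pass 1: turn each "* item" line into an <li> line; the newline is kept
    # because $ matches just before it in MULTILINE mode.
    with_li = re.sub(r'^\*[ \t]*(.*)$', r'<li>\1</li>', text, flags=re.MULTILINE)
    # Pass 2: lazily match from the first <li> up to the first "</li>\n" that is
    # not followed by another <li>, and wrap that whole run in <ul> tags.
    return re.sub(r'(<li>.*?</li>\n(?!\s*<li>))', r'<ul>\n\1</ul>\n',
                  with_li, flags=re.DOTALL)


sample = "intro\n* a\n* b\nmiddle\n* c\nend\n"
print(bullets_to_ul(sample))
```

Note that the second pass only matches when the last item of a list is followed by a newline, so a list on the final line of a file without a trailing newline would be left unwrapped.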

