Converting HTML lists (<li>) to tabs (i.e. indentation)

Question

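I've worked in dozens of languages, but I'm new to Python.

This is my first (maybe second) question here, so be gentle...

I'm trying to efficiently convert HTML-like markup text to wiki format (specifically, Linux Tomboy/GNote notes to Zim) and have gotten stuck on converting lists.

For a 2-level unordered list like this...

- First level
  - Second level

Tomboy/GNote uses something like...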

<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>

However, the Zim personal wiki wants that to be...

* First level
  * Second level

...with leading tabs.

I've explored the regex module functions re.sub(), re.match(), re.search(), etc. and found the cool Python ability to code repeating text as...

 count * "text"

So it looks like there should be a way to do something like...

 newnote = re.sub("<list>", LEVEL * "\t", oldnote)

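where LEVEL is the ordinal (occurrence) of <list> in the note: 0 for the first <list> encountered, 1 for the second, and so on.

LEVEL would be decremented each time a </list> is encountered.

<list-item> tags are converted to the asterisk for the bullet (preceded by a newline as appropriate), and </list-item> tags are dropped.

Finally... the question...

How do I get the value of LEVEL and use it as a multiplier for the tabs?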

Answer 1

You should really use an XML parser to do this, but to answer your question:

import re

def next_tag(s, tag):
    i = -1
    while True:
        try:
            i = s.index(tag, i+1)
        except ValueError:
            return
        yield i

a = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"

a = a.replace("<list-item>", "* ")

for LEVEL, ind in enumerate(next_tag(a, "<list>")):
    a = re.sub("<list>", "\n" + LEVEL * "\t", a, 1)

a = a.replace("</list-item>", "")
a = a.replace("</list>", "")

print(a)
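To address the LEVEL question more directly: re.sub() also accepts a function as the replacement, so a counter can be incremented on <list> and decremented on </list>. This is only a minimal sketch in Python 3 (it relies on nonlocal). The name tomboy_lists_to_zim is made up for illustration, and the sketch assumes the markup contains nothing but well-formed <list> and <list-item> tags:

```python
import re

def tomboy_lists_to_zim(note):
    """Turn <list>/<list-item> markup into tab-indented '* ' bullets."""
    level = -1  # becomes 0 at the first <list>

    def replace_tag(match):
        nonlocal level
        tag = match.group(0)
        if tag == "<list>":
            level += 1          # one level deeper
            return ""
        if tag == "</list>":
            level -= 1          # back out one level
            return ""
        if tag == "<list-item>":
            # newline, then LEVEL tabs, then the bullet
            return "\n" + "\t" * level + "* "
        return ""               # </list-item> is simply dropped

    # One pattern covers all four tags; the callback decides what each becomes.
    return re.sub(r"</?list(?:-item)?>", replace_tag, note).lstrip("\n")

note = ("<list><list-item>First level<list><list-item>Second level"
        "</list-item></list></list-item></list>")
print(tomboy_lists_to_zim(note))
```

Like the loop above, this is fragile: any text that follows a nested </list> inside the same item would be glued onto the last nested bullet, so an XML parser is still the safer route.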

This will work for your example, and ONLY for your example. Use an XML parser. You can use xml.dom.minidom (it is included in Python, 2.7 at least, so there is nothing to download):

import xml.dom.minidom

def parseList(el, lvl=0):
    txt = ""
    indent = "\t" * (lvl)
    for item in el.childNodes:
        # These are the <list-item>s: They can have text and nested <list> tag
        for subitem in item.childNodes:
            if subitem.nodeType == xml.dom.minidom.Node.TEXT_NODE:
                # This is the text before the next <list> tag
                txt += "\n" + indent + "* " + subitem.nodeValue
            else:
                # This is the next list tag, its indent level is incremented
                txt += parseList(subitem, lvl=lvl+1)
    return txt

def parseXML(s):
    doc = xml.dom.minidom.parseString(s)
    return parseList(doc.firstChild)

a = "<list><list-item>First level<list><list-item>Second level</list-item><list-item>Second level 2<list><list-item>Third level</list-item></list></list-item></list></list-item></list>"
print(parseXML(a))

Output:

* First level
    * Second level
    * Second level 2
        * Third level
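The same recursion can be sketched with xml.etree.ElementTree, which also ships with Python. This is a swapped-in alternative rather than the code above. list_to_zim is a hypothetical helper name, and the sketch assumes each <list-item> holds its text before any nested <list>:

```python
import xml.etree.ElementTree as ET

def list_to_zim(list_el, level=0):
    """Return Zim bullet lines for a <list> element and its nested lists."""
    lines = []
    for item in list_el:
        if item.tag != "list-item":
            continue  # ignore anything that is not a list item
        # item.text is the text before the first child element (if any)
        text = (item.text or "").strip()
        lines.append("\t" * level + "* " + text)
        for child in item:
            if child.tag == "list":
                # a nested list is indented one tab further
                lines.extend(list_to_zim(child, level + 1))
    return lines

xml_text = ("<list><list-item>First level<list>"
            "<list-item>Second level</list-item>"
            "<list-item>Second level 2<list>"
            "<list-item>Third level</list-item>"
            "</list></list-item></list></list-item></list>")
root = ET.fromstring(xml_text)
print("\n".join(list_to_zim(root)))
```

Text that comes after a nested element (its tail, in ElementTree terms) is ignored here, so real notes with inline formatting inside an item would need extra handling.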

Answer 2

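Use Beautiful Soup. It lets you iterate over the tags even if they are custom ones, which makes this kind of operation very practical.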

from BeautifulSoup import BeautifulSoup
tags = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
soup = BeautifulSoup(tags)
print([[item.text for item in list_tag('list-item')] for list_tag in soup('list')])

Output : [[u'First level'], [u'Second level']]

I used a nested list comprehension, but you can use nested for loops:

for list_tag in soup('list'):
    for item in list_tag('list-item'):
        print(item.text)

I hope that helps.

In my example I used BeautifulSoup 3, but it should work with BeautifulSoup 4 as well, with only the import changed.

from bs4 import BeautifulSoup
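The loops above collect the item text but not the nesting depth that Zim needs. As a rough sketch of how depth could be tracked while walking the tags, here is a version built on html.parser from the standard library (Python 3) instead of Beautiful Soup, so it is a swapped-in technique and not the Beautiful Soup API. ListToZim and to_zim are hypothetical names, and the sketch assumes every <list> is properly closed:

```python
from html.parser import HTMLParser

class ListToZim(HTMLParser):
    """Collect one tab-indented '* ' line per <list-item>."""

    def __init__(self):
        super().__init__()
        self.level = -1   # becomes 0 inside the outermost <list>
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "list":
            self.level += 1
        elif tag == "list-item":
            # start a new bullet at the current depth
            self.lines.append("\t" * self.level + "* ")

    def handle_endtag(self, tag):
        if tag == "list":
            self.level -= 1

    def handle_data(self, data):
        # append text to the most recently opened bullet
        if self.lines and data.strip():
            self.lines[-1] += data.strip()

def to_zim(markup):
    parser = ListToZim()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.lines)

tags = ("<list><list-item>First level<list><list-item>Second level"
        "</list-item></list></list-item></list>")
print(to_zim(tags))
```

The depth counter plays the role of LEVEL from the question: it goes up on each opening <list> and down on each closing one.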


Answer 1: score 4, accepted, 2012-04-15 12:19:19

Answer 2: score 2, 2012-04-15 11:30:15
