Tab-formatted nested string to nested list ~ Python

Question

Hello guys, after managing to get some data by scraping with Beautiful Soup, I want to format that data so that I can easily export it to CSV and JSON.

My question is how to translate this:

Heading :
    Subheading :

AnotherHeading : 
    AnotherSubheading :
        Somedata

Heading :
    Subheading :

AnotherHeading : 
    AnotherSubheading :
        Somedata

Into this:

[
['Heading',['Subheading']],
['AnotherHeading',['AnotherSubheading',['Somedata']]],
['Heading',['Subheading']],
['AnotherHeading',['AnotherSubheading',['Somedata']]]
]

Indented for clarity.
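Since the stated goal is exporting to JSON and CSV, here is a minimal sketch of what that could look like once the target structure exists. It uses only the standard library. The helper names `flatten`, `to_json`, and `to_csv` are hypothetical, and the CSV layout (one row per top-level heading, one column per nesting level) is an assumption, since the question does not say how the rows should look. `flatten` also assumes each level holds a single child chain, as in the sample.

```python
import csv
import io
import json

# The target structure from the question, written out literally.
nested = [
    ['Heading', ['Subheading']],
    ['AnotherHeading', ['AnotherSubheading', ['Somedata']]],
]

def flatten(entry):
    """Flatten one [name, [child, [grandchild]]] chain into a flat row."""
    row = []
    while entry:
        row.append(entry[0])
        # Descend into the nested list if there is one, otherwise stop.
        entry = entry[1] if len(entry) > 1 else None
    return row

def to_json(data):
    """Nested lists map directly onto JSON arrays."""
    return json.dumps(data)

def to_csv(data):
    """Write one CSV row per top-level entry (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    for entry in data:
        writer.writerow(flatten(entry))
    return buffer.getvalue()

print(to_json(nested))
print(to_csv(nested))
```

The nested list needs no conversion for JSON. CSV is flat, so some flattening rule has to be chosen; the one above is only one option.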

Any rescue attempt would be met with a warm thank you!

So far, with help, we got:

def parse(data):
  stack = [[]]
  levels = [0]
  current = stack[0]
  for line in data.splitlines():
    indent = len(line)-len(line.lstrip())
    if indent > levels[-1]:
      levels.append(indent)
      stack.append([])
      current.append(stack[-1])
      current = stack[-1]
    elif indent < levels[-1]:
      stack.pop()
      current = stack[-1]
      levels.pop()
    current.append(line.strip().rstrip(':'))
  return stack

The problem with that code is that it returns:

[
'Heading ', 
['Subheading '], 
'AnotherHeading ', 
['AnotherSubheading ', ['Somedata'], 'Heading ', 'Subheading '], 'AnotherHeading ', 
['AnotherSubheading ', ['Somedata']]
]

Here is a repl: https://repl.it/yvM/1

Answer 1

Thank you both, kirbyfan64sos and SuperBiasedMan.

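def parse(data):
    currentTab = 0
    currentList = []
    result = [currentList]

    for line in data.splitlines():
        # Skip blank lines so they do not create empty entries.
        if not line.strip():
            continue

        # Leading whitespace characters (one per level with tab indentation).
        tabCount = len(line) - len(line.lstrip())
        line = line.strip().rstrip(' :')

        if tabCount == currentTab:
            # Same level: append to the current list.
            currentList.append(line)
        elif tabCount > currentTab:
            # Deeper level: nest a new list in the current one.
            newList = [line]
            currentList.append(newList)
            currentList = newList
        elif tabCount == 0:
            # Back to the top level: start a new top-level list.
            currentList = [line]
            result.append(currentList)
        elif tabCount == 1:
            # Back to one tab deep: nest under the last top-level list.
            currentList = [line]
            result[-1].append(currentList)

        currentTab = tabCount

    print(result)
    return result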

Answer 2

First, you want to clear out unnecessary whitespace, so make a list of all the lines that contain something other than whitespace, and set up the defaults that the main loop starts from.

teststring = [line for line in teststring.split('\n') if line.strip()]
currentTab = 0
currentList = []
result = [currentList]

This method relies on the mutability of lists, so setting currentList to an empty list and then setting result to [currentList] is an important step: anything we append to currentList from now on also shows up inside result.
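To make the mutability point concrete, here is a small standalone demonstration. It is not part of the answer's loop, only an illustration of the aliasing it depends on, and the sample strings are taken from the question's data.

```python
# result holds a reference to the same list object, not a copy of it.
currentList = []
result = [currentList]

currentList.append('Heading')      # visible through result as well

newList = ['Subheading']
currentList.append(newList)        # nest a new list one level deeper
currentList = newList              # rebinding the name leaves result intact

currentList.append('Somedata')     # lands inside the nested list

print(result)  # [['Heading', ['Subheading', 'Somedata']]]
```

Rebinding currentList only changes which list the name points to, which is why the loop can keep descending without losing what has already been built.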

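for line in teststring:
    i, tabCount = 0, 0

    # Count leading spaces; 8 spaces stand in for one tab.
    while line[i] == ' ':
        tabCount += 1
        i += 1
    tabCount //= 8  # integer division, so the level stays an int on Python 3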

This is the best way I could think of to check for tab characters at the start of each line. Also, yes, you'll notice I actually checked for spaces, not tabs. Tabs just 100% didn't work; I think it was because I was using repl.it, since I don't have Python 3 installed. It works perfectly fine on Python 2.7, but I won't post code I haven't verified works. I can edit this if you confirm that checking for \t and removing the division by 8 produces the desired results.

Now, check how indented the line is. If it's the same as our currentTab value, then just append to currentList.
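For reference, the indent level can also be measured without an explicit character loop. The sketch below is an alternative, not the code this answer was tested with. The helper names are hypothetical, and the 8-spaces-per-level default simply mirrors the division by 8 above.

```python
def count_leading_tabs(line):
    """Return how many tab characters the line starts with."""
    return len(line) - len(line.lstrip('\t'))

def count_indent_levels(line, width=8):
    """Return the indent depth of a space-indented line.

    Assumes `width` spaces per level. Floor division keeps the result
    an int on Python 3, where `/` would produce a float.
    """
    spaces = len(line) - len(line.lstrip(' '))
    return spaces // width

print(count_leading_tabs('\t\tSomedata'))
print(count_indent_levels(' ' * 8 + 'Subheading :'))
```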

    if tabCount == currentTab:
        currentList.append(line.strip())

If it's higher, that means we've gone to a deeper list level. We need a new list nested in currentList.

    elif tabCount > currentTab:
        newList = [line.strip()]
        currentList.append(newList)
        currentList = newList

Going backwards is trickier. Since the data only contains 3 nesting levels, I opted to hardcode what to do with the values 0 and 1 (2 should always be handled by one of the blocks above). If there are no tabs, we can append a new list to result.

    elif tabCount == 0:
        currentList = [line.strip()]
        result.append(currentList)

It's mostly the same for a heading one tab deep, except that you should append to result[-1], as that's the last main heading to nest into.

    elif tabCount == 1:
        currentList = [line.strip()]
        result[-1].append(currentList)

Lastly, make sure currentTab is updated to the current tabCount so the next iteration behaves properly.

    currentTab = tabCount
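The walkthrough above hardcodes levels 0 and 1 when stepping back out. As a hedged alternative, the sketch below swaps in a different technique: a stack of (indent, list) pairs, which handles any depth. It assumes indentation is consistent (all tabs or all spaces) and represents every entry as [name, child, child, ...], which reproduces the output requested in the question for the sample data.

```python
def parse(data):
    """Parse indented text into nested [name, child, ...] lists."""
    result = []
    # Each stack entry pairs an indent width with the list that collects
    # deeper-indented children. The sentinel -1 stands for the root.
    stack = [(-1, result)]
    for line in data.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        indent = len(line) - len(line.lstrip())
        # Close every open entry that cannot be a parent of this line.
        while stack[-1][0] >= indent:
            stack.pop()
        node = [line.strip().rstrip(' :')]
        stack[-1][1].append(node)   # attach to the nearest shallower entry
        stack.append((indent, node))
    return result

sample = (
    "Heading :\n"
    "    Subheading :\n"
    "\n"
    "AnotherHeading : \n"
    "    AnotherSubheading :\n"
    "        Somedata\n"
)
print(parse(sample))
# [['Heading', ['Subheading']], ['AnotherHeading', ['AnotherSubheading', ['Somedata']]]]
```

Because the stack is popped in a loop, a line can step back several levels at once, which is the case the hardcoded branches cover only for depths 0 and 1.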

Answer 3

Something like:

def parse(data):
    stack = [[]]
    levels = [0]
    current = stack[0]
    for line in data.splitlines():
        if not line.strip():
            continue  # skip blank lines
        indent = len(line) - len(line.lstrip())
        if indent > levels[-1]:
            # Deeper indent: open a new nested list.
            levels.append(indent)
            stack.append([])
            current.append(stack[-1])
            current = stack[-1]
        else:
            # Shallower indent: close as many levels as needed, not just one.
            while indent < levels[-1]:
                stack.pop()
                levels.pop()
            current = stack[-1]
        current.append(line.strip().rstrip(' :'))
    return stack[0]

Your format looks a lot like YAML, though; you may want to look into PyYAML.

3 answers

Answer 1: score 1, 2015-07-28 16:23:33

Answer 2: score 0, accepted, 2015-07-28 15:42:28

Answer 3: score -2, 2015-07-22 01:44:23
