Tab Formatted Nested String to Nested List ~ Python
Hello guys, after managing to get some data by scraping with Beautiful Soup, I want to format that data so that I can easily export it to CSV and JSON.
My question is: how can one translate this:
Heading :
	Subheading :
AnotherHeading :
	AnotherSubheading :
		Somedata
Heading :
	Subheading :
AnotherHeading :
	AnotherSubheading :
		Somedata
Into this:
[
    ['Heading',['Subheading']],
    ['AnotherHeading',['AnotherSubheading',['Somedata']]],
    ['Heading',['Subheading']],
    ['AnotherHeading',['AnotherSubheading',['Somedata']]]
]
Indented for clarity.
Any help would be greatly appreciated. Thank you!
So far, with help, we got:
def parse(data):
    stack = [[]]
    levels = [0]
    current = stack[0]
    for line in data.splitlines():
        indent = len(line)-len(line.lstrip())
        if indent > levels[-1]:
            levels.append(indent)
            stack.append([])
            current.append(stack[-1])
            current = stack[-1]
        elif indent < levels[-1]:
            stack.pop()
            current = stack[-1]
            levels.pop()
        current.append(line.strip().rstrip(':'))
    return stack
The problem with that code is that it returns...
[
'Heading ',
['Subheading '],
'AnotherHeading ',
['AnotherSubheading ', ['Somedata'], 'Heading ', 'Subheading '], 'AnotherHeading ',
['AnotherSubheading ', ['Somedata']]
]
Here is a repl: https://repl.it/yvM/1
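One likely cause of that output, as far as I can tell from the parse function above: the dedent branch pops only one level per line, so a jump from depth 2 straight back to depth 0 leaves the parser one level too deep. A minimal sketch of unwinding every deeper level instead (close_levels is a hypothetical helper name, not part of the original code):

```python
def close_levels(levels, indent):
    """Pop every open level deeper than `indent`; return how many were closed."""
    closed = 0
    while indent < levels[-1]:
        levels.pop()
        closed += 1
    return closed

# Open indentation levels right after reading "Somedata" (depth 2).
levels = [0, 1, 2]
closed = close_levels(levels, 0)  # the next line is a top-level "Heading"
print(closed, levels)
```

With a single pop, levels would still be [0, 1] at that point, which would explain why the second 'Heading ' ends up inside the 'AnotherSubheading ' list.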
Thank you both, kirbyfan64sos and SuperBiasedMan. This is the version I ended up with:
def parse(data):
    currentTab = 0
    currentList = []
    result = [currentList]
    for line in data.splitlines():
        # Number of leading whitespace characters (one per tab).
        tabCount = len(line)-len(line.lstrip())
        line = line.strip().rstrip(' :')
        if tabCount == currentTab:
            currentList.append(line)
        elif tabCount > currentTab:
            # Deeper level: nest a new list inside the current one.
            newList = [line]
            currentList.append(newList)
            currentList = newList
        elif tabCount == 0:
            # Back to a top-level heading.
            currentList = [line]
            result.append(currentList)
        elif tabCount == 1:
            # Back to a subheading under the latest top-level heading.
            currentList = [line]
            result[-1].append(currentList)
        currentTab = tabCount
    print(result)
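Since the stated goal is CSV and JSON export, here is a rough sketch of how the nested list could be written out. It assumes every node has the shape [name, child, ...] shown in the question. The sample data and the flatten_rows helper are hypothetical names made up for illustration, and one CSV row per leaf is just one possible layout.

```python
import csv
import io
import json

# Sample in the shape the question asks for: each node is [name, child, ...].
nested = [
    ['Heading', ['Subheading']],
    ['AnotherHeading', ['AnotherSubheading', ['Somedata']]],
]

def flatten_rows(nodes, path=()):
    """Yield one row per leaf, listing the names from the top level down."""
    for node in nodes:
        name, children = node[0], node[1:]
        here = path + (name,)
        if children:
            yield from flatten_rows(children, here)
        else:
            yield list(here)

# Nested lists of strings serialize to JSON as-is.
json_text = json.dumps(nested, indent=2)

# CSV is flat, so each row holds the path to one leaf.
buffer = io.StringIO()
csv.writer(buffer).writerows(flatten_rows(nested))
csv_text = buffer.getvalue()
print(csv_text)
```

To write real files, the same calls work on a handle opened with open(path, 'w', newline='') in place of the StringIO buffer.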
Well, first you want to clear out unnecessary whitespace, so you make a list of all the lines that contain something more than whitespace, and set up the defaults that the main loop starts from.
teststring = [line for line in teststring.split('\n') if line.strip()]
currentTab = 0
currentList = []
result = [currentList]
This method relies on the mutability of lists, so setting currentList to an empty list and then setting result to [currentList] is an important step: result holds a reference to the same list object, so anything we append to currentList from now on also shows up inside result.
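To make the mutability point concrete, here is a tiny sketch (the variable names are my own, chosen to mirror the ones in the answer):

```python
current_list = []
result = [current_list]      # result stores a reference, not a copy

current_list.append('Heading')
print(result)                # the append is visible through result

# Rebinding the name does not touch what result already holds.
current_list = ['AnotherHeading']
result.append(current_list)
print(result)
```

This is why the loop can keep reassigning currentList to new lists without losing the ones already attached to result.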
for line in teststring:
    i, tabCount = 0, 0
    while line[i] == ' ':
        tabCount += 1
        i += 1
    tabCount /= 8
This is the best way I could think of to check for tab characters at the start of each line. Also, yes, you'll notice I actually checked for spaces, not tabs. Tabs just 100% didn't work; I think it was because I was using repl.it, since I don't have Python 3 installed. It works perfectly fine on Python 2.7, but I won't post code I haven't verified works. I can edit this if you confirm that using \t and removing tabCount /= 8 produces the desired results.
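If you want something that copes with both real tabs and tabs that got expanded to spaces, one option is to normalize first with str.expandtabs. This is only a sketch: indent_level is a hypothetical helper, and the tab width of 8 is an assumption taken from the tabCount /= 8 above. It also uses integer division, since / produces a float on Python 3.

```python
def indent_level(line, tab_width=8):
    """Return the nesting depth of `line`, counting tab_width spaces per level."""
    expanded = line.expandtabs(tab_width)          # turn tabs into spaces
    leading = len(expanded) - len(expanded.lstrip(' '))
    return leading // tab_width                    # integer result on Python 3

print(indent_level('Heading :'))            # no indentation
print(indent_level('\tSubheading :'))       # one real tab
print(indent_level('        Subheading :')) # eight spaces
print(indent_level('\t\tSomedata'))         # two real tabs
```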
Now, check how indented the line is. If it's the same as our currentTab value, then just append to currentList.
    if tabCount == currentTab:
        currentList.append(line.strip())
If it's higher, that means we've gone to a deeper list level. We need a new list nested in currentList.
    elif tabCount > currentTab:
        newList = [line.strip()]
        currentList.append(newList)
        currentList = newList
Going backwards is trickier. Since the data only contains 3 nesting levels, I opted to hardcode what to do with the values 0 and 1 (2 should always result in one of the above blocks). If there are no tabs, we can append a new list to result.
    elif tabCount == 0:
        currentList = [line.strip()]
        result.append(currentList)
It's mostly the same for a heading one tab deep, except that you should append to result[-1], as that's the last main heading to nest into.
    elif tabCount == 1:
        currentList = [line.strip()]
        result[-1].append(currentList)
Lastly, make sure currentTab is updated to our current tabCount so the next iteration behaves properly.
    currentTab = tabCount
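The walkthrough above hardcodes depths 0 and 1. If the data ever nests deeper, a stack of open ancestors avoids the hardcoding. The following is only a sketch of that idea, not the code from this answer: parse_outline is a hypothetical name, and it assumes each node should come out as [name, child, ...] as in the question's target output.

```python
def parse_outline(text):
    """Turn an indented outline into nested [name, child, ...] lists."""
    root = []
    stack = []  # (indent, node) pairs for the currently open ancestors
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        indent = len(raw) - len(raw.lstrip())
        node = [raw.strip().rstrip(' :')]
        # Close every open node that is not an ancestor of this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else root
        parent.append(node)
        stack.append((indent, node))
    return root

sample = (
    "Heading :\n"
    "\tSubheading :\n"
    "\n"
    "AnotherHeading :\n"
    "\tAnotherSubheading :\n"
    "\t\tSomedata\n"
)
print(parse_outline(sample))
```

Because the stack compares indentation widths rather than fixed values, it works with tabs or with any consistent number of spaces per level.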
Something like:
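def parse(data):
    stack = [[]]
    levels = [0]
    current = stack[0]
    for line in data.splitlines():
        indent = len(line)-len(line.lstrip())
        if indent > levels[-1]:
            # Deeper indentation: open a new nested list.
            levels.append(indent)
            stack.append([])
            current.append(stack[-1])
            current = stack[-1]
        else:
            # Shallower indentation: close every deeper level, not just one,
            # so a jump from depth 2 straight back to depth 0 works.
            while indent < levels[-1]:
                stack.pop()
                levels.pop()
            current = stack[-1]
        # Strip the trailing " :" so names come out without a trailing space.
        current.append(line.strip().rstrip(' :'))
    return stack[0]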
Your format looks a lot like YAML, though; you may want to look into PyYAML.