Hello guys, after managing to get some data by scraping with Beautiful Soup, I want to format that data so that I can easily export it to CSV and JSON.
My question is: how can one translate this:
Heading :
    Subheading :
AnotherHeading :
    AnotherSubheading :
        Somedata
Heading :
    Subheading :
AnotherHeading :
    AnotherSubheading :
        Somedata
Into this :
[
['Heading',['Subheading']],
['AnotherHeading',['AnotherSubheading',['Somedata']]],
['Heading',['Subheading']],
['AnotherHeading',['AnotherSubheading',['Somedata']]]
]
Indented for clarity
Any rescue attempt would be appreciated with a warm thank you!
So far with help we got:
def parse(data):
    stack = [[]]
    levels = [0]
    current = stack[0]
    for line in data.splitlines():
        indent = len(line)-len(line.lstrip())
        if indent > levels[-1]:
            levels.append(indent)
            stack.append([])
            current.append(stack[-1])
            current = stack[-1]
        elif indent < levels[-1]:
            stack.pop()
            current = stack[-1]
            levels.pop()
        current.append(line.strip().rstrip(':'))
    return stack
The problem with that code is that it returns...
[
'Heading ',
['Subheading '],
'AnotherHeading ',
['AnotherSubheading ', ['Somedata'], 'Heading ', 'Subheading '], 'AnotherHeading ',
['AnotherSubheading ', ['Somedata']]
]
Here is a repl: https://repl.it/yvM/1
Thank you both kirbyfan64sos and SuperBiasedMan
def parse(data):
    currentTab = 0
    currentList = []
    result = [currentList]
    i = 0
    tabCount = 0
    for line in data.splitlines():
        tabCount = len(line)-len(line.lstrip())
        line = line.strip().rstrip(' :')
        if tabCount == currentTab:
            currentList.append(line)
        elif tabCount > currentTab:
            newList = [line]
            currentList.append(newList)
            currentList = newList
        elif tabCount == 0:
            currentList = [line]
            result.append(currentList)
        elif tabCount == 1:
            currentList = [line]
            result[-1].append(currentList)
        currentTab = tabCount
        tabCount = tabCount + 1
        i = i + 1
    print(result)
Well, first you want to clear out unnecessary whitespace, so make a list of all the lines that contain something more than whitespace, and set up the defaults that the main loop starts from.
teststring = [line for line in teststring.split('\n') if line.strip()]
currentTab = 0
currentList = []
result = [currentList]
This method relies on the mutability of lists, so setting currentList as an empty list and then setting result to [currentList] is an important step, since we can now append to currentList and see the change through result.
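To make the aliasing concrete, here is a minimal sketch of what that setup buys you. The sample strings are just placeholders, and the variable aliased is only there for illustration.

```python
# result holds a reference to the same list object as currentList,
# so appending through currentList is visible through result.
currentList = []
result = [currentList]

currentList.append('Heading')
aliased = result[0] is currentList   # True: same object, not a copy

# Rebinding currentList to a nested list does not change what result
# holds; it only changes where later appends will land.
newList = ['Subheading']
currentList.append(newList)
currentList = newList
currentList.append('Somedata')

print(result)  # [['Heading', ['Subheading', 'Somedata']]]
```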
for line in teststring:
    i, tabCount = 0, 0
    while line[i] == ' ':
        tabCount += 1
        i += 1
    tabCount /= 8
This is the best way I could think of to check for tab characters at the start of each line. And yes, you'll notice I actually checked for spaces, not tabs. Tabs just 100% didn't work; I think that was because I was using repl.it, since I don't have Python 3 installed. It works perfectly fine on Python 2.7, but I won't put up code I haven't verified works. I can edit this if you confirm that using \t and removing tabCount /= 8 produces the desired results.
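As a possible alternative to the character loop, the indent can be measured by comparing the line with its left-stripped version, which is what the other snippets in this thread do. The sketch below assumes one tab per level or a fixed number of spaces per level. The names indent_level and spaces_per_level are made up for this example, and the default of 8 only mirrors the division above.

```python
def indent_level(line, spaces_per_level=8):
    """Return the nesting level of a line (hypothetical helper).

    Assumes one tab per level, or spaces_per_level spaces per level.
    """
    # Expand tabs to spaces so that a leading tab counts as one level.
    expanded = line.expandtabs(spaces_per_level)
    leading = len(expanded) - len(expanded.lstrip(' '))
    # Floor division keeps the result an int on Python 2 and Python 3.
    return leading // spaces_per_level


print(indent_level('Heading :'))             # 0
print(indent_level('\tSubheading :'))        # 1
print(indent_level(' ' * 16 + 'Somedata'))   # 2
```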
Now, check how indented the line is. If it's the same as our currentTab value, then just append to the currentList.
    if tabCount == currentTab:
        currentList.append(line.strip())
If it's higher, that means we've gone to a deeper list level. We need a new list nested in currentList.
    elif tabCount > currentTab:
        newList = [line.strip()]
        currentList.append(newList)
        currentList = newList
Going backwards is trickier. Since the data only contains 3 nesting levels, I opted to hardcode what to do with the values 0 and 1 (2 should always result in one of the above blocks). If there are no tabs, we can append a new list to result.
    elif tabCount == 0:
        currentList = [line.strip()]
        result.append(currentList)
It's mostly the same for a one tab deep heading, except that you should append to result[-1], as that's the last main heading to nest into.
    elif tabCount == 1:
        currentList = [line.strip()]
        result[-1].append(currentList)
Lastly, make sure currentTab is updated to what our current tabCount is so the next iteration behaves properly.
    currentTab = tabCount
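The hardcoded cases for 0 and 1 are fine for three levels. If the data could nest deeper, one way to generalise the same idea is to keep a stack of the lists that are currently open. This is only a sketch under assumptions: parse_nested and spaces_per_level are hypothetical names, the sample uses one tab per level, and a line is never allowed to jump more than one level deeper than the line before it.

```python
def parse_nested(text, spaces_per_level=4):
    """Turn indented text into nested [name, child, ...] lists (a sketch)."""
    result = []
    # stack[d] is the list that an entry at depth d is appended to.
    stack = [result]
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        expanded = line.expandtabs(spaces_per_level)
        leading = len(expanded) - len(expanded.lstrip(' '))
        # Clamp so a line can only go one level deeper than its parent.
        depth = min(leading // spaces_per_level, len(stack) - 1)
        node = [line.strip().rstrip(' :')]
        del stack[depth + 1:]      # close any deeper lists
        stack[depth].append(node)  # attach to the parent at this depth
        stack.append(node)         # this node may get children next
    return result


sample = ("Heading :\n"
          "\tSubheading :\n"
          "AnotherHeading :\n"
          "\tAnotherSubheading :\n"
          "\t\tSomedata\n")
print(parse_nested(sample))
# [['Heading', ['Subheading']], ['AnotherHeading', ['AnotherSubheading', ['Somedata']]]]
```

Going back up any number of levels is handled by the single del statement, so no depth needs its own branch.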
Something like:
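def parse(data):
    stack = [[]]    # lists that are currently open, outermost first
    levels = [0]    # indent width of each open list
    current = stack[0]
    for line in data.splitlines():
        indent = len(line) - len(line.lstrip())
        if indent > levels[-1]:
            # Deeper indent: open a new nested list.
            levels.append(indent)
            stack.append([])
            current.append(stack[-1])
            current = stack[-1]
        # Shallower indent: close every list deeper than this line,
        # not just one (a single pop breaks on a jump of two levels).
        while indent < levels[-1]:
            stack.pop()
            levels.pop()
            current = stack[-1]
        # Strip the trailing " :" so headings carry no stray space.
        current.append(line.strip().rstrip(' :'))
    return stack[0]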
Your format looks a lot like YAML, though; you may want to look into PyYAML.
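Since the goal in the question is export to JSON, it may help to note that a nested list of this shape can be handed straight to the standard json module. A minimal sketch, with the list written out by hand as a stand-in for whatever the parser returns:

```python
import json

# Stand-in for the parser's return value, written out by hand.
nested = ['Heading', ['Subheading'],
          'AnotherHeading', ['AnotherSubheading', ['Somedata']]]

as_json = json.dumps(nested, indent=2)
print(as_json)

# Round trip: loading the text back gives an equal structure.
restored = json.loads(as_json)
print(restored == nested)  # True
```

CSV is a flat format, so the nested lists would have to be flattened into rows first; how to do that depends on which columns you want.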