How to split up XML tags in Python from an input.txt, then format them nicely (tabs, newlines, nesting)?
I'm trying to take a file.txt that contains text like:
<a> hello
<a>world
how <a>are
</a>you?</a><a></a></a>
and turn it into text like:
<a>
hello
<a>
world how
<a>
are
</a>
you?
</a>
<a>
</a>
My original thought was to create an XML item that holds a tag and its content (a list), and then nest more XML items inside that list to hold their own content. After spending some time on it, I feel like I'm going about it the wrong way.
For this I can't use any libraries like ElementTree; I want to solve the problem from scratch. I'm not looking for a complete answer. I'm just hoping someone can point me in the right direction so I don't waste more hours building a useless code base.
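One way to picture the nested-item idea described above is the rough sketch below. It is only an illustration, not a full parser: the names `XmlItem` and `build_tree` are made up for this example, tags are assumed to look like `<name>` or `</name>` with no attributes, and a list is used as a stack to remember which items are still open.

```python
import re


class XmlItem:
    """Hypothetical container: a tag name plus an ordered list of content.

    Each content entry is either a plain string or a nested XmlItem.
    """

    def __init__(self, tag):
        self.tag = tag
        self.content = []


def build_tree(text):
    """Return the list of top-level nodes found in text."""
    root = XmlItem(None)   # artificial root that collects top-level nodes
    open_items = [root]    # stack of items that have not been closed yet
    # The capture group makes re.split keep the tags in its result.
    for token in re.split(r"(<[^<>]+>)", text):
        token = " ".join(token.split())  # collapse newlines and repeated spaces
        if not token:
            continue
        is_tag = token.startswith("<") and token.endswith(">")
        if is_tag and token.startswith("</"):
            if len(open_items) == 1:
                raise ValueError("closing tag without an opening tag: " + token)
            open_items.pop()
        elif is_tag:
            item = XmlItem(token[1:-1])
            open_items[-1].content.append(item)  # nest inside the current item
            open_items.append(item)
        else:
            open_items[-1].content.append(token)  # plain text
    if len(open_items) != 1:
        raise ValueError("unclosed tag: " + open_items[-1].tag)
    return root.content
```

Once the text is in this shape, pretty-printing becomes a recursive walk that prints each item's tag, then its content one indentation level deeper.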
-----------------------------------Answer Below--------------------------
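import re

# "stack" is not part of the standard library; it is presumably the author's
# own module (not shown here) providing push(), pop() and isEmpty().
from stack import Stack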
def findTag(string):
    # Returns (tag, start_index) for the first XML tag in the string,
    # or None if the string contains no tag.
    match = re.search(r"<(.+?)>", string)
    if match is None:
        return None
    return match.group(0), match.start(0)
def isTag(string):
    # Returns True if the string contains an XML tag, False otherwise.
    return re.search(r"<(.+?)>", string) is not None
def split_tags_and_string(string):
    # Splits the input into a flat list of tags and pieces of text.
    L = []
    for line in string.split("\n"):  # use the argument, not the global s
        temp = line
        while len(temp) > 0:  # string still has some characters
            tag_tuple = findTag(temp)  # (tag, starting index) or None
            if tag_tuple is not None:  # there is a tag in the temp string
                if tag_tuple[1] == 0:  # tag is at the front of the temp string
                    L.append(tag_tuple[0].strip())
                    temp = temp[len(tag_tuple[0]):].strip()
                else:  # text comes before the tag
                    L.append(temp[0:tag_tuple[1]].strip())
                    temp = temp[tag_tuple[1]:].strip()
            else:  # there is no tag in the temp string
                L.append(temp.strip())
                temp = ''
    return L
def check_tags(formatted_list):
    # Verifies that the tags are balanced.
    # Returns (is_valid, number of items examined).
    stack = Stack()
    x = 0
    try:
        for item in formatted_list:
            tag = findTag(item)
            if tag is not None:
                if tag[0].find('/') == -1:  # opening tag
                    endtag = tag[0][0:1] + '/' + tag[0][1:]
                    if formatted_list.count(tag[0]) != formatted_list.count(endtag):
                        return False, x  # tag counts don't match
                    stack.push(tag[0])
                else:  # closing tag
                    stack.pop()
            x += 1
    except Exception:  # e.g. pop() failing because no tag is open
        return False, x
    return stack.isEmpty(), x
def print_xml_list(formatted_list):
    # Builds the indented output string from the flat list of tags and text.
    indent = 0
    string = str()
    previousIsString = False
    for item in formatted_list:
        if len(item) == 0:
            continue
        if isTag(item) and item.find('/') == -1:  # opening tag
            if previousIsString:
                # text never ends with a newline, so always start a new line
                string += '\n'
            string += ('\t' * indent + item + '\n')
            indent += 1  # increases indent
            previousIsString = False
        elif isTag(item):  # closing tag
            if previousIsString:
                string += '\n'
            indent -= 1  # reduces indent
            string += ('\t' * indent + item + '\n')
            previousIsString = False
        else:  # plain text
            if previousIsString:
                string += (' ' + item + ' ')  # continues the line, no tabs
            else:
                string += ('\t' * indent + item + ' ')  # starts the line with tabs
            previousIsString = True
    return string
if __name__ == "__main__":
    filename = input("enter file name: ")
    with open(filename, 'r') as file:  # closes the file automatically
        s = file.read()
    formatted = split_tags_and_string(s)  # flat list of tags and text
    isGood = check_tags(formatted)  # (is_valid, items examined)
    if not isGood[0]:  # if the xml is bad, say so and stop
        print("The xml file is bad.")
    else:
        string = print_xml_list(formatted)  # adds indentation and newlines
        print(string)  # prints the final result
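The `stack` module imported at the top is not included in the post. A minimal stand-in, assuming it only needs the three methods the code calls (`push`, `pop`, `isEmpty`), could look like the sketch below; saving it as `stack.py` next to the script should be enough for the import to work. This is a guess at the interface, not the author's actual module.

```python
class Stack:
    """Minimal LIFO stack backed by a Python list (hypothetical stand-in)."""

    def __init__(self):
        self._items = []

    def push(self, item):
        # Put an item on top of the stack.
        self._items.append(item)

    def pop(self):
        # list.pop raises IndexError on an empty list, which the try/except
        # in check_tags turns into a "bad xml" result.
        return self._items.pop()

    def isEmpty(self):
        return len(self._items) == 0
```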
No one provided an answer, so here is my basic way to parse XML. It does not have the functionality to handle things like
I provided the answer above. Hopefully this will be useful to someone with a similar curiosity.
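One limitation worth knowing: `check_tags` compares how many opening and closing tags there are and pops the stack on every closing tag, but it never compares the popped tag with the closing tag's name, so something like `<a><b></a></b>` would pass. A stricter check might look like the sketch below. The function name `tags_are_balanced` is made up for this example, and it assumes simple tags without attributes.

```python
import re


def tags_are_balanced(tokens):
    """Return True if every closing tag matches the most recently opened tag."""
    open_tags = []  # a plain list used as a stack of open tag names
    for token in tokens:
        match = re.fullmatch(r"<(/?)([^<>/]+)>", token)
        if match is None:
            continue  # plain text, nothing to check
        is_closing = match.group(1) == "/"
        name = match.group(2).strip()
        if not is_closing:
            open_tags.append(name)
        elif not open_tags or open_tags.pop() != name:
            return False  # nothing open, or the wrong tag is being closed
    return not open_tags  # leftover open tags mean the input is unbalanced
```

It takes the same kind of flat list that `split_tags_and_string` produces, so it could be called on that list before formatting.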