How to remove multiple spaces and characters from an XML file using regex with Python?
I have hundreds of lines in an XML file, like these two examples:
<settings site_id="someID123" xmltv_id="Some text - dummy (2) HH">Some text - dummy (2) HH</settings>
<settings site_id="moreID321" xmltv_id="More Text">More Text</settings>
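I want to use a Python regex to format everything inside xmltv_id="HERE" so that it contains no spaces, dashes, or parentheses, and to append .xx at the end, so that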
xmltv_id="Some text - dummy (2) HH"
xmltv_id="More Text"
becomes this:
xmltv_id="Sometextdummy2HH.xx"
xmltv_id="MoreText.xx"
How can I do this?
Thanks!
Consider the following approach: read and parse the XML, modify the data, then write the XML back out.
import xml.etree.ElementTree as ET
tree = ET.parse('1.xml')
for element in tree.findall('settings'):
    # Only spaces are removed here; dashes and parentheses are kept.
    element.set('xmltv_id', element.get('xmltv_id').replace(' ', ''))
tree.write('2.xml')
Original XML, 1.xml:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<settings site_id="someID123" xmltv_id="Some text - dummy (2) HH">Some text - dummy (2) HH</settings>
</note>
Modified XML, 2.xml:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<settings site_id="someID123" xmltv_id="Sometext-dummy(2)HH">Some text - dummy (2) HH</settings>
</note>
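Note that the output above still contains the dash and the parentheses and has no .xx suffix, because the snippet only replaces spaces. Below is a minimal sketch of how the same parse-modify-write approach might cover the full requirement without a regex, using str.translate. The helper name clean_id and the inline sample XML are illustrative assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Map each unwanted character to None so str.translate deletes it.
# Assumption: only spaces, dashes, and parentheses need removing.
STRIP_TABLE = str.maketrans('', '', ' -()')

def clean_id(value, suffix='.xx'):
    """Hypothetical helper: drop unwanted characters and append the suffix."""
    return value.translate(STRIP_TABLE) + suffix

# Inline sample standing in for 1.xml
sample = (
    '<note>'
    '<settings site_id="someID123" xmltv_id="Some text - dummy (2) HH">'
    'Some text - dummy (2) HH</settings>'
    '</note>'
)
root = ET.fromstring(sample)
for element in root.findall('settings'):
    old = element.get('xmltv_id')
    if old is not None:  # skip elements without the attribute
        element.set('xmltv_id', clean_id(old))

print(root.find('settings').get('xmltv_id'))  # Sometextdummy2HH.xx
```

With a real file, ET.parse('1.xml') and tree.write('2.xml') would replace the inline sample, as in the snippet above.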
A regex is never a robust and appropriate approach when parsing structured data such as XML/HTML. Use a proper parser.
Using the xml.etree.ElementTree module and the re.sub function:
import xml.etree.ElementTree as ET
import re
root = ET.parse('yourxml.xml').getroot()
pat = re.compile(r'[\s()-]+') # regex character class for chars to replace
for el in root.findall('settings[@xmltv_id]'):
    el.set("xmltv_id", pat.sub('', el.get("xmltv_id")) + '.xx')
ET.dump(root)
Sample output:
<main>
<settings site_id="someID123" xmltv_id="Sometextdummy2HH.xx">Some text - dummy (2) HH</settings>
<settings site_id="moreID321" xmltv_id="MoreText.xx">More Text</settings>
</main>
You can easily save the resulting ElementTree to a new file using https://docs.python.org/3.7/library/xml.etree.elementtree.html#xml.etree.ElementTree.ElementTree.write.
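A minimal sketch of that saving step, assuming a root element like the one in the snippet above. The inline sample XML and the temporary output path are placeholders for illustration only.

```python
import os
import re
import tempfile
import xml.etree.ElementTree as ET

# Inline sample standing in for yourxml.xml
root = ET.fromstring(
    '<main><settings site_id="moreID321" xmltv_id="More Text">More Text</settings></main>'
)
pat = re.compile(r'[\s()-]+')
for el in root.findall('settings[@xmltv_id]'):
    el.set('xmltv_id', pat.sub('', el.get('xmltv_id')) + '.xx')

# Wrap the root in an ElementTree so that write() is available.
out_path = os.path.join(tempfile.mkdtemp(), 'output.xml')  # placeholder path
ET.ElementTree(root).write(out_path, encoding='utf-8', xml_declaration=True)

# Read the file back to confirm the change was saved.
saved = ET.parse(out_path).getroot()
print(saved.find('settings').get('xmltv_id'))  # MoreText.xx
```

If the tree object returned by ET.parse is kept instead of calling getroot() right away, tree.write can be called on it directly.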
I don't think you can do this with a single regular expression in Python. The solution I can think of is something like this:
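import re
def format_line(line):
    # Capture the text before, inside, and after the xmltv_id value.
    m = re.search(r'(.*xmltv_id=")([^"]*)(".*)', line)
    if m is None:
        return line  # no xmltv_id attribute on this line
    # Remove spaces, dashes, and parentheses from the value.
    stripped_tag = re.sub(r'[ ()-]', '', m.group(2))
    return f'{m.group(1)}{stripped_tag}.xx{m.group(3)}'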
>>> format_line('<settings site_id="someID123" xmltv_id="Some text - dummy (2) HH">Some text - dummy (2) HH</settings>')
'<settings site_id="someID123" xmltv_id="Sometextdummy2HH.xx">Some text - dummy (2) HH</settings>'
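For what it is worth, one top-level re.sub call can process a whole line or file's text if the replacement is a function rather than a string. The following is a hedged sketch: the names ATTR, _clean, and fix_ids are made up for illustration, and it assumes attribute values are double-quoted and contain no escaped quotes.

```python
import re

# Match the attribute name and its double-quoted value.
ATTR = re.compile(r'(xmltv_id=")([^"]*)(")')

def _clean(match):
    # Strip whitespace, dashes, and parentheses from the value, then add ".xx".
    value = re.sub(r'[\s()-]', '', match.group(2))
    return f'{match.group(1)}{value}.xx{match.group(3)}'

def fix_ids(text):
    """Hypothetical helper: rewrite every xmltv_id value found in text."""
    return ATTR.sub(_clean, text)

line = '<settings site_id="moreID321" xmltv_id="More Text">More Text</settings>'
print(fix_ids(line))
# <settings site_id="moreID321" xmltv_id="MoreText.xx">More Text</settings>
```

Because only the attribute value is matched, the element text between the tags is left untouched. This remains a text-level workaround and carries the usual fragility of applying a regex to XML.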
With re:
import re
xmltv_id1="Some text - dummy (2) HH"
xmltv_id2="More Text"
replace_regex = r'\s|[-]|[(]|[)]'
print(re.sub(replace_regex, '', xmltv_id1) + '.xx')  # Sometextdummy2HH.xx
print(re.sub(replace_regex, '', xmltv_id2) + '.xx')  # MoreText.xx