To make this question easier to understand, below is an example
<Tag name="Thumbnail" inline="no" nonsearchable="yes">
<Attribute>
<Attribute name="AText" Searchable="yes"></Attribute>
</Attribute>
</Tag>
<Tag name="Label" inline="no" nonsearchable="yes">
<Attribute>
<Attribute name="AText" Searchable="no"></Attribute>
</Attribute>
</Tag>
<Tag name="Image" inline="no" nonsearchable="yes">
<Attribute>
<Attribute name="BText" Searchable="yes">
</Attribute>
</Tag>
<Tag name="Wonder" inline="no" nonsearchable="yes">
<Attribute>
<Attribute name="BText" Searchable="yes"></Attribute>
</Attribute>
</Tag>
Expected Result
So in the excel, the first row should be the the Attribute tag's name value if the Attribute tag's Searchable value is " yes "; Then these "qualified" Attributes tag's parent tag - Tag - name value would be listed underneath.
Currently, I can only find all the Tag 's name value if it's children's Searchable value is "yes", but can't categorize them under corresponding Attribute tag's name value. Below is my initial code:
import os, openpyxl
from bs4 import BeautifulSoup
cwd = os.getcwd()
def func(x):
for file in os.listdir(cwd):
if file.endswith('.xml'):
f = open(file, encoding = 'utf=8', mode = 'r+')
soup = BeautifulSoup(f, 'lxml')
AttrYES = soup.find_all(attrs={"Searchable": "yes"})
for items in AttrYES:
tagName = items.parent.parent.get('name')
print (tagName)
x = os.listdir(cwd)
func(x)
I'll try to work this out as well, but to make the process faster, if you have any ideas, please advice. Thank you!!
Your code cannot find anything, if you print AttrYES
, it will be []
. The thing is that when you use bs4
with parser lxml
, all tag and attr name will convert to lowercase,refer to official doc . If you print the soup, it will give you:
<html><body><tag inline="no" name="Thumbnail" nonsearchable="yes">
<attribute>
<attribute name="AText" searchable="yes"></attribute>
</attribute>
</tag>
<tag inline="no" name="Label" nonsearchable="yes">
<attribute>
<attribute name="AText" searchable="no"></attribute>
</attribute>
</tag>
<tag inline="no" name="Image" nonsearchable="yes">
<attribute>
<attribute name="BText" searchable="yes">
</attribute>
</attribute></tag>
<tag inline="no" name="Wonder" nonsearchable="yes">
<attribute>
<attribute name="BText" searchable="yes"></attribute>
</attribute>
</tag></body></html>
Hence, you could amend your code like this:
import bs4
f = open('test.xml',mode = 'r+')
soup = bs4.BeautifulSoup(f, 'lxml')
AttrYES = soup.findAll(attrs={'searchable': 'yes'})
result = dict()
for items in AttrYES:
result[items.get('name')] = result.get(items.get('name'),[])+[items.parent.parent.get('name')]
print(result)
The print will be :
{'AText': ['Thumbnail'], 'BText': ['Image', 'Wonder']}
Then you could write them to your excel file:
import xlsxwriter
workbook = xlsxwriter.Workbook('result.xlsx')
worksheet = workbook.add_worksheet()
# Write header
worksheet.write(0, 0, result.keys()[0])
worksheet.write(0, 1, result.keys()[1])
# Write data.
worksheet.write_column(1, 0, result.values()[0])
worksheet.write_column(1, 1, result.values()[1])
workbook.close()
The result.xlsx
will be:
Update: with openpyxl
from openpyxl import Workbook
wb = Workbook()
# grab the active worksheet
ws = wb.active
# Data can be assigned directly to cells
i,j = 1,1
for keys,values in a.items():
ws.cell(column=i, row=1, value=keys)
for row in range(len(values)):
ws.cell(column=i, row=j+1, value=values[row])
j+=1
j=1
i+=1
wb.save("result.xlsx")
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.