I have a directory with several xml files. Some files have the following element tags at the bottom of the document:
<items>
<item id="id1" grocery="apple">
<stock id="id1.N1" alt="True" alt_id="10069227" type="fruit" type_id="10067060" />
</item>
<item id="id2" grocery="bannana">
<stock id="id2.N1" alt="True" alt_id="10015946" />
</item>
<item id="id3" grocery="orange">
<stock id="id3.N1" alt="True" alt_id="10019211" />
</item>
<item id="id4" grocery="garlic">
<stock id="id4.N1" alt="False" alt_id="10028810" />
</item>
<item id="id5" grocery="tomato">
<stock id="id5.N1" alt="False" alt_id="10020751" type="vegetable" type_id="10020756" />
</item>
<item id="id6" grocery="carrot">
<stock id="id6.N1" alt="False" alt_id="10037087" type="vegetable" type_id="10023084" />
</item>
<item id="AR7" grocery="onion">
<stock id="AR7.N1" alt="False" alt_id="10037844" />
</item>
<item id="id8" grocery="water mellon">
<stock id="id8.N1" alt="True" alt_id="10024570" type="fruit" type_id="10042703" />
</item>
<item id="id9" grocery="cherry">
<stock id="id9.N1" alt="True" alt_id="10042727" type="fruit" type_id="10042706" />
</item>
<item id="id10" grocery="Apricot">
<stock id="id10.N1" alt="False" alt_id="10034829" type="fruit" type_id="10043525" />
</item>
</items>
How can I extract the grocery, type, type_id, alt and alt_id attributes inside the items tags if they exist, and store them in a data frame?
id grocery alt alt_id type type_id
id1 apple true 10069227 fruit 10067060
id2 bannana true 10015946 NaN NaN
id3 orange true 10019211 NaN NaN
id4 garlic false 10028810 NaN NaN
...
id10 apricot false 10034829 fruit 10043525
Note that for values or attributes that do not exist I would like to insert NaN. So far I have tried:
import glob
import re
import pandas as pd
from bs4 import BeautifulSoup

data = []
for filename in glob.glob('../dir/*xml'):
    soup = BeautifulSoup(open(filename), "lxml")
    for element1 in soup(re.compile(r"items")):
        data.append({**element1.attrs, **{'filename': filename, 'type': element1.name}})
    for element2 in soup(re.compile(r"stock")):
        data.append({**element2.attrs, **{'filename': filename, 'type': element2.name}})
        #print(element2)
df = pd.DataFrame(data)
However, it's not working. As you can see below, the code above omits some of the XML attributes I am interested in.
This is the actual output:
filename grocery id type
0 /Users/user/Downloads/test.xml NaN NaN items
1 /Users/user/Downloads/test.xml NaN NaN items
2 /Users/user/Downloads/test.xml apple id1 item
3 /Users/user/Downloads/test.xml bannana id2 item
4 /Users/user/Downloads/test.xml orange id3 item
5 /Users/user/Downloads/test.xml garlic id4 item
6 /Users/user/Downloads/test.xml tomato id5 item
7 /Users/user/Downloads/test.xml carrot id6 item
8 /Users/user/Downloads/test.xml onion AR7 item
9 /Users/user/Downloads/test.xml water mellon id8 item
Any idea of how to get the above dataframe?
UPDATE
To adapt @piRSquared's answer to all the XML files in my directory, I tried:
for filename in glob.glob('../dir/*xml'):
    #soup = BeautifulSoup(open(filename), "lxml")
    etree = ET.ElementPath(filename)
    pd.DataFrame([obs2series(o) for o in etree.findall('item')])
However, I got:
---> 47 etree = ET.ElementPath(filename)
48 pd.DataFrame([obs2series(o) for o in etree.findall('item')])
49
TypeError: 'module' object is not callable
How can I do it for a directory full of xmls?
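import pandas as pd
from cytoolz.dicttoolz import merge
from cytoolz import concat
from bs4 import BeautifulSoup
from glob import glob

lox = glob('./*xml')  # list of XML files to process

def p_item(i):
    # Merge the attributes of every <stock> child with those of the <item>.
    # The <item> attributes come last, so its id overrides the stock id.
    s = i.find_all('stock')
    return merge([j.attrs for j in s] + [i.attrs])

def p_soup(f):
    # Parse one file and return one dict per <item>.
    with open(f) as fh:
        soup = BeautifulSoup(fh, "lxml")
    return [p_item(i) for i in soup.find_all('item')]

# One row per <item>; attributes missing from an item become NaN.
pd.DataFrame(list(concat([p_soup(f) for f in lox])))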
alt alt_id grocery id type type_id
0 True 10069227 apple id1 fruit 10067060
1 True 10015946 bannana id2 NaN NaN
2 True 10019211 orange id3 NaN NaN
3 False 10028810 garlic id4 NaN NaN
4 False 10020751 tomato id5 vegetable 10020756
5 False 10037087 carrot id6 vegetable 10023084
6 False 10037844 onion AR7 NaN NaN
7 True 10024570 water mellon id8 fruit 10042703
8 True 10042727 cherry id9 fruit 10042706
9 False 10034829 Apricot id10 fruit 10043525
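As for the TypeError in the update: assuming `ET` there is `xml.etree.ElementTree` (the import is not shown), `ET.ElementPath` is a submodule rather than a function, which is why calling it fails. The function that reads a file is `ET.parse(filename)`. Below is a minimal sketch of the same idea using only the standard library parser plus pandas. The names `COLUMNS`, `item_to_row` and `items_to_frame` are made up for illustration, and the sketch assumes every file is well-formed XML, since ElementTree is stricter than BeautifulSoup.

```python
import glob
import xml.etree.ElementTree as ET

import pandas as pd

# Column order requested in the question (illustrative constant).
COLUMNS = ["id", "grocery", "alt", "alt_id", "type", "type_id"]


def item_to_row(item):
    """Merge the attributes of an <item> with those of its <stock> children."""
    row = {}
    for stock in item.iter("stock"):
        row.update(stock.attrib)
    # Apply the <item> attributes last so that its id wins over the stock id.
    row.update(item.attrib)
    return row


def items_to_frame(paths):
    """Build one DataFrame from the <item> elements of every file in paths."""
    rows = []
    for path in paths:
        root = ET.parse(path).getroot()
        # iter() searches the whole tree, so <items> may sit at any depth.
        for item in root.iter("item"):
            rows.append(item_to_row(item))
    # Passing columns= fixes the order; attributes an item lacks become NaN.
    return pd.DataFrame(rows, columns=COLUMNS)


if __name__ == "__main__":
    df = items_to_frame(sorted(glob.glob("../dir/*xml")))
    print(df)
```

Files that contain no item elements simply contribute no rows, so the same call works for a directory in which only some files have the items block.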