How to mix pandas and BeautifulSoup to extract some element tags from a directory of XML files?
I have a directory with several XML files. Some files have the following element tags at the bottom of the document:
<items>
<item id="id1" grocery="apple">
<stock id="id1.N1" alt="True" alt_id="10069227" type="fruit" type_id="10067060" />
</item>
<item id="id2" grocery="bannana">
<stock id="id2.N1" alt="True" alt_id="10015946" />
</item>
<item id="id3" grocery="orange">
<stock id="id3.N1" alt="True" alt_id="10019211" />
</item>
<item id="id4" grocery="garlic">
<stock id="id4.N1" alt="False" alt_id="10028810" />
</item>
<item id="id5" grocery="tomato">
<stock id="id5.N1" alt="False" alt_id="10020751" type="vegetable" type_id="10020756" />
</item>
<item id="id6" grocery="carrot">
<stock id="id6.N1" alt="False" alt_id="10037087" type="vegetable" type_id="10023084" />
</item>
<item id="AR7" grocery="onion">
<stock id="AR7.N1" alt="False" alt_id="10037844" />
</item>
<item id="id8" grocery="water mellon">
<stock id="id8.N1" alt="True" alt_id="10024570" type="fruit" type_id="10042703" />
</item>
<item id="id9" grocery="cherry">
<stock id="id9.N1" alt="True" alt_id="10042727" type="fruit" type_id="10042706" />
</item>
<item id="id10" grocery="Apricot">
<stock id="id10.N1" alt="False" alt_id="10034829" type="fruit" type_id="10043525" />
</item>
</items>
How can I extract the grocery, type, type_id, alt and alt_id attributes inside the items tags, if they exist, and store them in a data frame like this one?
id    grocery  alt    alt_id    type   type_id
id1   apple    True   10069227  fruit  10067060
id2   bannana  True   10015946  NaN    NaN
id3   orange   True   10019211  NaN    NaN
id4   garlic   False  10028810  NaN    NaN
...
id10  Apricot  False  10034829  fruit  10043525
Note that for values or tags that do not exist I would like to add a NaN. So far I have tried:
import glob
import re
import pandas as pd
from bs4 import BeautifulSoup

data = []
for filename in glob.glob('../dir/*xml'):
    soup = BeautifulSoup(open(filename), "lxml")
    for element1 in soup(re.compile(r"items")):
        data.append({**element1.attrs, **{'filename': filename, 'type': element1.name}})
    for element2 in soup(re.compile(r"stock")):
        data.append({**element2.attrs, **{'filename': filename, 'type': element2.name}})
        #print(element2)
df = pd.DataFrame(data)
However, it's not working. As you can see, the above code omitted some of the XML attributes I am interested in.
This is the actual output:
   filename                        grocery       id   type
0  /Users/user/Downloads/test.xml  NaN           NaN  items
1  /Users/user/Downloads/test.xml  NaN           NaN  items
2  /Users/user/Downloads/test.xml  apple         id1  item
3  /Users/user/Downloads/test.xml  bannana       id2  item
4  /Users/user/Downloads/test.xml  orange        id3  item
5  /Users/user/Downloads/test.xml  garlic        id4  item
6  /Users/user/Downloads/test.xml  tomato        id5  item
7  /Users/user/Downloads/test.xml  carrot        id6  item
8  /Users/user/Downloads/test.xml  onion         AR7  item
9  /Users/user/Downloads/test.xml  water mellon  id8  item
Any idea how to get the data frame I described above?
UPDATE
After trying to adapt @piRSquared's answer to all the XML files in my directory, I tried:
for filename in glob.glob('../dir/*xml'):
    #soup = BeautifulSoup(open(filename), "lxml")
    etree = ET.ElementPath(filename)
    pd.DataFrame([obs2series(o) for o in etree.findall('item')])
However, I got:
---> 47 etree = ET.ElementPath(filename)
48 pd.DataFrame([obs2series(o) for o in etree.findall('item')])
49
TypeError: 'module' object is not callable
How can I do this for a directory full of XML files?
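The TypeError arises because `ET.ElementPath` is a module inside `xml.etree`, not a function; `ET.parse(filename)` is the call that reads a file into a tree. Below is a minimal standard-library sketch of the directory loop, assuming each `<item>` holds `<stock>` children as in the sample above. The helpers `item_to_row` and `rows_from_dir` are hypothetical names introduced here for illustration and are not taken from @piRSquared's answer.

```python
import glob
import xml.etree.ElementTree as ET


def item_to_row(item):
    """Merge the attributes of an <item> and its <stock> children into one dict."""
    row = {}
    for stock in item.iter("stock"):
        row.update(stock.attrib)
    # Apply the item's own attributes last so its id wins over the stock id.
    row.update(item.attrib)
    return row


def rows_from_dir(pattern):
    """Collect one row per <item> from every XML file matching the glob pattern."""
    rows = []
    for filename in sorted(glob.glob(pattern)):
        root = ET.parse(filename).getroot()
        # iter() finds <item> at any depth, so an <items> block sitting at the
        # bottom of a larger document is found; files without one add no rows.
        for item in root.iter("item"):
            rows.append(item_to_row(item))
    return rows
```

Passing the result to pandas, for example `pd.DataFrame(rows_from_dir('../dir/*.xml'))`, should give one row per item, with NaN wherever a key such as `type` is absent from a row.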
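import pandas as pd
from cytoolz.dicttoolz import merge
from cytoolz import concat
from bs4 import BeautifulSoup
from glob import glob

# list of XML files in the current directory
lox = glob('./*xml')

def p_item(i):
    # merge the attributes of every <stock> child with those of the <item>;
    # the item's attributes come last, so its id overrides the stock id
    s = i.find_all('stock')
    return merge([j.attrs for j in s] + [i.attrs])

def p_soup(f):
    # parse one file and return one merged dict per <item>
    with open(f) as fh:
        soup = BeautifulSoup(fh, "lxml")
    return [p_item(i) for i in soup.find_all('item')]

# flatten the per-file lists into one list of rows; missing keys become NaN
pd.DataFrame(list(concat([p_soup(f) for f in lox])))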
     alt    alt_id       grocery    id       type   type_id
0   True  10069227         apple   id1      fruit  10067060
1   True  10015946       bannana   id2        NaN       NaN
2   True  10019211        orange   id3        NaN       NaN
3  False  10028810        garlic   id4        NaN       NaN
4  False  10020751        tomato   id5  vegetable  10020756
5  False  10037087        carrot   id6  vegetable  10023084
6  False  10037844         onion   AR7        NaN       NaN
7   True  10024570  water mellon   id8      fruit  10042703
8   True  10042727        cherry   id9      fruit  10042706
9  False  10034829       Apricot  id10      fruit  10043525
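The result above lists the columns alphabetically and keeps every attribute value as a string, whereas the question asked for the order id, grocery, alt, alt_id, type, type_id. The following is a small sketch of one way to tidy that up, assuming pandas and two sample rows copied from the data above. The `wanted` list and the boolean mapping are illustrative choices and not part of the original answer.

```python
import pandas as pd

# Rows shaped like the merged item/stock attributes shown above (sample values).
rows = [
    {"id": "id1", "grocery": "apple", "alt": "True", "alt_id": "10069227",
     "type": "fruit", "type_id": "10067060"},
    {"id": "id2", "grocery": "bannana", "alt": "True", "alt_id": "10015946"},
]

# Column order requested in the question.
wanted = ["id", "grocery", "alt", "alt_id", "type", "type_id"]

# reindex keeps only the wanted columns, in this order, and creates any
# column that is missing from every row as an all-NaN column.
df = pd.DataFrame(rows).reindex(columns=wanted)

# Attribute values arrive as strings; map "True"/"False" to real booleans.
df["alt"] = df["alt"].map({"True": True, "False": False})
```

Using `reindex` rather than plain column selection means the code should not raise a KeyError when none of the files happens to contain, say, a `type` attribute.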