How to mix pandas and BeautifulSoup to extract some element tags from a directory of XML files?
I have a directory containing several XML files. Some of the files have the following element tags at the bottom of the document:
<items>
<item id="id1" grocery="apple">
<stock id="id1.N1" alt="True" alt_id="10069227" type="fruit" type_id="10067060" />
</item>
<item id="id2" grocery="bannana">
<stock id="id2.N1" alt="True" alt_id="10015946" />
</item>
<item id="id3" grocery="orange">
<stock id="id3.N1" alt="True" alt_id="10019211" />
</item>
<item id="id4" grocery="garlic">
<stock id="id4.N1" alt="False" alt_id="10028810" />
</item>
<item id="id5" grocery="tomato">
<stock id="id5.N1" alt="False" alt_id="10020751" type="vegetable" type_id="10020756" />
</item>
<item id="id6" grocery="carrot">
<stock id="id6.N1" alt="False" alt_id="10037087" type="vegetable" type_id="10023084" />
</item>
<item id="AR7" grocery="onion">
<stock id="AR7.N1" alt="False" alt_id="10037844" />
</item>
<item id="id8" grocery="water mellon">
<stock id="id8.N1" alt="True" alt_id="10024570" type="fruit" type_id="10042703" />
</item>
<item id="id9" grocery="cherry">
<stock id="id9.N1" alt="True" alt_id="10042727" type="fruit" type_id="10042706" />
</item>
<item id="id10" grocery="Apricot">
<stock id="id10.N1" alt="False" alt_id="10034829" type="fruit" type_id="10043525" />
</item>
</items>
How can I extract the grocery, type, type_id, alt, and alt_id attributes (where they exist) from inside the items tag and store them in a DataFrame?
id    grocery  alt    alt_id    type   type_id
id1   apple    true   10069227  fruit  10067060
id2   bannana  true   10015946  NaN    NaN
id3   orange   true   10019211  NaN    NaN
id4   garlic   false  10028810  NaN    NaN
...
id10  Apricot  false  10034829  fruit  10043525
Note that for values or tags that do not exist, I would like to add a NaN. So far I have tried:
import glob
import re
import pandas as pd
from bs4 import BeautifulSoup
data = []
for filename in glob.glob('../dir/*xml'):
    soup = BeautifulSoup(open(filename), "lxml")
    for element1 in soup(re.compile(r"items")):
        data.append({**element1.attrs, **{'filename': filename, 'type': element1.name}})
    for element2 in soup(re.compile(r"stock")):
        data.append({**element2.attrs, **{'filename': filename, 'type': element2.name}})
        #print(element2)
df = pd.DataFrame(data)
However, it does not work. As you can see, the code above omits some of the XML attributes I am interested in.
This is the actual output:
filename grocery id type
0 /Users/user/Downloads/test.xml NaN NaN items
1 /Users/user/Downloads/test.xml NaN NaN items
2 /Users/user/Downloads/test.xml apple id1 item
3 /Users/user/Downloads/test.xml bannana id2 item
4 /Users/user/Downloads/test.xml orange id3 item
5 /Users/user/Downloads/test.xml garlic id4 item
6 /Users/user/Downloads/test.xml tomato id5 item
7 /Users/user/Downloads/test.xml carrot id6 item
8 /Users/user/Downloads/test.xml onion AR7 item
9 /Users/user/Downloads/test.xml water mellon id8 item
Any ideas on how to get the DataFrame shown above?
UPDATE
After trying to adapt @piRSquared's answer to all the XML files in the directory, I tried:
for filename in glob.glob('../dir/*xml'):
    #soup = BeautifulSoup(open(filename), "lxml")
    etree = ET.ElementPath(filename)
    pd.DataFrame([obs2series(o) for o in etree.findall('item')])
However, I got:
---> 47 etree = ET.ElementPath(filename)
48 pd.DataFrame([obs2series(o) for o in etree.findall('item')])
49
TypeError: 'module' object is not callable
How can I do this for a directory full of XML files?
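The TypeError in the traceback most likely comes from the fact that `ET.ElementPath` is a submodule of `xml.etree`, not a callable; the function that reads a file is `ET.parse`. The following is a minimal sketch of the per-directory loop using only the standard library. The helper names `item_to_row` and `rows_from_dir`, the `SAMPLE` document, and its `<root>` wrapper are assumptions made for illustration (they are not from the question), and the files are written to a temporary directory so the example runs on its own.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample file; the <root> wrapper is an assumption, since the
# question only shows the <items> fragment at the bottom of each document.
SAMPLE = """<root>
  <items>
    <item id="id1" grocery="apple">
      <stock id="id1.N1" alt="True" alt_id="10069227" type="fruit" type_id="10067060" />
    </item>
    <item id="id2" grocery="bannana">
      <stock id="id2.N1" alt="True" alt_id="10015946" />
    </item>
  </items>
</root>"""


def item_to_row(item):
    """Merge the attributes of an <item> and its <stock> children into one dict."""
    row = {}
    for stock in item.findall("stock"):
        row.update(stock.attrib)
    # The item's attributes go in last, so its id replaces the stock id.
    row.update(item.attrib)
    return row


def rows_from_dir(directory):
    """Return one merged dict per <item> across every *.xml file in directory."""
    rows = []
    for filename in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        root = ET.parse(filename).getroot()  # ET.parse, not ET.ElementPath
        rows.extend(item_to_row(item) for item in root.iter("item"))
    return rows


# Write two copies of the sample to a temporary directory and parse them.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.xml", "b.xml"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write(SAMPLE)
    rows = rows_from_dir(tmp)

print(len(rows))
print(rows[0])
```

Passing `rows` to `pd.DataFrame` would then give one row per item. Collecting plain dicts first and building the frame once at the end avoids creating a separate DataFrame per file, which is what the loop in the update does.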
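import pandas as pd
from cytoolz.dicttoolz import merge
from cytoolz import concat
from bs4 import BeautifulSoup
from glob import glob

# Every XML file in the current directory.
lox = glob('./*xml')

def p_item(i):
    # Merge the attributes of each <stock> child with those of the <item>.
    # The item's attributes come last, so its id overrides the stock id.
    s = i.find_all('stock')
    return merge([j.attrs for j in s] + [i.attrs])

def p_soup(f):
    # Parse one file and return one merged dict per <item>.
    with open(f) as fh:
        soup = BeautifulSoup(fh, "lxml")
    return [p_item(i) for i in soup.find_all('item')]

# Flatten the per-file lists into one list of dicts; missing keys become NaN.
pd.DataFrame(list(concat([p_soup(f) for f in lox])))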
alt alt_id grocery id type type_id
0 True 10069227 apple id1 fruit 10067060
1 True 10015946 bannana id2 NaN NaN
2 True 10019211 orange id3 NaN NaN
3 False 10028810 garlic id4 NaN NaN
4 False 10020751 tomato id5 vegetable 10020756
5 False 10037087 carrot id6 vegetable 10023084
6 False 10037844 onion AR7 NaN NaN
7 True 10024570 water mellon id8 fruit 10042703
8 True 10042727 cherry id9 fruit 10042706
9 False 10034829 Apricot id10 fruit 10043525
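The NaN entries in the output above come from pandas itself: when a DataFrame is built from a list of dicts, any key missing from a given dict is filled with NaN. A small sketch of that behavior follows, using two hand-written rows copied from the sample data. The explicit `columns` list and the conversion of `alt` to booleans are optional additions for illustration, not part of the answer above.

```python
import pandas as pd

# Two merged attribute dicts, as the parsing step would produce them.
# Attribute values read from XML are strings, including "True"/"False".
rows = [
    {"id": "id1", "grocery": "apple", "alt": "True", "alt_id": "10069227",
     "type": "fruit", "type_id": "10067060"},
    {"id": "id2", "grocery": "bannana", "alt": "True", "alt_id": "10015946"},
]

# Passing columns fixes the column order; keys absent from a row become NaN.
columns = ["id", "grocery", "alt", "alt_id", "type", "type_id"]
df = pd.DataFrame(rows, columns=columns)

# Optional: turn the "True"/"False" strings into real booleans.
df["alt"] = df["alt"].map({"True": True, "False": False})

print(df)
```

Without the `columns` argument, the column order is determined by pandas and may differ between versions, which is why the output above lists the columns in a different order from the one requested in the question.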