How to mix pandas and BeautifulSoup to extract some element tags from a directory of XML files?

Question

I have a directory with several XML files. Some files have the following element tags at the bottom of the document:

<items>
    <item id="id1" grocery="apple">
      <stock id="id1.N1" alt="True" alt_id="10069227" type="fruit" type_id="10067060" />
    </item>
    <item id="id2" grocery="bannana">
      <stock id="id2.N1" alt="True" alt_id="10015946" />
    </item>
    <item id="id3" grocery="orange">
      <stock id="id3.N1" alt="True" alt_id="10019211" />
    </item>
    <item id="id4" grocery="garlic">
      <stock id="id4.N1" alt="False" alt_id="10028810" />
    </item>
    <item id="id5" grocery="tomato">
      <stock id="id5.N1" alt="False" alt_id="10020751" type="vegetable" type_id="10020756" />
    </item>
    <item id="id6" grocery="carrot">
      <stock id="id6.N1" alt="False" alt_id="10037087" type="vegetable" type_id="10023084" />
    </item>
    <item id="AR7" grocery="onion">
      <stock id="AR7.N1" alt="False" alt_id="10037844" />
    </item>
    <item id="id8" grocery="water mellon">
      <stock id="id8.N1" alt="True" alt_id="10024570" type="fruit" type_id="10042703" />
    </item>
    <item id="id9" grocery="cherry">
      <stock id="id9.N1" alt="True" alt_id="10042727" type="fruit" type_id="10042706" />
    </item>
    <item id="id10" grocery="Apricot">
      <stock id="id10.N1" alt="False" alt_id="10034829" type="fruit" type_id="10043525" />
    </item>
  </items>

How can I extract the grocery, type, type_id, alt and alt_id attributes inside the items tags, if they exist, and store them in a data frame?

id grocery alt alt_id type type_id
id1 apple True 10069227 fruit 10067060
id2 bannana True 10015946 NaN NaN
id3 orange True 10019211 NaN NaN
id4 garlic False 10028810 NaN NaN
...
id10 Apricot False 10034829 fruit 10043525

Note that for values or attributes that do not exist, I would like to add a NaN. So far I have tried:

import glob
import re
import pandas as pd
from bs4 import BeautifulSoup

data = []
for filename in glob.glob('../dir/*xml'):
    soup = BeautifulSoup(open(filename), "lxml")

    for element1 in soup(re.compile(r"items")):
        data.append({**element1.attrs, **{'filename': filename, 'type': element1.name}})

    for element2 in soup(re.compile(r"stock")):
        data.append({**element2.attrs, **{'filename': filename, 'type': element2.name}})        
    #print(element2)

df = pd.DataFrame(data)

However, it's not working. As you can see, the above code omitted some of the XML attributes I am interested in.

This is the actual output:

  filename  grocery   id  type
0   /Users/user/Downloads/test.xml  NaN   NaN   items
1   /Users/user/Downloads/test.xml  NaN   NaN   items
2   /Users/user/Downloads/test.xml  apple   id1   item
3   /Users/user/Downloads/test.xml  bannana   id2   item
4   /Users/user/Downloads/test.xml  orange  id3   item
5   /Users/user/Downloads/test.xml  garlic  id4   item
6   /Users/user/Downloads/test.xml  tomato  id5   item
7   /Users/user/Downloads/test.xml  carrot  id6   item
8   /Users/user/Downloads/test.xml  onion   AR7   item
9   /Users/user/Downloads/test.xml  water mellon  id8   item

Any idea how to get the data frame shown above?

UPDATE

After trying to adapt @piRSquared's answer to all the XML files in my directory, I tried:

for filename in glob.glob('../dir/*xml'):
    #soup = BeautifulSoup(open(filename), "lxml")
    etree = ET.ElementPath(filename)
    pd.DataFrame([obs2series(o) for o in etree.findall('item')])

However, I got:

---> 47     etree = ET.ElementPath(filename)
     48     pd.DataFrame([obs2series(o) for o in etree.findall('item')])
     49 

TypeError: 'module' object is not callable

How can I do this for a directory full of XML files?

Answer 1

import pandas as pd
from cytoolz.dicttoolz import merge
from cytoolz import concat
from bs4 import BeautifulSoup
from glob import glob

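# Collect every XML file in the current directory.
lox = glob('./*xml')

def p_item(i):
    # Merge the attributes of the <stock> children with those of the <item>;
    # the item's attributes come last, so its id wins over the stock id.
    s = i.find_all('stock')
    return merge([j.attrs for j in s] + [i.attrs])

def p_soup(f):
    # Parse one file and return one dict per <item>.
    with open(f) as fh:
        soup = BeautifulSoup(fh, "lxml")
    return [p_item(i) for i in soup.find_all('item')]

# Attributes missing from an item (e.g. type, type_id) become NaN.
pd.DataFrame(list(concat([p_soup(f) for f in lox])))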

     alt    alt_id       grocery    id       type   type_id
0   True  10069227         apple   id1      fruit  10067060
1   True  10015946       bannana   id2        NaN       NaN
2   True  10019211        orange   id3        NaN       NaN
3  False  10028810        garlic   id4        NaN       NaN
4  False  10020751        tomato   id5  vegetable  10020756
5  False  10037087        carrot   id6  vegetable  10023084
6  False  10037844         onion   AR7        NaN       NaN
7   True  10024570  water mellon   id8      fruit  10042703
8   True  10042727        cherry   id9      fruit  10042706
9  False  10034829       Apricot  id10      fruit  10043525
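
Regarding the TypeError in the update: ET.ElementPath is a module inside xml.etree.ElementTree, not a function, so calling it fails; ET.parse(filename) is the call that reads a file. A minimal sketch of the same idea using only the standard library follows. It assumes ET is xml.etree.ElementTree, and the names item_rows and rows_from_dir, as well as the extra filename key, are hypothetical choices for this illustration rather than part of the answer above.

```python
import glob
import xml.etree.ElementTree as ET


def item_rows(root, source=None):
    """Return one dict per <item>, merged with the attributes of its <stock> children."""
    rows = []
    for item in root.iter('item'):
        row = {}
        for stock in item.findall('stock'):
            row.update(stock.attrib)
        # Item attributes go last so the item's id replaces the stock id.
        row.update(item.attrib)
        if source is not None:
            row['filename'] = source  # hypothetical extra column
        rows.append(row)
    return rows


def rows_from_dir(pattern):
    """Parse every file matching the glob pattern and collect all rows."""
    rows = []
    for filename in glob.glob(pattern):
        # ET.parse is callable; ET.ElementPath is a module and is not.
        root = ET.parse(filename).getroot()
        rows.extend(item_rows(root, source=filename))
    return rows
```

Passing the result to pandas, for example pd.DataFrame(rows_from_dir('../dir/*xml')), should give one row per item, with NaN wherever an attribute such as type or type_id is absent from that item's dict.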

Solution 1
2 accepted 2017-07-18 16:28:44