Identify and replace elements of XML using BeautifulSoup in Python

I am trying to use BeautifulSoup4 to find and replace specific elements within an XML file. More specifically, I want to find all instances of 'file_name' (in the example below the file name is 'Cyp26A1_atRA_minus_tet_plus.txt') and replace each with the full path for that document, which is saved in the 'file_name_replacement_dir' variable. My first task, and the bit I'm stuck on, is to isolate the section of interest so that I can replace it using the replaceWith() method.

The XML

      <ParameterGroup name="Experiment_22">
        <Parameter name="Data is Row Oriented" type="bool" value="1"/>
        <Parameter name="Experiment Type" type="unsignedInteger" value="0"/>
        <Parameter name="File Name" type="file" value="Cyp26A1_atRA_minus_tet_plus.txt"/>
        <Parameter name="First Row" type="unsignedInteger" value="1"/>

There are actually 44 experiments with 4 different file names (so 11 with file name 1, 11 with file name 2, and so on). The above snippet of XML is therefore repeated 44 times, just with different files stored in the "File Name" line.

My code so far

from bs4 import BeautifulSoup

# Raw strings keep the backslashes in the Windows paths literal
xml_dir = r'D:\MPhil\Model_Building\Models\Retinoic_acid\[06]\RAR_Models\Model_Line_2'
xml_file_name = 'RARa_RXR_M22.cps'
xml = xml_dir + '\\' + xml_file_name
file_name_replacement_dir = r'D:\MPhil\Model_Building\Models\Retinoic_acid\[06]\RAR_Models'
soup = BeautifulSoup(open(xml))
print(soup.find_all('parametergroup name="Experiment_22"'))

This, however, returns an empty list. I've also tried a few other functions in place of 'soup.find_all()' but still haven't been able to get a handle on the file name. Does anybody know how to do what I'm trying to do?

from bs4 import BeautifulSoup
import os

xml = '<ParameterGroup name="Experiment_22">\
<Parameter name="Data is Row Oriented" type="bool" value="1"/>\
<Parameter name="Experiment Type" type="unsignedInteger" value="0"/>\
<Parameter name="File Name" type="file" value="Cyp26A1_atRA_minus_tet_plus.txt"/>\
<Parameter name="First Row" type="unsignedInteger" value="1"/>\
</ParameterGroup>'

# An HTML parser converts tag names to lower case, hence "parameter" below
soup = BeautifulSoup(xml, 'html.parser')

# Prefix the value of every "File Name" parameter with the new directory
for tag in soup.find_all("parameter", {'name': 'File Name'}):
    tag['value'] = os.path.join('new_dir', tag['value'])

print(soup)
  • Close your XML 'ParameterGroup' tag.
  • Capitalised tag names may not match in BeautifulSoup, because its HTML parsers convert tag names to lower case. Try parameter in lower case.
  • Use os.path to manipulate paths so that the code works across platforms.
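
The same loop extends to a document with several experiments. The following is a minimal sketch, assuming the built-in 'html.parser' backend (which lower-cases tag names); the second file name 'other_data.txt' and the directory 'data_dir' are made up for illustration and do not come from the question.

```python
import os
from bs4 import BeautifulSoup

# Two experiments stand in for the 44 in the real file.
# 'other_data.txt' is a hypothetical file name used only for this sketch.
xml = """
<ParameterGroup name="Experiment_22">
  <Parameter name="File Name" type="file" value="Cyp26A1_atRA_minus_tet_plus.txt"/>
  <Parameter name="First Row" type="unsignedInteger" value="1"/>
</ParameterGroup>
<ParameterGroup name="Experiment_23">
  <Parameter name="File Name" type="file" value="other_data.txt"/>
  <Parameter name="First Row" type="unsignedInteger" value="1"/>
</ParameterGroup>
"""

# Hypothetical stand-in for file_name_replacement_dir
replacement_dir = "data_dir"

# html.parser lower-cases tag names, so search for "parameter"
soup = BeautifulSoup(xml, "html.parser")

# The attribute filter goes in a dictionary, separate from the tag name
file_tags = soup.find_all("parameter", {"name": "File Name"})

# Rewrite each value in place; other parameters are left untouched
for tag in file_tags:
    tag["value"] = os.path.join(replacement_dir, tag["value"])

new_values = [tag["value"] for tag in file_tags]
print(new_values)
```

Assigning to tag['value'] changes the attribute inside the parsed tree, so no replaceWith() call is needed for this case.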

Your selector for find_all is wrong. You need to separate the tag name and the attribute, like so:

find_all("Parameter",{'name':'File Name'})

That will get you all the file name tags directly. If you really need the parent tag, pass in "ParameterGroup" instead, without the attribute dictionary.

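Whether you need to lower-case the tag names depends on the parser: BeautifulSoup's HTML parsers convert tag names to lower case, while the "xml" parser (which requires lxml) preserves the original case. You may have to experiment with that.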
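
To illustrate the difference between the two selectors, here is a small sketch, again assuming the built-in 'html.parser' backend and therefore lower-case tag names. It shows the original call matching nothing, then selects one experiment group by its name attribute and looks up the file name parameter inside it.

```python
from bs4 import BeautifulSoup

xml = """
<ParameterGroup name="Experiment_22">
  <Parameter name="Experiment Type" type="unsignedInteger" value="0"/>
  <Parameter name="File Name" type="file" value="Cyp26A1_atRA_minus_tet_plus.txt"/>
</ParameterGroup>
"""

soup = BeautifulSoup(xml, "html.parser")

# The whole string is treated as a tag name, and no tag has that name
no_match = soup.find_all('parametergroup name="Experiment_22"')

# Tag name and attribute filter passed separately
group = soup.find("parametergroup", {"name": "Experiment_22"})

# Search only inside the selected group
file_tag = group.find("parameter", {"name": "File Name"})

print(len(no_match))
print(file_tag["value"])
print(file_tag.parent["name"])
```

Going through the parent group is only needed when a single experiment should be changed; to update every experiment, searching for the "File Name" parameters directly is shorter.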
