Python BeautifulSoup, parsing XML

I would like to extract just the SQL error line:

SQL Error: failed

However, all the text from the msg tag is printed:

Test: 01
SQL Error: failed


Test: 01
SQL Error: failed

file.xml

<item>
<msg>
Test: 01
SQL Error: failed
</msg>
</item>
<item>
<msg>
Test: 01
SQL Error: failed
</msg>
</item>

Code:

import re
from BeautifulSoup import BeautifulStoneSoup

file = "file.xml"

with open(file, 'r') as f:
    fobj = f.read()
    bobj = BeautifulStoneSoup(fobj)
    pattern = re.compile('SQL Error')
    for error in bobj.findAll('msg', text=pattern):
        print error

This is how it is supposed to work: the find_all() call returns Tag instances. Even if you print error.text, you will get the complete text of the msg element:

Test: 01
SQL Error: failed

Assuming you want to extract the failed part, here is what you can do:

pattern = re.compile(r'SQL Error: (\w+)')
for error in bobj.find_all("msg", text=pattern):
    print pattern.search(error.text).group(1)

Here we use a capturing group to save one or more word characters (\w+, meaning letters, digits, or underscores) that follow the SQL Error: text.
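To see the capturing group in isolation, here is a minimal sketch that runs the same pattern against a plain string standing in for error.text. The msg_text variable and the extract_sql_error helper are names made up for this illustration; they are not part of BeautifulSoup.

```python
import re

# Same pattern as above; the parentheses capture the word after "SQL Error: ".
pattern = re.compile(r'SQL Error: (\w+)')

# Hypothetical stand-in for error.text of one <msg> element from file.xml.
msg_text = "\nTest: 01\nSQL Error: failed\n"


def extract_sql_error(text):
    """Return the captured word after 'SQL Error: ', or None if absent."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(extract_sql_error(msg_text))     # the captured word
print(extract_sql_error("Test: 02\n"))  # no SQL error in this text
```

If you want the whole line rather than just the captured word, group(0) returns the full match, "SQL Error: failed" for this input.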

Also, you should definitely upgrade to BeautifulSoup 4:

pip install beautifulsoup4

And then import it as:

from bs4 import BeautifulSoup

Using BeautifulSoup 4, you can change

print error

to

print error.get_text().strip().split("\n")[1]

error is a Tag, so you first get its string value with get_text(), then strip off the leading and trailing whitespace (here, the newlines) with strip(). split("\n") then gives you a list with one entry per line, and the value you want is the second line, so you access it with [1].
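Indexing with [1] relies on the error always being the second line of the msg text. As a sketch of a less position-dependent variant, the snippet below runs the same strip and split steps on a plain string standing in for error.get_text(), then picks the line by its prefix instead. The names msg_text and sql_error_line are hypothetical, chosen only for this example.

```python
# Hypothetical stand-in for error.get_text() of one <msg> element.
msg_text = "\nTest: 01\nSQL Error: failed\n"

# The steps described above: strip surrounding whitespace, split into lines.
lines = msg_text.strip().split("\n")
print(lines[1])


def sql_error_line(text, prefix="SQL Error"):
    """Return the first line starting with prefix, or None if there is none."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line
    return None


print(sql_error_line(msg_text))
```

The prefix search keeps working if extra lines appear before the error, and it returns None instead of raising IndexError when a msg element has no SQL error line.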
