
Python: get substrings between tags

I'm trying to get data from a pipe in Python. The data is structured like this:

<item><type> data </type><code> data </code><length> data </length><data encoding="base64"> data </data></item>

How do I get the data between these tags? I've already written a Base64 decoder.

One way is to use the lxml package and parse the raw data as HTML:

from lxml import html

raw_data = '<item><type> data </type><code> data </code><length> data </length><data encoding="base64"> data </data></item>'
html_data = html.fromstring(raw_data)

data = html_data.xpath('//text()')

# data = [' data ', ' data ', ' data ', ' data ']
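
If each item really is well-formed XML, the standard library may be enough. The following is a minimal sketch, not part of the original answer: it assumes the attribute uses straight quotes, that one complete <item> arrives per string, and that the payload is valid Base64. The name parse_item and the sample values are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET


def parse_item(raw):
    """Map each child tag of a single <item> element to its stripped text."""
    root = ET.fromstring(raw)
    fields = {child.tag: (child.text or "").strip() for child in root}
    # Decode the payload only when the element declares base64 encoding.
    data_elem = root.find("data")
    if data_elem is not None and data_elem.get("encoding") == "base64":
        fields["data"] = base64.b64decode(fields["data"])
    return fields


# Hypothetical sample values; the original post only shows placeholders.
sample = (
    '<item><type> text </type><code> greet </code><length> 5 </length>'
    '<data encoding="base64"> aGVsbG8= </data></item>'
)
fields = parse_item(sample)
print(fields)
```

Unlike the xpath('//text()') call, this keeps the tag names, so each value can be looked up by field.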

That may be a bit overkill. Another way is to use a regular expression.

The pattern is copied from https://kevin.deldycke.com/2008/07/python-ultimate-regular-expression-to-catch-html-tags/

import re

def get_data(text):
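    # Raw string so the regex escapes reach re unchanged; matches any opening or closing tag.
    pattern = r"(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"
    # Strip every tag, then split the remaining text on whitespace.
    return re.sub(pattern, '', text).split()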

text = '<item><type> data </type><code> data </code><length> data </length><data encoding="base64"> data </data></item>'
print(get_data(text))

# ['data', 'data', 'data', 'data']
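
Stripping the tags loses the field names, and split() would also break up any value that contains spaces. A possible variation, sketched here under the assumption that the fields are leaf elements with no nested tags, captures each tag name together with its text. The names LEAF and get_fields are hypothetical and do not come from the linked article.

```python
import re

# A leaf element: an opening tag (attributes allowed), text that contains
# no nested tags, and the closing tag with the same name.
LEAF = re.compile(r"<(\w+)[^>]*>([^<]*)</\1>")


def get_fields(text):
    """Return a dict mapping each leaf tag name to its stripped text."""
    return {tag: value.strip() for tag, value in LEAF.findall(text)}


text = '<item><type> data </type><code> data </code><length> data </length><data encoding="base64"> data </data></item>'
fields = get_fields(text)
print(fields)

# {'type': 'data', 'code': 'data', 'length': 'data', 'data': 'data'}
```

The outer <item> is skipped because its content contains other tags, so only the four inner fields are returned.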
