How can I extract certain information from a string in Python?

I am trying to use Python to extract certain information from HTML code. For example:

<a href="#tips">Visit the Useful Tips Section</a> 
and I would like to get result : Visit the Useful Tips Section

<div id="menu" style="background-color:#FFD700;height:200px;width:100px;float:left;">
<b>Menu</b><br />
HTML<br />
CSS<br />
and I would like to get Menu HTML CSS

In other words, I wish to get all the text that sits between the tags. I am trying to write a Python function that takes the HTML code as a string and then extracts the information from it. I am stuck at string.split('<').

You should use a proper HTML parsing library, such as the HTMLParser module.
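As a rough illustration of that suggestion, here is a minimal sketch using the standard library parser. It assumes Python 3, where the module is named html.parser; the class name TextExtractor and the function extract_text are hypothetical names chosen for this example, not part of the library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    # Feed the whole string to the parser and join the collected text.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


link_html = '<a href="#tips">Visit the Useful Tips Section</a>'
menu_html = (
    '<div id="menu" style="background-color:#FFD700;height:200px;'
    'width:100px;float:left;">\n'
    '<b>Menu</b><br />\n'
    'HTML<br />\n'
    'CSS<br />\n'
)

print(extract_text(link_html))  # Visit the Useful Tips Section
print(extract_text(menu_html))  # Menu HTML CSS
```

Joining the chunks with a single space is a choice made here to match the output asked for in the question; a different separator may suit other inputs better.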

import re

string = '<a href="#tips">Visit the Useful Tips Section</a>'
re.findall(r'<[^>]*>(.*)<[^>]*>', string)  # returns ['Visit the Useful Tips Section']

You can use the lxml HTML parser.

>>> import lxml.html as lh
>>> st = ''' load your above html content into a string '''
>>> d = lh.fromstring(st)
>>> d.text_content()

'Visit the Useful Tips Section \nand I would like to get result : Visit the Useful Tips Section\n\n\nMenu\nHTML\nCSS\nand I would like to get Menu HTML CSS\n'

Or you can do:

>>> for content in d.text_content().split("\n"):
...     if content:
...         print(content)
...
Visit the Useful Tips Section
and I would like to get result : Visit the Useful Tips Section
Menu
HTML
CSS
and I would like to get Menu HTML CSS
>>>

I understand you are trying to strip out the HTML tags and keep only the text.

You can define a regular expression that matches the tags, then substitute every match with the empty string.

Example:

import re

def remove_html_tags(data):
    # Non-greedy match of everything from a '<' to the next '>'
    p = re.compile(r'<.*?>')
    return p.sub('', data)
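To show how this kind of substitution behaves on the menu snippet from the question, here is a small sketch. The function name strip_tags is hypothetical, and it differs from the example above in one respect: each tag is replaced with a space instead of the empty string, so that words separated only by a tag do not run together.

```python
import re

# Same non-greedy tag pattern as in the example above.
TAG_RE = re.compile(r'<.*?>')


def strip_tags(html):
    # Replace each tag with a space, then collapse all runs of whitespace.
    without_tags = TAG_RE.sub(' ', html)
    return ' '.join(without_tags.split())


menu_html = (
    '<div id="menu" style="background-color:#FFD700;height:200px;'
    'width:100px;float:left;">\n'
    '<b>Menu</b><br />\n'
    'HTML<br />\n'
    'CSS<br />\n'
)

print(strip_tags(menu_html))  # Menu HTML CSS
```

This pattern is only a rough approximation of HTML: a '>' inside an attribute value, or a tag that spans several lines, will not be handled correctly, which is one reason a real parser is usually the safer choice.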

References:

Example

Docs about Python regular expressions

I would use BeautifulSoup; it copes much better with malformed HTML.
