Parsing HTML to get text inside an element
I need to get the text inside the two elements into a string:
source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""
>>> text
'Martin Elias'
How could I achieve this?
I searched "python parse html" and this was the first result: https://docs.python.org/2/library/htmlparser.html
This code is taken from the Python docs:
from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag

    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag

    def handle_data(self, data):
        print "Encountered some data :", data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
Here is the result:
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
Using this, and by looking at the code in HTMLParser, I came up with this:
class myhtmlparser(HTMLParser):
    def __init__(self):
        # Let the base class set itself up (it calls reset() internally)
        HTMLParser.__init__(self)
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

    def handle_starttag(self, tag, attrs):
        self.NEWTAGS.append(tag)
        self.NEWATTRS.append(attrs)

    def handle_data(self, data):
        self.HTMLDATA.append(data)

    def clean(self):
        # Drop everything collected so far
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []
You can use it like this:
from HTMLParser import HTMLParser

pstring = source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

class myhtmlparser(HTMLParser):
    def __init__(self):
        # Let the base class set itself up (it calls reset() internally)
        HTMLParser.__init__(self)
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

    def handle_starttag(self, tag, attrs):
        self.NEWTAGS.append(tag)
        self.NEWATTRS.append(attrs)

    def handle_data(self, data):
        self.HTMLDATA.append(data)

    def clean(self):
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

parser = myhtmlparser()
parser.feed(pstring)

# Extract data from parser
tags = parser.NEWTAGS
attrs = parser.NEWATTRS
data = parser.HTMLDATA

# Clean the parser
parser.clean()

# Print out our data
print tags
print attrs
print data
Now you should be able to extract your data from those lists easily. I hope this helped!
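For the exact case in the question, the same handler idea can be narrowed to a single element. The following is only a minimal sketch, assuming Python 3 (where the module is `html.parser` rather than `HTMLParser`); the class name `UserNameParser` and its `depth` and `chunks` attributes are made up for illustration and are not part of the library.

```python
from html.parser import HTMLParser


class UserNameParser(HTMLParser):
    """Collect text found inside <span class="UserName"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while we are inside the target span
        self.chunks = []  # text pieces seen inside the target span

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Count nested spans so the matching </span> closes the target
            if tag == "span":
                self.depth += 1
        elif tag == "span" and ("class", "UserName") in attrs:
            # Exact match on the class attribute value
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "span":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""
parser = UserNameParser()
parser.feed(source_code)
parser.close()
text = "".join(parser.chunks)
print(text)
```

Note that the class check is an exact string comparison, so an element with several classes (for example `class="UserName big"`) would need a looser test.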
I recommend using the Python Beautiful Soup 4 library.
pip install beautifulsoup4
It makes HTML parsing really easy.
from bs4 import BeautifulSoup
source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(source_code, "html.parser")
print soup.a.string
# prints: Martin Elias
Install beautifulsoup and you can do it like this:
from BeautifulSoup import BeautifulSoup
source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""
soup = BeautifulSoup(source_code)
print soup.find('span', {'class': 'UserName'}).text
You can also try using html5lib and XPath. There is a good question about it here, and its answer has an important detail (namespaceHTMLElements) to remember to make html5lib behave as expected. I wasted so much time trying to get it to work because I overlooked that I needed to change that.
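To show what that detail changes, here is a rough sketch, assuming html5lib is installed (`pip install html5lib`) and Python 3. It uses html5lib's ElementTree tree builder and ElementTree's limited XPath subset rather than lxml, so it is not necessarily what the linked answer does.

```python
import html5lib

source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

# namespaceHTMLElements=False keeps tag names plain ("span", "a") instead of
# "{http://www.w3.org/1999/xhtml}span", so simple path expressions match.
root = html5lib.parse(source_code, treebuilder="etree",
                      namespaceHTMLElements=False)

# ElementTree only supports a limited subset of XPath, which is enough here
link = root.find(".//span[@class='UserName']/a")
text = link.text if link is not None else None
print(text)
```

With the default `namespaceHTMLElements=True`, the same path expression finds nothing, because every tag name carries the XHTML namespace prefix.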