Extract HTML tags and data from text

I'm using Python 2.7 to make a simple call to a website and extract the HTML data, which I've managed with the code below.

import requests
from HTMLParser import HTMLParser

name = "Mark"
surname = "Jacobs"

def req_getPageHTML(name, surname):
    # Build the query URL and return the response body as unicode text.
    url = "http://sample.com/page.aspx?Name=" + name + "&surname=" + surname
    response = requests.get(url).text
    return response

page_code = req_getPageHTML(name, surname)

htmlp = HTMLParser()

print htmlp.feed(page_code)

The next thing I want to do is parse this Unicode response ( print type(page_code) reports unicode ) and extract some information from it.

Specifically, from the sample HTML below, I want to extract the values: the numbers on the lines that are indented further and prefixed with a > . The > does not exist in the actual HTML; I added it only so you can spot those lines easily.

...
<tr class="tr1" OnClick="lockBac();">
    <td class="tdB" rowspan="2" nowrap="nowrap">1</td>
    <td class="tdB" rowspan="2" nowrap="nowrap">Jacobs D <br/>Mark</td>
    <td class="tdB" rowspan="2" align="Center">Math speciality</td>
    <td class="tdB" rowspan="2" align="Center">Advanced User</td>
        >   <td class="tdB" rowspan="2" align="Center">6.95</td>
        >   <td class="tdB" rowspan="2" align="Center">7.9</td>
        >   <td class="tdB" rowspan="2" align="Center">7.9</td>
    <td class="tdB" colspan="4" align="Center"></td>
    <td class="tdB" rowspan="2" align="Center">English</td>
    <td class="tdB" rowspan="2" align="Center">B2-B2-B2-B2-B2</td>
    <td class="tdB" colspan="3" align="Center">Mathematics MATH-INFO</td>
    <td class="tdB" colspan="3" align="Center">Informatics</td>
    <td bgcolor="lightgreen" class="tdB" rowspan="2" align="Center"></td>
    <td class="tdB" rowspan="2" align="Center">8.88</td>
    <td class="tdB" rowspan="2" align="Center">Success</td>
</tr>
<tr class="tr1" OnClick="lockBac();">
    <td class="tdB"></td>
    <td class="tdB"></td>
    <td class="tdB"></td>
    <td class="tdB"></td>
        >    <td class="tdB">9.35</td>
        >    <td class="tdB"></td>
        >    <td class="tdB">9.35</td>
        >    <td class="tdB">9.4</td>
    <td class="tdB"></td>
        >    <td class="tdB">9.4</td>
</tr>
...

These numbers are exam scores, which I will later store in a database.

I'm looking for an efficient way to extract these numbers; I would prefer to keep manual text parsing (searching for each element with SUBSTR and so on) as a last resort.

I did come across HTMLParser, which as you can see is also imported in my code, but the final print outputs None .

I was under the impression that I could use this library to parse the text received from response, and that there would be an easy way to specify a tag ID (or something similar) and extract the relevant information from it (as shown in the HTMLParser examples section), but I can't get the information I want from the feed method.

Maybe I'm not understanding this correctly, or maybe I'm not using the appropriate tool, which is why I also explained my goal.

I would appreciate any help in correcting me or pointing me in the right direction.

I'm not sure how to make what you have tried work, but I have a different method.

You can grab lxml , a Python library that helps with parsing and scraping XML and HTML. Requests, which you are already using, fits well with it.

import requests
from lxml import html

page = requests.get('http://www.example.com')
tree = html.fromstring(page.text)

The tree variable now contains the whole HTML document, which you can query however you wish. With XPath, it would look something like:

scores = tree.xpath('//td[@class="tdB"]/text()')
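That XPath returns the text of every td with class tdB, so names and subjects come back along with the numbers. One way to keep only the numeric cells is sketched below. The helper name to_scores and the sample list are made up for illustration; the list is typed by hand to resemble what the first sample row might yield, not captured from a real run.

```python
def to_scores(cell_texts):
    """Return floats for the cell texts that parse as numbers."""
    scores = []
    for text in cell_texts:
        try:
            scores.append(float(text.strip()))
        except ValueError:
            # Not a number (a name, a subject, "B2-B2-..."), so skip it.
            continue
    return scores


# Hand-written stand-in for the strings the XPath might return (assumption).
sample = [
    "1", "Jacobs D ", "Mark", "Math speciality", "Advanced User",
    "6.95", "7.9", "7.9", "English", "B2-B2-B2-B2-B2",
    "Mathematics MATH-INFO", "Informatics", "8.88", "Success",
]

print(to_scores(sample))  # [1.0, 6.95, 7.9, 7.9, 8.88]
```

Note that the row number (1) and the 8.88 cell also pass the filter, so picking out the exact exam scores would probably still need selection by column position.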

Hope that helps. 希望能有所帮助。
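As for the HTMLParser attempt in the question: feed() always returns None, which is why the print shows None. The parser reports what it finds by calling handler methods, so the usual pattern is to subclass it and collect data there. Below is a minimal sketch of that pattern. The class name CellCollector and the shortened sample_html are made up for illustration, and the import fallback is only there so the same code runs on Python 2.7 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as in the question


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, one string per cell."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")  # start a new, possibly empty, cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


# Shortened, hand-written version of the sample rows (assumption).
sample_html = (
    '<tr class="tr1">'
    '<td class="tdB">Jacobs D <br/>Mark</td>'
    '<td class="tdB" align="Center">6.95</td>'
    '<td class="tdB"></td>'
    '<td class="tdB">9.35</td>'
    '</tr>'
)

collector = CellCollector()
collector.feed(sample_html)  # feed() itself returns None
collector.close()
print(collector.cells)  # ['Jacobs D Mark', '6.95', '', '9.35']
```

Unlike the XPath text() query, this keeps empty cells as empty strings, so column positions stay aligned from row to row.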
