
HTML scraping using lxml

I'm scraping data using lxml.

This is the HTML of a single post, as shown by Inspect Element:

<article id="post-4855" class="post-4855 post type-post status-publish format-standard hentry category-uncategorized">


<header class="entry-header">
    <h1 class="entry-title"><a href="http://aitplacements.com/uncategorized/cybage/" rel="bookmark">Cybage..</a></h1>
            <div class="entry-meta">
        <span class="byline"> Posted by <span class="author vcard"><a class="url fn n" href="http://aitplacements.com/author/tpoait/">TPO</a></span></span><span class="posted-on"> on <a href="http://aitplacements.com/uncategorized/cybage/" rel="bookmark"><time class="entry-date published updated" datetime="2017-09-13T11:02:32+00:00">September 13, 2017</time></a></span><span class="comments-link"> with <a href="http://aitplacements.com/uncategorized/cybage/#respond">0 Comment</a></span>      </div><!-- .entry-meta -->
        </header><!-- .entry-header -->

<div class="entry-content">
    <p>cybage placement details shared <a href="http://aitplacements.com/uncategorized/cybage/" class="read-more">READ MORE</a></p>
        </div><!-- .entry-content -->

For every such post, I want to extract the title, the content of the post, and the time it was posted.

For the post above, the details would be:

{title: "Cybage..",
 post: "cybage placement details shared",
 datetime: "2017-09-13T11:02:32+00:00"
}

What I have achieved so far: the website requires a login, and I have managed to log in successfully.

To extract the information, I use:

import requests
from lxml import html

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) Chrome/42.0.2311.90'}
url = 'http://aitplacements.com/news/'
page = requests.get(url, headers=headers)
doc = html.fromstring(page.content)
# print(doc)  # prints <Element html at 0x7f59c38d2260>
raw_title = doc.xpath('//h1[@class="entry-title"]/a/@href/text()')
print(raw_title)

raw_title comes back as an empty list, [].

What am I doing wrong?

@href refers to the value of the href attribute. An attribute node has no child text nodes, so appending /text() to it selects nothing:

In [14]: doc.xpath('//h1[@class="entry-title"]/a/@href')
Out[14]: ['http://aitplacements.com/uncategorized/cybage/']

You want the text of the <a> element instead:

In [16]: doc.xpath('//h1[@class="entry-title"]/a/text()')
Out[16]: ['Cybage..']

Therefore, use:

raw_title = doc.xpath('//h1[@class="entry-title"]/a/text()')
if len(raw_title) > 0:
    raw_title = raw_title[0]
else:
    # handle the case of missing title
    raise ValueError('Missing title')
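
The same idea can be extended to the other two fields the question asks for. The following is a minimal sketch, not a definitive implementation: it assumes every post is wrapped in its own <article> element as in the snippet above, and the helper names first and extract_posts are made up for illustration. The sample markup is a trimmed copy of the post from the question so the example runs without network access or a login; with the real site you would pass page.content to html.fromstring instead.

```python
from lxml import html

# Trimmed copy of the post markup from the question, wrapped in <html><body>.
SAMPLE = """
<html><body>
<article id="post-4855" class="post-4855 post type-post status-publish format-standard hentry category-uncategorized">
<header class="entry-header">
    <h1 class="entry-title"><a href="http://aitplacements.com/uncategorized/cybage/" rel="bookmark">Cybage..</a></h1>
    <div class="entry-meta">
        <span class="posted-on"> on <a href="http://aitplacements.com/uncategorized/cybage/" rel="bookmark"><time class="entry-date published updated" datetime="2017-09-13T11:02:32+00:00">September 13, 2017</time></a></span>
    </div>
</header>
<div class="entry-content">
    <p>cybage placement details shared <a href="http://aitplacements.com/uncategorized/cybage/" class="read-more">READ MORE</a></p>
</div>
</article>
</body></html>
"""


def first(values, default=None):
    """Return the first XPath result, or a default when nothing matched."""
    return values[0] if values else default


def extract_posts(doc):
    """Yield one dict per <article>, using XPath relative to each article."""
    for article in doc.xpath('//article'):
        title = first(article.xpath('.//h1[@class="entry-title"]/a/text()'))
        # p/text() selects only text directly inside <p>, skipping the "READ MORE" link.
        body = first(article.xpath('.//div[@class="entry-content"]/p/text()'), '')
        # contains() is used because the <time> class attribute lists several class names.
        when = first(article.xpath('.//time[contains(@class, "entry-date")]/@datetime'))
        yield {'title': title, 'post': body.strip(), 'datetime': when}


doc = html.fromstring(SAMPLE)
posts = list(extract_posts(doc))
print(posts)
```

Note the leading .// in each expression: it keeps the search inside the current article, so the title, text, and datetime of different posts cannot get mixed up. A missing field comes back as None (or an empty string for the post text) rather than raising, which you may want to tighten depending on how strict the scraper should be.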
