
Web Scraping with Python Requests/lxml: Getting data from ul/li

I'm pretty new to this, and I haven't been able to find anything on Google about this question.

I'm using requests and lxml with Python. I've seen that there are a lot of different modules for web scraping, but is there any reason to choose one over another? Can you do the same things with requests/lxml as you can with, for example, BeautifulSoup?

Anyway, here's my actual question:

This is my code:

import requests
from lxml import html

# Login data
inputUrl = 'http://forum.mytestsite.com/login'
usr = 'myusername'
pwd = 'mypassword'
payload = dict(login=usr, password=pwd)

# Open session
with requests.Session() as s:
    # Login
    s.post(inputUrl, data=payload)

    # Get page data
    pageResult = s.get('http://forum.mytestsite.com/icons/', allow_redirects=False)
    pageResult = html.fromstring(pageResult.content)
    pageIcons = pageResult.xpath('//script[@id="table-icons"]/text()')
    print(pageIcons[0])

The result when printing pageIcons[0]:

<ul id="icons">
{{#each icons}}
   <li data-handle="{{handle}}">
     <img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
   </li>
{{/each}}
</ul>


This is the website/JS code that generates the icons:

<script id="table-icons" type="text/x-handlebars-template">
  <ul id="icons">
    {{#each icons}}
       <li data-handle="{{handle}}">
         <img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
       </li>
    {{/each}}
  </ul>
</script>

And here's the result on the page:

<ul id="icons">
    <li data-handle="558FSTBI" class="">
        <img src="http://testsite.com/icons/558FSTBI.1.png" alt="Icon 1" title="Icon 1">
    </li>
    <li data-handle="310AYTZI">
        <img src="http://testsite.com/icons/310AYTZI.1.png" alt="Icon 2" title="Icon 2">
    </li>
    <li data-handle="669PQXBI" class="">
        <img src="http://testsite.com/icons/669PQXBI.1.png" alt="Icon 3" title="Icon 3">
    </li>
</ul>



My goal:
I would like to retrieve the data-handle of every li, along with each icon's path and title, but I haven't been able to figure out how to get at this data. Could anyone help me out here? I'd really appreciate any help :)

You aren't parsing the li or ul elements. Your XPath selects the text of the script template instead.

Start with this XPath:

//ul[@id='icons']/li/img

From those elements, you can extract the individual pieces of information: src and title from each img, and data-handle from its parent li.
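As a minimal sketch of that extraction, assuming you already have the rendered markup as a string (here it is just copied from the question), something like this should work. The function name extract_icons and the dict keys are my own choices for illustration, not anything the site or lxml defines.

```python
from lxml import html

# Rendered markup copied from the question (what the browser shows after
# the Handlebars template has been filled in).
RENDERED = """
<ul id="icons">
    <li data-handle="558FSTBI" class="">
        <img src="http://testsite.com/icons/558FSTBI.1.png" alt="Icon 1" title="Icon 1">
    </li>
    <li data-handle="310AYTZI">
        <img src="http://testsite.com/icons/310AYTZI.1.png" alt="Icon 2" title="Icon 2">
    </li>
    <li data-handle="669PQXBI" class="">
        <img src="http://testsite.com/icons/669PQXBI.1.png" alt="Icon 3" title="Icon 3">
    </li>
</ul>
"""


def extract_icons(markup):
    """Return one dict per <img> found under ul#icons (hypothetical helper)."""
    tree = html.fromstring(markup)
    icons = []
    for img in tree.xpath("//ul[@id='icons']/li/img"):
        icons.append({
            # data-handle lives on the parent <li>, not on the <img>
            "handle": img.getparent().get("data-handle"),
            "src": img.get("src"),
            "title": img.get("title"),
        })
    return icons


icons = extract_icons(RENDERED)
for icon in icons:
    print(icon["handle"], icon["src"], icon["title"])
```

Note that this only works on the rendered HTML, not on the raw page source that requests gives you.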

Regarding the first question, BeautifulSoup can optionally use lxml as its parser. If you are comfortable with XPath and don't think you need BeautifulSoup, don't worry about it.
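For comparison, here is a rough sketch of the same kind of lookup with BeautifulSoup using lxml as the parser. It assumes the bs4 and lxml packages are installed; the markup is a shortened copy of the rendered list from the question, and the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the rendered list from the question.
RENDERED = """
<ul id="icons">
    <li data-handle="558FSTBI" class="">
        <img src="http://testsite.com/icons/558FSTBI.1.png" alt="Icon 1" title="Icon 1">
    </li>
    <li data-handle="310AYTZI">
        <img src="http://testsite.com/icons/310AYTZI.1.png" alt="Icon 2" title="Icon 2">
    </li>
</ul>
"""

# Passing "lxml" tells BeautifulSoup to use lxml as the underlying parser.
soup = BeautifulSoup(RENDERED, "lxml")

# A CSS selector here plays the role of the XPath expression above.
pairs = [(img["src"], img["title"]) for img in soup.select("ul#icons > li > img")]
for src, title in pairs:
    print(src, title)
```

The two approaches reach the same elements. The difference is mostly whether you prefer XPath or CSS selectors and BeautifulSoup's navigation API.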

However, since JavaScript generates the list on the page, you need a headless browser rather than the requests library.
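To illustrate why, the sketch below first parses the raw source (the template from the question, wrapped in a minimal html/body that I added) and shows that the ul/li XPath finds nothing there, because the template is only text inside the script tag. It then outlines one possible way to get the rendered HTML with Selenium and headless Chrome. The function fetch_rendered_html, the timeout value, and the choice of Chrome are assumptions on my part. It also does not handle the login step from the question, so treat it as a starting point only.

```python
from lxml import html

# Raw source as requests would receive it: the list exists only as a
# Handlebars template inside a <script> tag.
RAW_SOURCE = """
<html><body>
<script id="table-icons" type="text/x-handlebars-template">
  <ul id="icons">
    {{#each icons}}
       <li data-handle="{{handle}}">
         <img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
       </li>
    {{/each}}
  </ul>
</script>
</body></html>
"""

tree = html.fromstring(RAW_SOURCE)
# The parser keeps script content as plain text, so no <li> elements exist yet.
li_in_raw_source = tree.xpath("//ul[@id='icons']/li")
template_text = tree.xpath('//script[@id="table-icons"]/text()')[0]
print(len(li_in_raw_source))


def fetch_rendered_html(url, timeout=10):
    """Hypothetical helper: load url in headless Chrome and return the rendered HTML.

    Assumes Selenium and a Chrome installation are available; imports are
    done lazily so the rest of this snippet runs without them.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the template has been rendered into real <li> elements.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "ul#icons li"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

The string returned by such a function could then be passed to html.fromstring and queried with the //ul[@id='icons']/li/img XPath.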

Get page generated with Javascript in Python

Reading dynamically generated web pages using python
