How to collect this data from a div using Selenium and Python

Question

I have been using Selenium and Python to scrape a webpage, and I am having difficulty collecting the data I want from a div that has the following structure:

<div class="col span_6" style="margin-left: 12px;width: 47% !important;">
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Heading1</span>
    <span class="MainGridcolumn2">Text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Another heading</span>
    <span class="MainGridcolumn2">More text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Next heading</span>
    <span class="MainGridcolumn2">Even more text</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Yet another heading</span>
    <span class="MainGridcolumn2">Piece of text</span>
  </div>
</div>

The div has a number of rows, each with two columns containing the data/text inside span tags. There are no CSS ids.

I'm only interested in collecting the text contained within the spans of class 'MainGridcolumn2'.

I've tried the code below to navigate to the first heading, intending to then use 'following_sibling' to move down to the next span tag containing the text, but I can't even get this to work: it doesn't return any text when I try to print it to the console:

driver.find_element_by_xpath("//span['@class=MainGridcolumn1'][contains(text(), 'Heading1')]").text

and

driver.find_element_by_xpath("//span[contains(text(), 'Heading1')]").text

Answer 1

One way would be to get the enclosing div, i.e. the grandparent, and pull the spans from that:

h = """<div class="col span_6" style="margin-left: 12px;width: 47% !important;">
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Heading1</span>
    <span class="MainGridcolumn2">Text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Another heading</span>
    <span class="MainGridcolumn2">More text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Next heading</span>
    <span class="MainGridcolumn2">Even more text</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Yet another heading</span>
    <span class="MainGridcolumn2">Piece of text</span>
  </div>
</div>

  <div class="MainGridRow">
    <span class="MainGridcolumn1">Yet another heading</span>
    <span class="MainGridcolumn2">Piece of text I don't want</span>
  </div>"""

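from lxml import html

# Parse the sample markup into an element tree.
xm = html.fromstring(h)
# Find the 'Heading1' span, then step up twice: span -> row div -> enclosing div.
div = xm.xpath("//span[@class='MainGridcolumn1'][contains(text(), 'Heading1')]/../..")[0]
# Collect the text of every second-column span inside that enclosing div.
print(div.xpath(".//span[@class='MainGridcolumn2']/text()"))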

Which would give you:

['Text that I want', 'More text that I want', 'Even more text', 'Piece of text']
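In a Selenium session the markup would come from the live page rather than from a string literal. A minimal sketch of that hand-off, assuming an existing `driver` whose `page_source` is passed in; the function name `column2_texts` and the `sample_source` stand-in are hypothetical and only there so the example runs without a browser:

```python
from lxml import html


def column2_texts(page_source):
    """Return the second-column texts from the div that contains 'Heading1'."""
    tree = html.fromstring(page_source)
    # Climb from the 'Heading1' span to its row, then to the enclosing div.
    containers = tree.xpath(
        "//span[@class='MainGridcolumn1'][contains(text(), 'Heading1')]/../.."
    )
    if not containers:
        return []  # heading not present on this page
    return containers[0].xpath(".//span[@class='MainGridcolumn2']/text()")


# Stand-in for driver.page_source (assumption: the real page has this shape).
sample_source = """
<html><body>
<div class="col span_6">
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Heading1</span>
    <span class="MainGridcolumn2">Text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Another heading</span>
    <span class="MainGridcolumn2">More text that I want</span>
  </div>
</div>
<div class="MainGridRow">
  <span class="MainGridcolumn1">Unrelated heading</span>
  <span class="MainGridcolumn2">Text from another block</span>
</div>
</body></html>
"""

print(column2_texts(sample_source))
```

With a real browser session the call would be `column2_texts(driver.page_source)`.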

You could also just select the parent and get the parent's siblings:

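from lxml import html

xm = html.fromstring(h)
# Step up once from the 'Heading1' span to its row div.
div = xm.xpath("//span[@class='MainGridcolumn1'][contains(text(), 'Heading1')]/..")[0]
# Take the second-column text from this row and from each following sibling row.
print(div.xpath(".//span[@class='MainGridcolumn2']/text() | ./following-sibling::div/span[@class='MainGridcolumn2']/text()"))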
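The question mentions stepping from a heading to the span next to it. A rough sketch of that idea with the XPath `following-sibling` axis, in case only the text for one particular heading is needed; the helper name `text_for_heading` and the `rows_html` sample are hypothetical, and the heading is passed as an XPath variable so it does not need quoting by hand:

```python
from lxml import html


def text_for_heading(tree, heading):
    """Return the second-column text(s) in rows whose heading contains `heading`."""
    return tree.xpath(
        "//span[@class='MainGridcolumn1'][contains(text(), $heading)]"
        "/following-sibling::span[@class='MainGridcolumn2']/text()",
        heading=heading,  # bound to $heading inside the expression
    )


# Sample markup shaped like the rows in the question (assumption).
rows_html = """
<div class="col span_6">
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Heading1</span>
    <span class="MainGridcolumn2">Text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Next heading</span>
    <span class="MainGridcolumn2">Even more text</span>
  </div>
</div>
"""

tree = html.fromstring(rows_html)
print(text_for_heading(tree, "Heading1"))
print(text_for_heading(tree, "Next heading"))
```

In Selenium itself an XPath has to select elements rather than text nodes, so there the trailing `/text()` would be dropped and `.text` read from the element that comes back.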


Answer 1
0 Accepted 2016-07-03 21:50:52
