
How to extract text data from multiple tags using response.xpath in Scrapy?

I am stuck on one problem. I want to extract the text from the following HTML using XPath in Scrapy.

<div class="block fix-text job-description">
   <p>We’re looking for an experienced <strong>Events Manager</strong> to develop and deliver our events and exhibitions programme, available to start as soon as possible. You’ll be leading a team of two to create and implement an events strategy that supports our corporate objectives. You’ll be working closely with our campaigns, marketing and projects teams to make sure we connect with our audiences and achieve event objectives.</p>
   <p>In this role, you’ll be working within a dynamic team in a fast-paced environment, with the potential opportunity to be part of the recruitment process to build your own team. Your experience as an events manager will have a strong marketing or digital marketing focus, ideally within a regulatory or third sector context.</p>
   <p>You’ll be managing high profile events across our diverse organisation, from workshops and online webinars to our national flagship conference. It’s an exciting role with the opportunity to help shape our current digital transformation and strengthen our brand, so we’re looking for creativity and innovation. You’ll also be working with senior colleagues and stakeholders, for whom you’ll prepare detailed briefings. In addition, you:</p>
   <ul>
      <li>Can demonstrate your extensive experience of creating and managing high profile events and conferences</li>
      <li>Have experience in delivering complex events programmes integrated into campaigns and marketing communications</li>
      <li>Have experience of audience research and insight</li>
      <li>Have excellent budget management and negotiation skills</li>
      <li>Are an outstanding communicator, both verbal and written</li>
      <li>Have strong people management skills with the ability to motivate and develop a team remotely</li>
   </ul>
   <p>This role is the opportunity to work within one of the largest healthcare regulators within the UK, shaping change within healthcare. As part of your salary and benefits package, you’ll receive:</p>
   <ul>
      <li>A good pension (15% employer contribution)</li>
      <li>25 days’ holiday a year (option to buy &amp; sell)</li>
      <li>Private Medical Insurance (PMI) &amp; Health screens</li>
      <li>Interest free ticket loans</li>
      <li>Exclusive discounts</li>
      <li>Employee assistance programme</li>
      <li>Childcare vouchers</li>
      <li>Cycle to work scheme</li>
      <li>Flexi-working</li>
      <li>The option to work from home up to 2 days a week.</li>
   </ul>
   <p>The General Medical Council (GMC) helps to protect patients and improve medical education and practice in the UK by setting standards for medical students and doctors. We support them in achieving (and exceeding) those standards and take action when they’re not met.</p>
   <p>A registered charity, we value diversity and inclusion because our differences make us stronger. So, our processes are fair, objective, transparent and free from discrimination.</p>
   <p><strong>Employment status: 12-month Fixed Term Contract</strong></p>
   <p><strong>Closing date: Midnight on Sunday 1st July 2018, late applications will not be accepted.</strong></p>
   <p><strong>Assessment date: Interviews &amp; Assessments will take place on Wednesday 11th July 2018</strong></p>
</div>

How can I extract the text from the HTML above? I tried the following XPath expressions:

  1. '//*[@class="job-description"]'
  2. //[@id="main"]/div/div/div[1]/div[1]/div/div[2]/div[2]//text()
  3. //[@id="main"]//div[@class="job-description"]/'
  4. //div[@class="job-description"]/p/text()
  5. '//div[@class="job-description"]/following-sibling::node()/descendant-or-self::text()'
  6. '//div[@class="job-description"]/p/descendant-or-self::text()'

None of them returned any output. Can anyone tell me how to scrape this information? The div contains multiple <p> and <ul> tags, so I am confused about how to get the text out of all of them.

Thanks in advance.

It's not really clear what you want, but it sounds like you want an XPath query that gives you all the text nodes. You can do that like this:

/descendant::text()

I resolved this problem with the help of the answer above.

I used the following XPath: //*[contains(@class,"job-description")]/descendant::text()

Thanks for your comment, @Lars Marius Garshol.
