
How to extract URLs from this site with xpath and scrapy using href?

I'm still getting the gist of xpath and how it works (have been trying to learn from w3 for a while) but I'm sort of confused how to extract this section of code from this webpage: https://www.pro-football-reference.com/years/2005/ (I've been looking through the source here: view-source: https://www.pro-football-reference.com/years/2005/ ). I would like to extract the URLs from lines 363 - 383.

<ul class="">
<li><a href="/years/2005/week_1.htm">Week 1</a></li>
<li><a href="/years/2005/week_2.htm">Week 2</a></li>
<li><a href="/years/2005/week_3.htm">Week 3</a></li>
<li><a href="/years/2005/week_4.htm">Week 4</a></li>
<li><a href="/years/2005/week_5.htm">Week 5</a></li>
<li><a href="/years/2005/week_6.htm">Week 6</a></li>
<li><a href="/years/2005/week_7.htm">Week 7</a></li>
<li><a href="/years/2005/week_8.htm">Week 8</a></li>
<li><a href="/years/2005/week_9.htm">Week 9</a></li>
<li><a href="/years/2005/week_10.htm">Week 10</a></li>
<li><a href="/years/2005/week_11.htm">Week 11</a></li>
<li><a href="/years/2005/week_12.htm">Week 12</a></li>
<li><a href="/years/2005/week_13.htm">Week 13</a></li>
<li><a href="/years/2005/week_14.htm">Week 14</a></li>
<li><a href="/years/2005/week_15.htm">Week 15</a></li>
<li><a href="/years/2005/week_16.htm">Week 16</a></li>
<li><a href="/years/2005/week_17.htm">Week 17</a></li>
<li><a href="/years/2005/week_18.htm">Wild Card</a></li>
<li><a href="/years/2005/week_19.htm">Divisional</a></li>
<li><a href="/years/2005/week_20.htm">Conf Champ</a></li>
<li><a href="/years/2005/week_21.htm">Super Bowl</a></li>
</ul>

I've tried using $x('//ul[@class=""]/@href') in the browser console, but it doesn't really work. Could someone help me extract the hrefs from these? Any help or advice would be greatly appreciated!

There are two similar ways to parse the hrefs there.

A shorter expression (but more error prone, depending on what the rest of your HTML looks like):

$x('//ul[@class=""]//a/@href')

Meaning: the href of any "a" node that is a descendant (direct or not) of any "ul" node with an empty class attribute.

A slightly longer expression, but less error prone because it is more explicit:

$x('//ul[@class=""]/li/a/@href')

Meaning: the href of any "a" node that is a direct child of an "li" node, which is itself a direct child of any "ul" node with an empty class attribute.
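The two expressions only give different results when the list contains a link that is not directly inside an "li". Here is a rough sketch of that case. It uses Python's standard-library xml.etree.ElementTree as a stand-in for a full XPath engine (it only supports a subset of XPath), and the /extra/notes.htm link and the helper names are made up for the demonstration. In a Scrapy callback you would normally pass the original expressions to response.xpath(...).getall() instead.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the list from the question, plus one hypothetical
# nested link (/extra/notes.htm) that is NOT a direct child of an <li>.
SNIPPET = """
<div>
  <ul class="">
    <li><a href="/years/2005/week_1.htm">Week 1</a></li>
    <li><a href="/years/2005/week_2.htm">Week 2</a>
        <span><a href="/extra/notes.htm">notes</a></span></li>
  </ul>
</div>
"""


def find_list(root):
    """Return the first <ul> whose class attribute is the empty string."""
    return root.find(".//ul[@class='']")


def hrefs_any_descendant(ul):
    """Rough analogue of //ul[@class=""]//a/@href (any depth)."""
    return [a.get("href") for a in ul.iter("a")]


def hrefs_strict_path(ul):
    """Rough analogue of //ul[@class=""]/li/a/@href (exact path only)."""
    return [a.get("href") for a in ul.findall("li/a")]


root = ET.fromstring(SNIPPET)
ul = find_list(root)
loose = hrefs_any_descendant(ul)
strict = hrefs_strict_path(ul)
print(loose)   # includes the nested /extra/notes.htm link
print(strict)  # only the links that sit directly inside an <li>
```

On the real page both expressions should return the same 21 links, as long as the markup matches the snippet in the question.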

Additionally, you can try fancier XPath functions (not supported by every framework, though), such as string-length() applied to the class attribute.

"//" selects any matching descendant, while "/" selects only matching direct children. Since the "a" elements are not direct children of the "ul", I think your selector should be this:

$x('//ul[@class=""]//@href')
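To see why the original attempt comes back empty, here is a small sketch in Python. It uses the standard-library xml.etree.ElementTree rather than a real XPath engine, so the two lookups are only approximations of the "/@href" and "//@href" steps, and the two-item list is a shortened copy of the markup from the question.

```python
import xml.etree.ElementTree as ET

ul = ET.fromstring(
    '<ul class="">'
    '<li><a href="/years/2005/week_20.htm">Conf Champ</a></li>'
    '<li><a href="/years/2005/week_21.htm">Super Bowl</a></li>'
    '</ul>'
)

# Approximates //ul[@class=""]/@href: look for href on the <ul> itself.
# The <ul> only has a class attribute, so nothing is found.
own_href = ul.get("href")

# Approximates //ul[@class=""]//@href: collect href from every element
# in the subtree (the <ul> included) that carries one, at any depth.
subtree_hrefs = [
    elem.get("href") for elem in ul.iter() if "href" in elem.attrib
]

print(own_href)
print(subtree_hrefs)
```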

To get just the elements where the inner text starts with "Week":

$x('//ul[@class=""]//a[starts-with(.,"Week")]/@href')
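The same filter can be sketched outside the console in Python. This is only an illustration under a few assumptions: ElementTree does not implement starts-with(), so the check is done with str.startswith on the link text; the three-item list is a shortened copy of the question's markup; and urljoin is used to turn the relative hrefs into full URLs against the page address from the question (in Scrapy, response.urljoin(href) plays that role).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Page address taken from the question; used to resolve relative hrefs.
BASE_URL = "https://www.pro-football-reference.com/years/2005/"

ul = ET.fromstring(
    '<ul class="">'
    '<li><a href="/years/2005/week_16.htm">Week 16</a></li>'
    '<li><a href="/years/2005/week_17.htm">Week 17</a></li>'
    '<li><a href="/years/2005/week_18.htm">Wild Card</a></li>'
    '</ul>'
)

# Keep only links whose full text starts with "Week", mirroring
# a[starts-with(., "Week")], then make each href absolute.
week_urls = [
    urljoin(BASE_URL, a.get("href"))
    for a in ul.findall("li/a")
    if "".join(a.itertext()).startswith("Week")
]

for url in week_urls:
    print(url)
```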


 