XPath Query & HTML - Find Specific HREFs Within Anchor Tags

I've got the required HTML data in a DOMDocument and DOMXPath.

But I need to access and retrieve the href values in certain <a> tags. The following are the criteria:

  1. href contains: some-site.vendor.com/jobs/[#idnumber]/job (i.e. some-site.vendor.com/jobs/23094/job)

  2. href does not contain: some-site.vendor.com/jobs/search?search=pr2

  3. href does not contain: some-site.vendor.com/jobs/intro

  4. href does not contain: www.someothersite.com/

  5. href does not contain: media.someothersite.com/

  6. href does not contain: javascript:void(0)

Either of these (similar) queries fetches everything except items 4-6, which is a good thing:

$joblinks = $xpath->query('//a[@href[contains(., "https://some-site.vendor.com/jobs/")]]');    
$joblinks = $xpath->query('//a[@href[contains(., "job")]]');

Ultimately, however, I need to access all the anchor tags containing hrefs like #1 and assign the actual href values to a variable/array. Here's what I'm doing:

$payload = fetchRemoteData(SPEC_SOURCE_URL);

// suppress warning(s) due to malformed markup
libxml_use_internal_errors(true);

// load the fetched contents
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($payload);

// parse and cache the required data elements
$xpath = new DOMXPath($dom);

//$joblinks = $xpath->query('//a[@href[contains(., "some-site.vendor.com/jobs/")]]');
$joblinks = $xpath->query('//a[@href[contains(., "job")]]');
foreach($joblinks as $joblink) {
    var_dump(trim($joblink->nodeValue)); // dump hrefs here!
}
echo "\n";

This is really beating me up - I'm close, but I just can't seem to tweak the query correctly and/or access the actual href values. My humblest apologies if I've not followed protocol of any sort for this question...

ANY/ALL help would be greatly appreciated! Thanks SO MUCH in advance!

I would not suggest doing this solely with XPath. First of all, you have a whitelist and a blacklist. It's not really clear what you want, so I assume this can change over time.

So what you can do is first select all the href attributes in question and return the nodes. That's what XPath is very good at, so let's use XPath:

if (!$links = $xpath->query('//a/@href')) {
    throw new Exception('XPath query failed.');
}

You now have the usual DOMNodeList in $links, and it contains zero or more DOMAttr nodes, since those are what we selected. These now need the filtering you're looking for.
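To make the DOMAttr point concrete, here is a minimal sketch of reading the raw href strings out of such a node list. The markup in $html is made up for illustration and stands in for the fetched page.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body>'
      . '<a href="https://some-site.vendor.com/jobs/23094/job">Job</a>'
      . '<a href="javascript:void(0)">Noop</a>'
      . '<a name="no-href">Anchor without href</a>'
      . '</body></html>';

libxml_use_internal_errors(true); // suppress warnings from malformed markup
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$hrefs = array();
foreach ($xpath->query('//a/@href') as $attr) {
    // $attr is a DOMAttr; its value property holds the raw href string
    $hrefs[] = trim($attr->value);
}

print_r($hrefs);
```

The anchor without an href attribute is simply not selected, so no extra check for a missing attribute is needed at this stage.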

So you have some criteria you want to match. You were verbose but not very specific about how that should work. You have a positive match but also negative matches, and in neither case do you say what should happen when an href fails to match. So I take a shortcut here: you write yourself a function that returns either true or false depending on whether an "href" string matches the criteria:

function is_valid_href($href) {

    // do whatever you see fit ...

    return true; // or false, depending on whether $href meets your criteria
}

So the problem of telling whether an href is valid or not has been solved. Best of all, you can change it later.
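As one possible reading of the criteria in the question, the validator could be sketched like this. The function name is_job_href and the decision to anchor the pattern at the start of the string are assumptions, not something the question specifies.

```php
<?php
/**
 * Hypothetical validator built from the criteria in the question.
 * Whitelist: host some-site.vendor.com with a path of /jobs/<digits>/job.
 */
function is_job_href($href) {
    if (!is_string($href)) {
        return false; // e.g. null coming back from a normalization step
    }
    // Requiring a numeric id segment already excludes /jobs/search and
    // /jobs/intro; anchoring on the host excludes other sites and
    // javascript: pseudo-URLs.
    $pattern = '~^(?:https?://)?some-site\.vendor\.com/jobs/\d+/job(?:[/?#]|$)~i';
    return preg_match($pattern, $href) === 1;
}

var_dump(is_job_href('https://some-site.vendor.com/jobs/23094/job'));
var_dump(is_job_href('https://some-site.vendor.com/jobs/search?search=pr2'));
var_dump(is_job_href('javascript:void(0)'));
```

With a whitelist this strict, the blacklist entries need no separate checks; if the whitelist is ever loosened, explicit negative tests would have to be added back.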

So all that's needed to integrate that with the links is to get all links in their normalized and absolute form. This means more data processing, see:

for more details about the different types of URL normalization.

So we create another function that encapsulates href normalization, base resolution and validation. If the href is wrong, it just returns null, otherwise the normalized href:

function normalize_href($href, $base) {

    // do whatever is needed ...

    return $href; // the normalized href string, or null if the href is wrong
}
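For illustration only, a much simplified stand-in for such a function could be built on parse_url(). The name normalize_href_sketch is hypothetical, $base is assumed to be an absolute http(s) URL, and dot segments such as "../" are deliberately not resolved, so this is not a full resolver.

```php
<?php
/**
 * Hypothetical, simplified stand-in for normalize_href().
 * Returns an absolute http(s) URL string, or null when the href cannot be
 * turned into one (empty, fragment-only, javascript:, mailto:, ...).
 * Dot segments ("../", "./") are NOT resolved in this sketch.
 */
function normalize_href_sketch($href, $base) {
    $href = trim((string) $href);
    if ($href === '' || $href[0] === '#') {
        return null;
    }
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        return null; // base must be absolute
    }
    $scheme = strtolower($b['scheme']);
    $origin = $scheme . '://' . strtolower($b['host'])
            . (isset($b['port']) ? ':' . $b['port'] : '');
    $path   = isset($b['path']) ? $b['path'] : '/';

    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        // href carries its own scheme: accept only http(s)
        return preg_match('~^https?://~i', $href) ? $href : null;
    }
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;        // protocol-relative
    }
    if ($href[0] === '/') {
        return $origin . $href;              // root-relative
    }
    if ($href[0] === '?') {
        return $origin . $path . $href;      // query-only reference
    }
    // path-relative: resolve against the directory part of the base path
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

var_dump(normalize_href_sketch('23094/job', 'https://some-site.vendor.com/jobs/intro'));
var_dump(normalize_href_sketch('javascript:void(0)', 'https://some-site.vendor.com/jobs/intro'));
```

Returning null for anything that is not an http(s) link keeps the later validation step simple, since it then only ever sees absolute URLs or null.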

Let's put this together. In my case I even make the href a Net_URL2 instance so the validator can benefit from it.

Naturally, if you wrap this up into closures or some classes, it gets a nicer interface. You could also consider making the XPath expression a parameter as well:

// get all href
if (!$links = $xpath->query('//a/@href')) {
    throw new Exception('XPath query failed.');
}

// set a base URL
$base = 'https://stackoverflow.com/questions/9894956/xpath-query-html-find-specific-hrefs-within-anchor-tags';

/**
 * @return bool
 */
function is_valid_href($href) {    
    ...
}

/**
 * @return href
 */
function normalize_href($href, $base) {
    ...
}

$joblinks = array();
foreach ($links as $attr) {
    $href = normalize_href($attr->nodeValue, $base);
    if (is_valid_href($href)) {
        $joblinks[] = $href;
    }
}

// your result is in:
var_dump($joblinks);

I've run an example on this website, and the result is:

array(122) {
  [0]=>
  object(Net_URL2)#129 (8) {
    ["_options":"Net_URL2":private]=>
    array(5) {
      ["strict"]=>
      bool(true)
      ["use_brackets"]=>
      bool(true)
      ["encode_keys"]=>
      bool(true)
      ["input_separator"]=>
      string(1) "&"
      ["output_separator"]=>
      string(1) "&"
    }
    ["_scheme":"Net_URL2":private]=>
    string(4) "http"
    ["_userinfo":"Net_URL2":private]=>
    bool(false)
    ["_host":"Net_URL2":private]=>
    string(17) "stackexchange.com"
    ["_port":"Net_URL2":private]=>
    bool(false)
    ["_path":"Net_URL2":private]=>
    string(1) "/"
    ["_query":"Net_URL2":private]=>
    bool(false)
    ["_fragment":"Net_URL2":private]=>
    bool(false)
  }
  [1]=> 

  ...

  [121]=>
  object(Net_URL2)#250 (8) {
    ["_options":"Net_URL2":private]=>
    array(5) {
      ["strict"]=>
      bool(true)
      ["use_brackets"]=>
      bool(true)
      ["encode_keys"]=>
      bool(true)
      ["input_separator"]=>
      string(1) "&"
      ["output_separator"]=>
      string(1) "&"
    }
    ["_scheme":"Net_URL2":private]=>
    string(4) "http"
    ["_userinfo":"Net_URL2":private]=>
    bool(false)
    ["_host":"Net_URL2":private]=>
    string(22) "blog.stackoverflow.com"
    ["_port":"Net_URL2":private]=>
    bool(false)
    ["_path":"Net_URL2":private]=>
    string(30) "/2009/06/attribution-required/"
    ["_query":"Net_URL2":private]=>
    bool(false)
    ["_fragment":"Net_URL2":private]=>
    bool(false)
  }
}
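The earlier remark about closures and a parameterized XPath expression could look roughly like the following sketch. The helper name collect_hrefs, its signature, the sample markup and the two trivial callbacks are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: runs $expression and keeps the values that survive
 * the $normalize and $isValid callbacks.
 */
function collect_hrefs(DOMXPath $xpath, $expression, callable $normalize, callable $isValid) {
    $nodes = $xpath->query($expression);
    if ($nodes === false) {
        throw new Exception('XPath query failed.');
    }
    $result = array();
    foreach ($nodes as $node) {
        $href = $normalize($node->nodeValue);
        // a null from the normalizer means "not usable", so skip it
        if ($href !== null && $isValid($href)) {
            $result[] = $href;
        }
    }
    return $result;
}

// Usage with placeholder markup and deliberately trivial callbacks.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<a href=" /jobs/23094/job ">a</a><a href="/jobs/intro">b</a>');
$xpath = new DOMXPath($dom);

$joblinks = collect_hrefs(
    $xpath,
    '//a/@href',
    function ($href) { return trim($href); },
    function ($href) { return preg_match('~/jobs/\d+/job$~', $href) === 1; }
);
print_r($joblinks);
```

Passing the callbacks in keeps the loop itself unchanged when the whitelist or the normalization rules change.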
