
How to exclude some data using XPath in Web-Harvest

I am using Web-Harvest with XQuery to get data from a website.

I have two XQuery variables with the following data:

$text:

<p> <strong>Psoria-Shield Inc.</strong> (<a href="http://www.psoria-shield.com/"></a><a href="/Tracker?data=gB90UgQvS9bs99znBBkklh-mudx4NTcPFIy_wiP7zUJ-qBXYABNid0GYgW4g7qVsjn3_dv2FPGzaYgKnhq_Ujg%3D%3D" target="_top">www.psoria-shield.com</a>) is a Tampa FL based company specializing in design, manufacturing, and distribution of medical devices to domestic and international
                  markets. PSI employs full-time engineering, production, sales staff, and manufactures within an ISO 13485 certified quality
                  system. PSI's flagship product, Psoria-Light&#174;, is FDA-cleared and CE marked and delivers targeted UV phototherapy for
                  the treatment of certain skin disorders. Psoria-Shield Inc., was acquired by Wellness Center USA Inc. ("WCUI") in August 2012,
                  and is now a wholly-owned subsidiary.
               </p> 
               <p> <strong>AminoFactory</strong> (<a href="http://www.aminofactory.com/"></a><a href="/Tracker?data=O0xbFRJiVuWDzRDq7SVwVR9xAPYLIGQyBw4mDziUrH4KB3DIYUasiO_O78eteJsv2doAGtg4kRhAqmnvkQ-9LA%3D%3D" target="_top">www.aminofactory.com</a>), a division of Wellness Center USA, Inc., is an online supplement store that markets and sells a wide range of high-quality
                  nutritional vitamins and supplements. By utilizing AminoFactory's online catalog, bodybuilders, athletes, and health conscious
                  consumers can choose and purchase the highest quality nutritional products from a wide array of offerings in just a few clicks.
               </p> 
                <pre>At Wellness Center Usa, Inc.
Tel: (847) 925-1885 <a href="/Tracker?data=rhuzXSqaPgDJ--ByIIMSm7wrtVUZmqiD7wl78d4gUHajkKceardtmAscrHABzvo360XXBJCWn_Rb_s-yPMVXTw_XJrSieD88bIXbE9snPn4%3D" target="_top">www.wellnescenterusa.com</a> Investor Relations Contact:
Arthur Douglas &amp; Associates, Inc.
Arthur Batson
Phone: 407-478-1120 <a href="/Tracker?data=9uKwR5tr9QwjFw830lvFTIWgz-s_eHaywZHwDl3el2RfYe5VuQZd_8sJU4J7HoFgOdyCn8br77RK60SIqLZkCy468cEKHpGUgE-nanwYfHo%3D" target="_top">www.arthurdouglasinc.com</a></pre> </span><span class="dt-green">

and $contact:

At Wellness Center Usa, Inc.
Tel: (847) 925-1885 <a href="/Tracker?data=rhuzXSqaPgDJ--ByIIMSm7wrtVUZmqiD7wl78d4gUHajkKceardtmAscrHABzvo360XXBJCWn_Rb_s-yPMVXTw_XJrSieD88bIXbE9snPn4%3D" target="_top">www.wellnescenterusa.com</a> Investor Relations Contact:
Arthur Douglas &amp; Associates, Inc.
Arthur Batson
Phone: 407-478-1120 <a href="/Tracker?data=9uKwR5tr9QwjFw830lvFTIWgz-s_eHaywZHwDl3el2RfYe5VuQZd_8sJU4J7HoFgOdyCn8br77RK60SIqLZkCy468cEKHpGUgE-nanwYfHo%3D" target="_top">www.arthurdouglasinc.com</a>

(The text above is just an example.)

What I want to do is remove the content of $contact from $text. So far I have come up with the following code:

{
    for $x in $text
        return if(matches($contact, '')) then $x
            else if(matches($contact, $x)) then  '' else $x 
}

It is not working. I don't know where I am going wrong. Please let me know the right way of doing this.

Do not use matches(...) for exact string comparison. It is made for regular expressions, so you would need to escape a bunch of special characters.
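
As a small illustration of the escaping problem, the sketch below compares matches(...) with contains(...) on one line of the contact data. The variable name $line is only an assumption for this example. The parentheses around the area code are read as a regex group, so the pattern no longer describes the literal string.

```xquery
(: $line is a hypothetical variable holding one line of the contact block :)
let $line := "Tel: (847) 925-1885"
return (
  (: false: as a regex, "(847)" is a group that matches "847" without parentheses :)
  matches($line, $line),
  (: true: contains() does a plain substring comparison, no regex involved :)
  contains($line, $line)
)
```

Note also that the second argument of matches(...) is the pattern, so `matches($contact, $x)` in the question treats each item of $text as a regular expression.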

If the HTML subtree is exactly the same, use this:

$text[not(deep-equal(., <pre>{ $contact }</pre>))]
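
A minimal, self-contained sketch of that filter, using shortened sample data that is assumed for illustration (the real $text and $contact come from the scraped page):

```xquery
(: Shortened, hypothetical stand-ins for the variables in the question :)
let $contact := "At Wellness Center Usa, Inc."
let $text := (
  <p>Psoria-Shield Inc. is a Tampa FL based company.</p>,
  <pre>At Wellness Center Usa, Inc.</pre>
)
(: Keep every item that is not structurally identical to <pre>{ $contact }</pre> :)
return $text[not(deep-equal(., <pre>{ $contact }</pre>))]
```

With this data only the `<p>` element is returned. Because deep-equal(...) also compares child elements and attributes, any difference in the markup of the two subtrees makes the comparison fail and the `<pre>` element is kept.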

If you only want to compare its contents, use data(...):

$text[not(data(.) = string-join(data($contact), ''))]
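
The sketch below shows the content-only comparison on assumed sample data. It uses the two-argument form of string-join(...), since XQuery 1.0 processors require the separator argument. The href values differ on purpose, to show that attributes are ignored here.

```xquery
(: Hypothetical sample data: $contact is a sequence of a text node and an <a> element :)
let $contact := <pre>Tel: 1 <a href="/x">www.example.com</a></pre>/node()
let $text := (
  <p>Keep me</p>,
  <pre>Tel: 1 <a href="/y">www.example.com</a></pre>
)
(: Compare only the string content; the differing href attributes do not matter :)
return $text[not(data(.) = string-join(data($contact), ''))]
```

Only the `<p>` element is returned, because the string content of the `<pre>` element equals the concatenated content of $contact.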

But given the data you posted, you would be fine just removing all <pre/> nodes:

$text[local-name() != 'pre']
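
For completeness, a tiny sketch of the name-based filter on assumed sample data:

```xquery
(: Hypothetical sample data standing in for the scraped $text :)
let $text := (
  <p>First paragraph</p>,
  <pre>Contact block</pre>,
  <p>Second paragraph</p>
)
(: Drop every top-level element named "pre", whatever it contains :)
return $text[local-name() != 'pre']
```

This returns the two `<p>` elements. It only looks at the items of $text themselves, so a `<pre>` nested deeper inside one of them would not be removed.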
