PHP RegEx (or Alt Method) for Anchor tags

OK, I have to parse a SOAP request, and in the request some of the values are passed with (or inside) an anchor tag. I'm looking for a RegEx (or alternative method) to strip the tag and just return the value.

// But item needs to be a RegEx of some sort, it's a field right now
if($sObject->list == 'item') {
   // Split on > this should be the end of the right side of the anchor tag
   $pieces = explode(">", $sObject->fields->$field);

   // Split on < this should be the closing anchor tag
   $piece = explode("<", $pieces[1]);

   $fields_string .= $piece[0] . "\n";
}

item is a field name, but I would like to make this a RegEx that checks for the anchor tag instead of a specific field.

PHP has a strip_tags() function.
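A minimal sketch of how that might look for the case in the question. The `$value` string is a made-up example of a field that arrives wrapped in an anchor tag, not data from the original request:

```php
<?php
// Hypothetical field value, similar to what the question describes
$value = '<a href="http://example.com/item/42">Widget</a>';

// strip_tags() removes the markup and keeps the text between the tags
$plain = strip_tags($value);
echo $plain, "\n";

// Values that contain no tags pass through unchanged
$untouched = strip_tags('No markup here');
echo $untouched, "\n";
```

Because untagged values come back unchanged, this can be applied to every field without first checking which ones contain an anchor.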

Alternatively, you can use filter_var() with FILTER_SANITIZE_STRING (note that this filter is deprecated as of PHP 8.1).

Whatever you do, don't parse HTML/XML with regular expressions. It's really error-prone and flaky. PHP has at least 3 different parsers as standard (SimpleXML, DOMDocument and XMLReader spring to mind).

I agree with cletus: using RegEx on HTML is bad practice because of how loose HTML is as a language (and I moan about PHP being too loose...). There are just so many ways you can vary a tag that, unless you know the document is standards-compliant / strict, it is sometimes just impossible to do. However, because I like a challenge that distracts me from work, here's how you might do it in RegEx!

I'll split this up into sections; there's no point if all you see is a string and say, "Meh... It'll do..."! First we have the main RegEx for an anchor tag:

'#<a></a>#'

Then we add in the text that could be between the tags. We want to group this in parentheses, so we can extract the string, and the question mark makes the asterisk wildcard "un-greedy", meaning that the first </a> it comes across will be the one it uses to end the match.

'#<a>(.*?)</a>#'
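To see what the question mark changes, here is a small sketch that runs both variants against a string with two anchors. The `$html` input is invented for illustration:

```php
<?php
// Hypothetical input with two bare anchors
$html = '<a>one</a> and <a>two</a>';

// Greedy: .* runs to the LAST </a> in the string
preg_match('#<a>(.*)</a>#', $html, $greedy);

// Un-greedy: .*? stops at the FIRST </a>
preg_match('#<a>(.*?)</a>#', $html, $lazy);

echo $greedy[1], "\n"; // one</a> and <a>two
echo $lazy[1], "\n";   // one
```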

Next we add in the RegEx for href="". We match the href=" as plain text, then an any-length string that does not contain a quotation mark, then the ending quotation mark.

'#<a href\="([^"]*)">(.*?)</a>#'

Now we just need to say that the tag is allowed other attributes. According to the specification, an attribute name can contain the following characters: [a-zA-Z_\:][a-zA-Z0-9_\:\.-]* . Allowing an attribute multiple times, each with a value, we get: ( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* .

The resulting RegEx (PCRE) is as follows:

'#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#'

Now, in PHP, use the preg_match_all() function to grab all occurrences in the string.

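$regex = '#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#';
// PREG_SET_ORDER groups the captures per match, so each $link is one anchor
// (the default PREG_PATTERN_ORDER would group them per capture group instead)
preg_match_all($regex, $str_containing_anchors, $result, PREG_SET_ORDER);
foreach($result as $link)
 {
  $href = $link[2]; // value of the href attribute
  $text = $link[4]; // text between <a ...> and </a>
 }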

Use SimpleXML and XPath to retrieve the nodes you need.
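A rough sketch of that approach, assuming a simplified fragment without namespaces (a real SOAP body would likely need its namespaces registered before querying). The `$xml` string and its URLs are made up for the example:

```php
<?php
// Hypothetical fragment; not the actual SOAP payload from the question
$xml = '<fragment>'
     . '<a href="http://example.com/1">Mary</a> had a little '
     . '<a href="http://example.com/2">lamb</a>'
     . '</fragment>';

$root = simplexml_load_string($xml);

// xpath() returns an array of SimpleXMLElement objects
$links = $root->xpath('//a');

foreach ($links as $a) {
    // Casting the element to string yields its text; ['href'] reads the attribute
    echo (string) $a, ' -> ', (string) $a['href'], "\n";
}
```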

If you want to strip or extract properties from only specific tags, you should try DOMDocument.

Something like this:


$TagWhiteList = array(
    // Example of WhiteList
    'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);

function getTextFromNode($Node, $Text = "") {
    // Not an element (e.g. a text node), so take its text as-is
    if (!($Node instanceof DOMElement))
        return $Text.$Node->textContent;

    // You may select a tag here
    // Like:
    // if (in_array($Node->tagName, $TagWhiteList))
    //     DoSomethingWithIt($Text, $Node);

    // Recurse into every child in document order; the loop body never runs
    // for an element without children, so there is no null dereference
    for ($Child = $Node->firstChild; $Child !== null; $Child = $Child->nextSibling)
        $Text = getTextFromNode($Child, $Text);

    return $Text;
}

function getTextFromDocument($DOMDoc) {
    return getTextFromNode($DOMDoc->documentElement);
}

To use:


$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");

$Text = getTextFromDocument($Doc);
echo "Text from HTML: ".$Text."\n";

The above function shows how to strip tags, but you can modify it a bit to manipulate the elements instead. For example, if the tag is 'a' (an anchor), you can extract its target and display it instead of the text inside.
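One way that modification might look, sketched with a standalone snippet rather than a change to the function above. The `$html` string is a hypothetical input, and replacing each anchor with a text node is just one possible approach:

```php
<?php
// Hypothetical input; the answer above loads Test.html instead
$html = '<p>See <a href="http://example.com/docs">the docs</a> or '
      . '<a href="http://example.com/faq">the FAQ</a>.</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Copy the live node list into an array first, because replacing nodes
// while iterating the live list would skip elements
foreach (iterator_to_array($doc->getElementsByTagName('a')) as $anchor) {
    $target = $doc->createTextNode($anchor->getAttribute('href'));
    $anchor->parentNode->replaceChild($target, $anchor);
}

// textContent now contains the targets in place of the link text
$text = $doc->documentElement->textContent;
echo $text, "\n";
```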

Hope this helps.

If you don't have some kind of request<->class mapping, you can extract the information with the DOM extension. The property textContent contains all the text of the context node and its descendants.

$sr = '<?xml version="1.0"?>
<SOAP:Envelope xmlns:SOAP="urn:schemas-xmlsoap-org:soap.v1">
  <SOAP:Body>
    <foo:bar xmlns:foo="urn:yaddayadda">
       <fragment>
         <a href="....">Mary</a> had a
         little <a href="....">lamb</a>
       </fragment>
    </foo:bar>
  </SOAP:Body>
</SOAP:Envelope>';

$doc = new DOMDocument;
$doc->loadXML($sr);

$xpath = new DOMXPath($doc);
$ns = $xpath->query('//fragment');
if ( 0 < $ns->length ) {
  echo $ns->item(0)->nodeValue;
}

prints

Mary had a
little lamb
