Extracting meta description content in PHP using DOMDocument() and XPath

Question

I'm trying to extract the meta description content from a page and display it in search results. However, it is displayed as:

content="Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus."

whereas I only want:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.

Any idea what is wrong with my code?

Code:

  $doc = new DOMDocument();
  @$doc->loadHTMLFile($page_path);
  $xpath = new DOMXPath($doc);

  $body = $xpath->query('//meta[@name="description"]/@content');
  $page_title = @$doc->getElementsByTagName('title')->item(0)->textContent;
  $page_title = $page_title ? $page_title : $page_path;
  $page_body = html2text($doc->saveXml($body->item(0)));// this is meta-description, which i want

  Functions :

  function html2text($html)
  {
  $text = $html;
  static $search = array(
  '@<script.+?</script>@usi',  // Strip out javascript content
  '@<style.+?</style>@usi',    // Strip style content
  '@<!--.+?-->@us',            // Strip multi-line comments including CDATA
  '@</?[a-z].*?\>@usi',         // Strip out HTML tags
  );
  $text = preg_replace($search, ' ', $text);
  /*
  * normalize common entities
  */
  $text = normalizeEntities($text);
  /*
  * decode other entities
  */
  $text = html_entity_decode($text, ENT_QUOTES, 'utf-8');
  /*
  * normalize possibly repeated newlines, tabs, spaces to spaces
  */
  $text = preg_replace('/\s+/u', ' ', $text);
  $text = trim($text);
  return $text;
  }


  /**
  * Replace encoded and double encoded entities to equivalent unicode character
  * @param string $text
  * @return string - the same as $text but without encoded entries
  * @access public
  */
  function normalizeEntities($text)
  {
  static $find = array();
  static $repl = array();
  if (!count($find)) {
  /*
  * build $find and $replace from map one time
  */
  $map = array(
  array('\'', 'apos', 39, 'x27'), // Apostrophe
  array('\'', '‘', 'lsquo', 8216, 'x2018'), // Open single quote
  array('\'', '’', 'rsquo', 8217, 'x2019'), // Close single quote
  array('"', '“', 'ldquo', 8220, 'x201C'), // Open double quotes
  array('"', '”', 'rdquo', 8221, 'x201D'), // Close double quotes
  array('\'', '‚', 'sbquo', 8218, 'x201A'), // Single low-9 quote
  array('"', '„', 'bdquo', 8222, 'x201E'), // Double low-9 quote
  array('\'', '′', 'prime', 8242, 'x2032'), // Prime/minutes/feet
  array('"', '″', 'Prime', 8243, 'x2033'), // Double prime/seconds/inches
  array(' ', 'nbsp', 160, 'xA0'), // Non-breaking space
  array('-', '‐', 8208, 'x2010'), // Hyphen
  array('-', '–', 'ndash', 8211, 150, 'x2013'), // En dash
  array('--', '—', 'mdash', 8212, 151, 'x2014'), // Em dash
  array(' ', ' ', 'ensp', 8194, 'x2002'), // En space
  array(' ', ' ', 'emsp', 8195, 'x2003'), // Em space
  array(' ', ' ', 'thinsp', 8201, 'x2009'), // Thin space
  array('*', '•', 'bull', 8226, 'x2022'), // Bullet
  array('*', '‣', 8227, 'x2023'), // Triangular bullet
  array('...', '…', 'hellip', 8230, 'x2026'), // Horizontal ellipsis
  array('°', 'deg', 176, 'xB0'), // Degree
  array('€', 'euro', 8364, 'x20AC'), // Euro
  array('¥', 'yen', 165, 'xA5'), // Yen
  array('£', 'pound', 163, 'xA3'), // British Pound
  array('©', 'copy', 169, 'xA9'), // Copyright Sign
  array('®', 'reg', 174, 'xAE'), // Registered Sign
  array('™', 'trade', 8482, 'x2122') // TM Sign
  );
  foreach ($map as $e) {
  for ($i = 1; $i < count($e); ++$i) {
  $code = $e[$i];
  if (is_int($code)) {
  // numeric entity
  $regex = "/&(amp;)?#0*$code;/";
  } elseif (preg_match('/^.$/u', $code)/* one unicode char*/) {
  // single character
  $regex = "/$code/u";
  } elseif (preg_match('/^x([0-9A-F]{2}){1,2}$/i', $code)) {
  // hex entity
  $regex = "/&(amp;)?#x0*" . substr($code, 1) . ";/i";
  } else {
  // named entity
  $regex = "/&(amp;)?$code;/";
  }
  $find[] = $regex;
  $repl[] = $e[0];
  }
  }
  }
  return preg_replace($find, $repl, $text);
  }

Answer 1

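You are saving the attribute node as XML. Don't! Just read its value.

Attribute nodes (DOMAttr) have a value property that returns the attribute's value. That value is plain text, so no tag stripping or entity decoding is needed afterwards.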

$html = <<<'HTML'
<meta name="description" content="Some description">
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$description = $xpath->evaluate('//meta[@name="description"]/@content')->item(0);
var_dump($description->value);

Output:

string(16) "Some description"
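
To see where the unwanted content="..." wrapper in the question comes from, the following minimal sketch compares serializing the attribute node with reading its value. The markup and description text are made up for illustration; only the two ways of reading the node matter.

```php
<?php
// Illustrative markup only; the description text is made up.
$html = '<html><head><meta name="description" content="Some description"></head><body></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$attribute = $xpath->query('//meta[@name="description"]/@content')->item(0);

// Serializing the attribute node yields its markup: name, equals sign, quoted value.
$serialized = $document->saveXML($attribute);

// Reading the value property yields only the text.
$value = $attribute->value;

var_dump(trim($serialized), $value);
```

The first dump contains the attribute markup (content="Some description"), the second only the text. Since an attribute is not an element, the tag-stripping patterns in html2text() leave that markup untouched.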

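However, XPath can return the value directly as a string. Just cast the result (inside the XPath expression) with string(). This only works with DOMXpath::evaluate(); DOMXpath::query() can only return node lists.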

$description = $xpath->evaluate('string(//meta[@name="description"]/@content)');
var_dump($description);

Output:

string(16) "Some description"
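
A side effect of the string() cast is that a page without a meta description gives an empty string rather than a null from item(0), which makes a fallback easy. The sketch below wraps this in a helper; the function name extractDescription and its $fallback parameter are hypothetical and not part of the question's code. It also uses libxml_use_internal_errors() in place of the @ operator to keep parser warnings quiet.

```php
<?php
/**
 * Hypothetical helper (not from the original post): returns the meta
 * description of an HTML string, or $fallback when the tag is missing.
 */
function extractDescription(string $html, string $fallback = ''): string
{
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    // string() turns an empty node set into '', so no null check is needed.
    $description = trim($xpath->evaluate('string(//meta[@name="description"]/@content)'));

    return $description !== '' ? $description : $fallback;
}

var_dump(extractDescription('<html><head><meta name="description" content="Some description"></head></html>'));
var_dump(extractDescription('<html><head><title>No description</title></head></html>', 'n/a'));
```

The first call returns the description text, the second the fallback. In the question's code the fallback could be the page path, mirroring what is already done for the title.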

$html = <<<'HTML'
<html>
  <head>
    <title>The Title</title>
    <meta name="description" content="Some description">
  </head>
  <body>
    <p>Some content &amp; entities &copy;</p>
  </body>
</html>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$title = $xpath->evaluate('string(//head/title)');
$description = $xpath->evaluate('string(//meta[@name="description"]/@content)');
$content = $xpath->evaluate('string(//body)');

var_dump($title, $description, $content);

Output:

string(9) "The Title"
string(16) "Some description"
string(36) "
    Some content & entities ©
  "
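
The body text above keeps the line breaks and indentation from the markup. If that is unwanted, XPath's normalize-space() function can replace string(); it also returns a string, with leading and trailing whitespace removed and inner runs collapsed to single spaces. A minimal sketch, reusing made-up markup like the example above:

```php
<?php
$html = <<<'HTML'
<html>
  <head><title>The Title</title></head>
  <body>
    <p>Some content &amp; entities &copy;</p>
  </body>
</html>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// normalize-space() converts the first matching node to a string and
// trims and collapses its whitespace in one step.
$content = $xpath->evaluate('normalize-space(//body)');

var_dump($content);
```

This covers what the preg_replace('/\s+/u', ' ', ...) and trim() calls in html2text() do for spaces, tabs and line breaks, without leaving XPath.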
