How do I extract the abstract from a web page?

Question

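I'm writing code to extract the abstract from an arXiv page (for example, http://arxiv.org/abs/1207.0102). I'm interested in extracting the text from "We study a model of ..." to "... compass-Heisenberg model." My code currently looks like this: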

$url="http://arxiv.org/abs/1207.0102";
$options = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko\r\n"
  )
);
$context = stream_context_create($options);
$str = file_get_contents($url, false, $context);

if (preg_match('~<body[^>]*>(.*?)</body>~si', $str, $body))
{
    echo $body[1];
}

The problem is that it extracts everything inside the body tag. Is there a way to extract only the abstract?

Answer 1

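Your best option is to use a DOM parser. PHP has one built in (http://php.net/manual/en/class.domdocument.php), but there are plenty of similar classes as well.

With DOMDocument, you would do something like the following: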

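<?php
  $doc = new DOMDocument();
  // Sample markup; in practice, pass the fetched page ($str) instead.
  // The id "abstract" is illustrative and must exist in the markup.
  $doc->loadHTML('<html><body><div id="abstract">Test</div></body></html>');
  $node = $doc->getElementById("abstract");
  $text = $node ? trim($node->textContent) : null;
?>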
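
Applied to the question's page, the DOM approach might look roughly like the sketch below. It assumes the abstract is wrapped in a blockquote element whose class list contains "abstract"; that is an assumption about the page markup, so check the actual source and adjust the XPath query if needed. The function name extractAbstract and the sample markup are hypothetical, introduced only for illustration.

```php
<?php
// Sketch: pull the abstract text out of an HTML page with DOMDocument + DOMXPath.
// Assumption: the abstract sits in <blockquote class="abstract ...">; adjust the
// XPath query if the real markup differs.
function extractAbstract(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; keep parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match a blockquote whose class list contains the token "abstract".
    $query = '//blockquote[contains(concat(" ", normalize-space(@class), " "), " abstract ")]';
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Collapse whitespace and drop a leading "Abstract:" label if present.
    $text = trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
    return preg_replace('/^Abstract:\s*/i', '', $text);
}

// Hypothetical sample markup standing in for the fetched page ($str in the question).
$sample = '<html><body><h1>Title</h1>'
        . '<blockquote class="abstract mathjax"><span class="descriptor">Abstract:</span>'
        . ' We study a model.</blockquote></body></html>';
echo extractAbstract($sample), "\n";
```

To use it with the question's code, pass $str (the result of file_get_contents) instead of $sample. The function returns null when no matching element is found, so the caller can tell "no abstract" apart from an empty one.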

Another option is to use a regular expression, which looks like what you are already doing. As you know, it's a bit messy and takes some learning: http://www.regular-expressions.info/tutorial.html
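
For comparison, the regular-expression route could be narrowed from the whole body to a single element, roughly as sketched here. This again assumes the abstract lives in a blockquote with "abstract" in its class attribute, which is not confirmed by the question; the function name extractAbstractRegex is hypothetical. A pattern like this is brittle and will break if the markup changes (for instance, single-quoted attributes or nested blockquotes).

```php
<?php
// Sketch: regex-based extraction, narrowing the question's <body> pattern.
// Assumption: the abstract is wrapped in <blockquote class="abstract ...">.
function extractAbstractRegex(string $html): ?string
{
    $pattern = '~<blockquote[^>]*\bclass="[^"]*\babstract\b[^"]*"[^>]*>(.*?)</blockquote>~si';
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }
    // Drop inner tags, decode entities, and normalise whitespace.
    $text = html_entity_decode(strip_tags($m[1]), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Hypothetical sample markup standing in for the fetched page.
$sample = '<body><blockquote class="abstract mathjax"><span class="descriptor">Abstract:</span>'
        . ' We study a model.</blockquote></body>';
echo extractAbstractRegex($sample), "\n";
```

Unlike the DOM version, this one leaves any leading "Abstract:" label in place; stripping it would take one more preg_replace call.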

Thanks.
