How to grab title and content of web page

I have a web page, for example http://example.com/some-page. If I pass this URL to my PHP function, it should grab the title and content of the page. I've tried to grab the title like this:

function page_title($url) {
    $page = @file_get_contents($url);
    if (preg_match('~<h1 class="page-title">(.*)<\/h1>~is', $page, $matches)) {
        return $matches[0];
    }
}

echo page_title('http://example.com/some-page');  

What is my mistake?

Your function almost works. I would suggest the DOM parser solution (see below), but first I will point out a few weaknesses in the regular expression and the code:

  • The (.*) capture group is greedy, i.e. it captures as long a string as possible before a closing </h1>, even across line breaks (because of the s modifier). So if your document has multiple h1 tags, it captures everything up to the last closing one! You can fix this by making the capture lazy: (.*?)

  • The actual page may have other tags, like a span, inside the title. You could extend the regular expression to exclude any tags that surround your title, but PHP has the function strip_tags for that purpose.

  • Ensure that the file contents were actually retrieved; an error might have prevented retrieval, or your server might not allow it. As you suppress errors with the @ prefix, you may miss them. I would suggest removing the @. You could also check the return value for false.

  • Are you sure you want the h1 tag contents? A page often has a specific title tag.
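
To make the first two points concrete, here is a minimal sketch that runs both the greedy and the lazy pattern on a small, made-up HTML string (the sample markup and variable names are assumptions for illustration, not taken from the real page):

```php
<?php
// Two headings in one document, the first containing a nested <span>.
$html = '<h1 class="page-title"><span>First</span> title</h1>'
      . '<p>Some text</p>'
      . '<h1 class="page-title">Second title</h1>';

// Greedy: (.*) runs on to the LAST closing </h1> in the document.
preg_match('~<h1 class="page-title">(.*)</h1>~is', $html, $greedy);

// Lazy: (.*?) stops at the FIRST closing </h1>.
preg_match('~<h1 class="page-title">(.*?)</h1>~is', $html, $lazy);

// Index 0 is the whole match (tags included); index 1 is the capture group.
// strip_tags removes the nested <span> and any other markup.
$greedyTitle = strip_tags($greedy[1]);
$lazyTitle   = strip_tags($lazy[1]);

echo $greedyTitle, "\n"; // First titleSome textSecond title
echo $lazyTitle, "\n";   // First title
```

The greedy result swallows the paragraph and the second heading, which is why the lazy quantifier is the safer choice here.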

The above improvements will give you this code:

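function page_title($url) {
    $page = file_get_contents($url);
    if ($page === false) {
        echo "Failed to retrieve $url";
        return null; // nothing to search in
    }
    // Lazy (.*?) stops at the first closing </h1>
    if (preg_match('~<h1 class="page-title">(.*?)<\/h1>~is', $page, $matches)) {
        // $matches[0] is the whole match including the <h1> tags;
        // $matches[1] is only the captured title
        return strip_tags($matches[1]);
    }
    return null; // no matching h1 found
}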

Although this works, sooner or later you will bump into a document that has an extra space in the h1 tag, has another attribute before class, or has more than one CSS class, making the match fail. The following regular expression deals with some of these problems:

~<h1\s+class\s*=\s*"([^" ]* )?page-title( [^"]*)?"[^>]*>(.*?)<\/h1\s*>~is

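... but the class attribute still has to come before any other attribute, its value must be enclosed in double quotes, and at most one other class name may precede page-title. Note also that the title now ends up in the third capture group ($matches[3]). These limitations could be solved too, but the regular expression would become a monster.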
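
As a rough check of what this pattern does and does not tolerate, the following sketch runs it against a handful of hypothetical heading variants (the sample strings and labels are invented for illustration):

```php
<?php
// The more tolerant pattern from above; the title is the third capture group.
$pattern = '~<h1\s+class\s*=\s*"([^" ]* )?page-title( [^"]*)?"[^>]*>(.*?)<\/h1\s*>~is';

// Hypothetical sample headings, keyed by a short description.
$samples = [
    'exact'            => '<h1 class="page-title">A</h1>',
    'extra whitespace' => '<h1  class = "page-title" >B</h1 >',
    'more classes'     => '<h1 class="big page-title wide" id="t">C</h1>',
    'other attr first' => '<h1 id="t" class="page-title">D</h1>',
    'single quotes'    => "<h1 class='page-title'>E</h1>",
];

$titles = [];
foreach ($samples as $label => $html) {
    // null marks a variant the pattern fails to match
    $titles[$label] = preg_match($pattern, $html, $m) ? $m[3] : null;
    echo str_pad($label, 18), var_export($titles[$label], true), "\n";
}
```

The first three variants yield 'A', 'B' and 'C'; the last two yield NULL, which are exactly the limitations described above.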

The DOM way

Regular expressions are not the ideal way to extract content from HTML. Here is an alternative function based on DOM parsing:

function xpage_title($url) {
    // Create a new DOM Document to hold our webpage structure
    $xml = new DOMDocument();

    // Load the url's contents into the DOM, ignore warnings
    libxml_use_internal_errors(true);
    $success = $xml->loadHTMLFile($url);
    libxml_use_internal_errors(false);
    if (!$success) {
        echo "Failed to open $url.";
        return;
    }

    // Find the first h1 with class 'page-title' and return its text contents
    foreach($xml->getElementsByTagName('h1') as $h1) {
        // Does it have the desired class?
        if (in_array('page-title', explode(" ", $h1->getAttribute('class')))) {
            return $h1->textContent;
        }
    }
}

The above could still be improved by making use of DOMXPath.
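
A minimal sketch of what a DOMXPath variant might look like follows. The function name xpath_page_title is hypothetical, and it takes the HTML as a string (rather than a URL) so that it can be tried without a network request; it also assumes $class is a plain class name without quotes.

```php
<?php
// Hypothetical helper: returns the text of the first h1 whose class list
// contains $class, or null when there is none.
function xpath_page_title($html, $class = 'page-title') {
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    // Pad the normalized class list with spaces so that " page-title "
    // matches a whole class name and not e.g. "page-title-small".
    // Assumption: $class contains no quote characters.
    $xpath = new DOMXPath($doc);
    $query = sprintf(
        '//h1[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$html = '<html><body>'
      . '<h1 class="page-title-small">Not this one</h1>'
      . '<h1 id="main" class="big page-title"><span>My</span> title</h1>'
      . '</body></html>';

echo xpath_page_title($html), "\n"; // My title
```

Unlike the regular expression, this does not care about attribute order, quoting style, or how many classes the heading has.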

EDIT

You mentioned in comments you actually don't want the contents of the H1 tag because it contains more text than you want.

Then you could read the title tag and the article tag contents:

function page_title_and_content($url) {
    // From PHP 5.4 on: $result = (object) ["title" => null, "content" => null];
    $result = new stdClass();
    $result->title = null;
    $result->content = null;

    $page = file_get_contents($url);
    if ($page === false) {
        echo "Failed to retrieve $url";
        return $result; // both properties stay null
    }
    // Lazy match: the first <title> element
    if (preg_match('~<title>(.*?)</title>~is', $page, $matches)) {
        $result->title = $matches[1];
    }
    // Greedy match: everything up to the last </article>
    if (preg_match('~<article>(.*)</article>~is', $page, $matches)) {
        $result->content = $matches[1];
    }
    return $result;
}

$result = page_title_and_content('http://www.example.com/example');
echo "title: " . $result->title . "<br>";
echo "content: <br>" . $result->content . "<br>";

The above code returns an object with two properties: title and content. Note that the content property will contain HTML tags, potentially with images and such. If you don't want the tags, apply strip_tags.
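
For comparison, the same extraction could also be done the DOM way. The sketch below is only one possible shape: the name dom_title_and_content is hypothetical, and it works on an HTML string, so the page would have to be fetched (and checked for false) before calling it.

```php
<?php
// Hypothetical DOM counterpart of page_title_and_content().
function dom_title_and_content($html) {
    $result = new stdClass();
    $result->title = null;
    $result->content = null;

    $doc = new DOMDocument();
    // HTML5 tags such as <article> may trigger parser warnings; keep them quiet
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return $result;
    }

    $title = $doc->getElementsByTagName('title')->item(0);
    if ($title !== null) {
        $result->title = trim($title->textContent);
    }

    $article = $doc->getElementsByTagName('article')->item(0);
    if ($article !== null) {
        // textContent drops the markup; use $doc->saveHTML($article) to keep it
        $result->content = trim($article->textContent);
    }
    return $result;
}

$html = '<html><head><title>Example page</title></head><body>'
      . '<article class="post"><p>Hello <b>world</b>.</p></article>'
      . '</body></html>';

$result = dom_title_and_content($html);
echo "title: ", $result->title, "\n";     // title: Example page
echo "content: ", $result->content, "\n"; // content: Hello world.
```

Here the article is found even though it carries a class attribute, which the plain <article> regular expression would miss.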

