JSON parse, if Wikipedia has multiple options pick first page to display

The following code grabs the first paragraph from a Wikipedia page.

<?php
// action=parse: get parsed text
// page=Baseball: from the page Baseball
// format=json: in json format
// prop=text: send the text content of the article
// section=0: top content of the page

$find = $_GET['find'];

$url = 'https://en.wikipedia.org/w/api.php?action=parse&page=baseball&format=json&prop=text&section=0'; // https: the plain http URL only answers with a redirect, which cURL does not follow by default
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_USERAGENT, "TestScript"); // required by wikipedia.org server; use YOUR user agent with YOUR contact information. (otherwise your IP might get blocked)
$c = curl_exec($ch);

$json = json_decode($c);

$content = $json->{'parse'}->{'text'}->{'*'}; // get the main text content of the query (it's parsed HTML)

// pattern for first match of a paragraph
$pattern = '#<p>(.*?)</p>#s'; // http://www.phpbuilder.com/board/showthread.php?t=10352690
if(preg_match_all($pattern, $content, $matches))
{
    // print $matches[0]; // content of the first paragraph (including wrapping <p> tag)
    echo "Wikipedia:<br>";
    print strip_tags(implode("\n\n",$matches[1])); // Text of every matched paragraph in the section, without the HTML tags.
}
?>

The issue is that sometimes I want to make the title a variable in PHP so I can "search" for the information, but my query isn't always going to be a legitimate Wikipedia page.

For example, when the above code searches for baseball, there is a page for baseball. But when I search for "mandarin", it shows:

Mandarin may refer to any of the following:

But it doesn't show any options.

My question is, is there a way to check to see if the page exists, and if not, get a list of options from Wikipedia that it could be, then pick the first page to display?

Back in the 80's when referring to parsing XML and HTML documents, Nancy Reagan cried out:

Just Say No to REGEX!

Wait a minute! I might be mistaken on that. I think she may have said, "Just Say No to Drugs!", and I doubt she was thinking about XML or HTML documents when she said it. But if she had been, I'm sure she would agree with me that parsing XML and HTML is better done with PHP's DOMDocument class, for two reasons:

  • Regular expressions aren't very reliable for that purpose. A single unexpected character can throw them off, and any change the webmaster makes to the markup can render your regex patterns useless.
  • Regular expressions can be slow, especially if you have to get multiple items from the document. DOMDocument parses the document once, and then all the data is contained in an object for easy access.
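
As a rough illustration of the second point, the first paragraph that the question pulls out with a regex could be read from the parsed document instead. This is only a sketch: the function name firstParagraphText is made up for this example, and the XML prolog prepended to the markup is a common workaround to make libxml treat the input as UTF-8, not something the API requires.

```php
<?php
// Hypothetical helper (not part of any library): returns the text of the
// first non-empty <p> element in an HTML fragment, or null if there is none.
function firstParagraphText(string $html): ?string
{
    $doc = new DOMDocument();
    // The prolog hints UTF-8 to libxml; @ hides warnings about imperfect markup.
    if (!@$doc->loadHTML('<?xml encoding="UTF-8">' . $html)) {
        return null;
    }
    $xpath = new DOMXPath($doc);
    // Walk the paragraphs in document order and skip the empty ones.
    foreach ($xpath->query('//p') as $p) {
        $text = trim($p->textContent);
        if ($text !== '') {
            return $text;
        }
    }
    return null;
}
```

With the question's code, this would be called as firstParagraphText($content). Because textContent already leaves out the markup, no strip_tags call is needed.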

I went to the "Mandarin" page and found the following:

<h2>
    <span class="editsection">[<a href="/w/index.php?title=Mandarin&amp;action=edit&amp;section=1" title="Edit section: Officials">edit</a>]</span>
    <span class="mw-headline" id="Officials">Officials</span>
</h2>
<ul>
    <li><a href="/wiki/Mandarin_(bureaucrat)" title="Mandarin (bureaucrat)">Mandarin (bureaucrat)</a>, a bureaucrat of Imperial China (the original meaning of the word), Vietnam, and by analogy, any senior government bureaucrat</li>
</ul>

You can get the first link using the following code:

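$doc = new DOMDocument();
// load the HTML string into the document object (@ hides warnings about imperfect markup)
if ( ! @$doc->loadHTML($data)){
    return FALSE;
}
// create an XPath object using the document object as the parameter
$xpath = new DOMXPath($doc);
// select article links inside list items; the editsection span only holds the "edit" link
$query = "//ul/li/a[starts-with(@href, '/wiki/')]";
// XPath queries return a DOMNodeList
$res = $xpath->query($query);
// item(0) is NULL when nothing matched, so check before reading the attribute
$link = $res->length > 0 ? $res->item(0)->getAttribute('href') : NULL;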

Once you have the URL, it's a simple matter to request the next page. As far as testing whether a page has this information or not, I think you can figure that out.
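
To make those two steps a little more concrete, here is a minimal sketch. Both function names are invented for this example. It assumes that the API answers a request for a nonexistent page with a top-level "error" object, and that option links look like the /wiki/... href in the snippet above.

```php
<?php
// Hypothetical helper: true when the decoded API response is unusable,
// either because decoding failed or because it carries an "error" object.
// Expects the result of json_decode($c, true), i.e. an associative array.
function isErrorResponse($response): bool
{
    return !is_array($response) || isset($response['error']);
}

// Hypothetical helper: turns an href such as "/wiki/Mandarin_(bureaucrat)"
// into a parse request for that page. Returns null for any other kind of link.
function hrefToParseUrl(string $href): ?string
{
    $prefix = '/wiki/';
    if (strncmp($href, $prefix, strlen($prefix)) !== 0) {
        return null;
    }
    // Drop any #fragment, then undo the percent-encoding so that
    // http_build_query() does not encode the title a second time.
    $path  = explode('#', substr($href, strlen($prefix)), 2)[0];
    $title = rawurldecode($path);

    $query = http_build_query([
        'action'  => 'parse',
        'page'    => $title,
        'format'  => 'json',
        'prop'    => 'text',
        'section' => 0,
    ], '', '&');

    return 'https://en.wikipedia.org/w/api.php?' . $query;
}
```

One thing to keep in mind: the request in the question uses section=0, which returns only the top of the page. For "Mandarin" that is just the "may refer to" line, so the list of options will only be in $data if the first request is made without that parameter.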

If you're going to be doing this sort of thing, it's well worth your while to learn about the DOMDocument class and how to write XPath queries.

EDIT:

The variable $data is just a string containing the HTML from the page.

