$main_url="http://programming.com";
$str = file_get_contents($main_url);
// Gets Webpage Title
if(strlen($str)>0)
{
$str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
preg_match("/\<title\>(.*)\<\/title\>/i",$str,$title); // ignore case
$title=$title[1];
}
// Gets Webpage Description
$b =$main_url;
@$url = parse_url( $b );
@$tags = get_meta_tags($url['scheme'].'://'.$url['host'] );
$description=$tags['description'];
// Gets Webpage Internal Links
$doc = new DOMDocument;
@$doc->loadHTML($str);
$items = $doc->getElementsByTagName('a');
foreach($items as $value)
{
// Skip anchors without an href; getNamedItem('href') returns null for them
if ($value->hasAttribute('href')) {
$sec_url[] = $value->getAttribute('href');
}
}
/*foreach ($sec_url as $value) {
print_r($value);
?>
<br>
<?php
}*/
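// Prepare once; bound parameters keep quotes in the scraped text from breaking the query
$stmt = mysqli_prepare($conn, "insert into datascience (link,title,description,internal_link) values (?,?,?,?)");
foreach($sec_url as $value)
{
mysqli_stmt_bind_param($stmt, 'ssss', $main_url, $title, $description, $value);
$res = mysqli_stmt_execute($stmt);
}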
I've converted the various methods you're using to find the details (title, etc.) to XPath queries against the loaded document. This just makes things consistent.
The main thing is to work out a consistent way of fetching the details. In the page you're using, each segment looks as though it's wrapped in an <article> tag. So first fetch all of these tags, then use each one as a base to look for the items you're after.
Building XPath expressions relative to each <article> means you can pick out the relevant details per item. In XPath, you use the descendant axis (descendant::...) to indicate that you want nodes inside the context node (passed in as the second parameter to evaluate()).
$main_url="http://programming.com";
$str = file_get_contents($main_url);
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($str);
$xp = new DOMXPath($doc);
$title = $doc->getElementsByTagName("title")[0]->textContent;
$description = $xp->evaluate("string(//meta[@name='description']/@content)");
echo $title.PHP_EOL;
echo $description.PHP_EOL;
$articles = $doc->getElementsByTagName('article');
$pageArticles = [];
foreach($articles as $article) {
// Both paths are relative to $article, the context node passed as the second argument
$articleTitle = $xp->evaluate("string(descendant::h2[@class='title'])", $article);
$articleViews = $xp->evaluate("string(descendant::span[@title='Views'])", $article);
$pageArticles[] = ["title" => $articleTitle, "views" => $articleViews];
}
print_r($pageArticles);
Which just gave me as output...
Tutorials - Programming.com
Tap into the collective intelligence of researchers who are working on the same problems you are - right now.
Array
(
[0] => Array
(
[title] => HTML Cheat Sheet
[views] => 1,031
)
[1] => Array
(
[title] => Best Java Training Institutes In Noida
[views] => 390
)
[2] => Array
(
[title] => Best Salesforce Training institutes in noida
[views] => 329
)
[3] => Array
(
[title] => Top Quality Digital Marketing Training Institutes in Noida
[views] => 382
)
[4] => Array
(
[title] => Make your studies with professional Best Oracle Training Institutes in Noida
[views] => 308
)
[5] => Array
(
[title] => Create a Unique Project with a Best Linux Training Institutes in Noida
[views] => 374
)
[6] => Array
(
[title] => Webtrackker Technology Best Dot Net Training Institutes Available To Guide the Students
[views] => 385
)
[7] => Array
(
[title] => Availability of My University Help Offers Great Benefit to Students
[views] => 430
)
[8] => Array
(
[title] => Webtrackker Institute of Professional Studies: Hadoop Training Institute in Noida
[views] => 350
)
[9] => Array
(
[title] => The Best Quality Digital Marketing Training Institutes in Noida
[views] => 416
)
)
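To see why the descendant:: axis and the context node matter, here is a minimal, self-contained sketch. The inline HTML is made up for illustration (it only mimics the article/h2/span layout assumed above), so it runs without fetching the page. It contrasts a relative expression with one that starts with //, which is rooted at the document and so ignores the context node.

```php
<?php
// Hypothetical markup that mimics the layout assumed in the answer
$html = '<html><head><title>Demo</title></head><body>'
      . '<article><h2 class="title">First post</h2><span title="Views">10</span></article>'
      . '<article><h2 class="title">Second post</h2><span title="Views">20</span></article>'
      . '</body></html>';

$doc = new DOMDocument;
libxml_use_internal_errors(true); // collect libxml's parse warnings (e.g. for <article>) instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

$relative = [];
$absolute = [];
foreach ($doc->getElementsByTagName('article') as $article) {
    // Relative path: evaluated inside the current <article> only
    $relative[] = $xp->evaluate("string(descendant::h2[@class='title'])", $article);
    // A path starting with // is rooted at the document, so the context node is ignored
    $absolute[] = $xp->evaluate("string(//h2[@class='title'])", $article);
}

print_r($relative); // one title per article
print_r($absolute); // the first title in the document, repeated for every article
```

If every article ends up with the same value, an expression starting with // is the usual suspect.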