How to Extract a Particular String from HTML Source Code using PHP

I'm trying to extract a particular string from the whole HTML source code of a page.

HTML source: view-source:https://www.instagram.com/p/BUbZXXMjnxY/?taken-by=narentrigger&hl=en

String to extract: https://instagram.fmaa1-2.fna.fbcdn.net/t51.2885-15/e35/18645014_163619900839441_7821159798480568320_n.jpg, from the "og:image" meta property.

I have tried some methods, but everything went wrong. Is there any way to grab the image link from the og:image meta property of the source code? After extracting it, I need to store the image URL in a variable. Expert help needed.

Don't use preg_match_all() if you are only grabbing one substring. Loading a DOMDocument seems like overkill for this task.

By using \K you can reduce result array bloat: the match start is reset at that point, so the URL ends up in $out[0] and no capture group is needed.

Sample Input:

$input='<meta property="og:title" content="Instagram post by Narendiran blah blah" />
<meta property="og:image" content="https://instagram.fmma1-2.blah.jpg" />
<meta property="og:description" content="8 Likes, 1 Comments - blah" />';

Method:

$url=preg_match('/"og:image"[^"]+"\K[^"]+/',$input,$out)?$out[0]:null;
echo $url;

Output:

https://instagram.fmma1-2.blah.jpg

The regex engine will run more efficiently by using a negated character class, [^"], instead of a lazy or greedy dot.
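To make the point about result array bloat concrete, here is a small sketch that runs the same sample input through a capture-group pattern and through the \K pattern. The variable names $withGroup and $withK are only illustrative.

```php
<?php
$input = '<meta property="og:title" content="Instagram post by Narendiran blah blah" />
<meta property="og:image" content="https://instagram.fmma1-2.blah.jpg" />
<meta property="og:description" content="8 Likes, 1 Comments - blah" />';

// With a capture group: index 0 holds the full match, index 1 holds the URL.
preg_match('/"og:image"[^"]+"([^"]+)/', $input, $withGroup);

// With \K: the match start is reset, so index 0 is already the URL.
preg_match('/"og:image"[^"]+"\K[^"]+/', $input, $withK);

var_dump(count($withGroup));            // int(2)
var_dump(count($withK));                // int(1)
var_dump($withGroup[1] === $withK[0]);  // bool(true)
```

Both patterns find the same URL; the \K version simply returns a one-element array.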

Assuming you have the markup inside a string in PHP, what's wrong with a regex?

// [^>]* and [^"]* keep the match inside a single <meta> tag.
preg_match_all('/<meta[^>]*property="og:image"[^>]*content="([^"]*)"[^>]*>/', $string, $matches);
echo $matches[1][0] ?? '';

Disclaimer: more efficient regexes may be available.
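One reason to prefer negated character classes over `.*` in a pattern of this kind: when several meta tags share one line (minified markup, for example), a greedy `.*` can run past the intended tag. The sketch below uses a made-up single-line input with an example.com URL to show the difference. It is an illustration, not a general HTML parser.

```php
<?php
// Hypothetical minified input: three meta tags on one line.
$line = '<meta property="og:title" content="Title" />'
      . '<meta property="og:image" content="https://example.com/a.jpg" />'
      . '<meta property="og:description" content="Desc" />';

// Greedy dots: each .* grabs as much as it can, so the capture
// ends up coming from the last content="..." on the line.
preg_match('/<meta.*property="og:image".*content="(.*)".*\/>/', $line, $greedy);

// Negated classes: [^>]* cannot leave the current tag, [^"]* cannot leave the value.
preg_match('/<meta[^>]*property="og:image"[^>]*content="([^"]*)"/', $line, $strict);

echo $greedy[1], "\n"; // Desc
echo $strict[1], "\n"; // https://example.com/a.jpg
```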

In this code snippet I'm using DOMDocument to scrape the content attribute from the meta tag. It stores the values in an array, in case there is more than one, and returns it. Hope it works.

function get_img_url($url) {

    // Create a new DOM object
    $html = new DOMDocument();

    // Load the HTML page; @ silences libxml warnings about HTML5 markup
    @$html->loadHTMLFile($url);

    // Create an empty array
    $imageArray = array();

    // Loop through each meta tag and keep only the og:image entries
    foreach ($html->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('property') === 'og:image') {
            $imageArray[] = array('url' => $meta->getAttribute('content'));
        }
    }

    // Return the list
    return $imageArray;
}
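If the markup is already in a string and only the first og:image value is wanted, the same DOM approach can be narrowed with an XPath query. This is a minimal sketch; the function name extract_og_image is hypothetical and not part of the answer above.

```php
<?php
// Hypothetical helper: returns the first og:image URL, or null if none is found.
function extract_og_image(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select the content attribute of every <meta property="og:image">.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//meta[@property="og:image"]/@content');

    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
}

$input = '<meta property="og:title" content="Instagram post by Narendiran blah blah" />
<meta property="og:image" content="https://instagram.fmma1-2.blah.jpg" />
<meta property="og:description" content="8 Likes, 1 Comments - blah" />';

$imageUrl = extract_og_image($input);
echo $imageUrl; // https://instagram.fmma1-2.blah.jpg
```

Unlike the regex answers, this does not depend on the order of the property and content attributes inside the tag.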

Try this code to scrape the webpage. I used simple_html_dom_parser. You can download it from https://sourceforge.net/projects/simplehtmldom/files/

include_once("simple_html_dom.php");

$output_filename = "example_homepage.html";
$fp = fopen($output_filename, 'w');
$url = 'https://www.instagram.com/p/BUbZXXMjnxY/?taken-by=narentrigger&hl=en';
$curl = curl_init();

curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, false);
curl_setopt ($curl, CURLOPT_FILE, $fp);
$result = curl_exec($curl);

curl_close($curl);
fclose($fp);

$html = file_get_html('example_homepage.html');

foreach($html->find('meta[property=og:image]') as $element) 
   echo $element->content . '<br>';
