
preg_match pattern to find the contents of a string between <html> and </html> tags

I'm working on a PHP script that reads the content of emails and pulls out certain information to store in a database.

Using imap_fetchbody($imap_stream, $msg_number, 1), I'm able to get at the body of the email. In some cases (especially email sent as SMS from mobile phones), the body of the email looks like this:

===------=_Part_110734_170079945.1283532109852
Content-Type: text/html;charset=UTF-8;
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

<html> 
    <head> 
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 
        <title>Multimedia Message</title> 
    </head> 
    <body leftmargin="0" topmargin="0"> 


                <tr height="15" style="border-top: 1px solid #0F7BBC;"> 
                    <td> 
                        SMS to email test
                    </td> 
                </tr> 


     </body> 
</html> 


------=_Part_110734_170079945.1283532109852--===

I want to pull out the "content" of the email. So, my plan is this:

Check to see if the body contains the "html" tags. If not, I can read it normally (it's not an HTML email).

If it does, extract the content between the "html" tags. Then, eliminate all the other HTML tags, and the "content" is what's left.

However, I'm pretty clueless when it comes to regex patterns.

I tried this:

$pattern = '/<html[^>]*>(.*?)<\/html>/i';
preg_match($pattern, $body, $matches);
// my 'content' should be in $matches[1]

But that didn't work (probably because $body contains newlines and other whitespace). So then I tried this:

$pattern = '/<html[^>]*>([.\s]*?)<\/html>/i';
preg_match($pattern, $body, $matches);

But that didn't work either.

So, what $pattern can I use to extract all the text between the "html" tags?

UPDATE: I've stumbled into a workaround: collapse runs of whitespace first, then match on the body tags:

$body = preg_replace('/\s\s+/', ' ', $body);
$pattern = '/<body[^>]*>(.*?)<\/body>/';

I suspect this isn't the fastest or most efficient method, but it works, and is the best I've got so far. I'd still appreciate a better solution if there is one, though.

UPDATE 2: Thanks to Gumbo's suggestions, I've tried a little harder to dig through the structure of the email to find the part I was looking for, instead of attempting to regex HTML. I finally found this: http://docstore.mik.ua/orelly/webprog/pcook/ch17_04.htm, which explains how to do exactly what I needed.

$pattern = '/<html[^>]*>([^\00]*?)<\/html>/i';

This only breaks if the content contains a 0x00 byte, which it shouldn't.

You can use an HTML parser such as http://php-html.sourceforge.net/

Or you can use strip_tags: php.net/strip_tags
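
A minimal sketch of the strip_tags route, assuming the HTML part has already been isolated in a string. The $html sample is made up to mirror the message in the question. Because strip_tags keeps the text inside every element, including the title, the head section is cut away first; that step is an addition of this sketch, not part of the suggestion above.

```php
<?php
// Hypothetical sample mirroring the HTML part shown in the question.
$html = "<html>\n<head>\n<title>Multimedia Message</title>\n</head>\n"
      . "<body leftmargin=\"0\" topmargin=\"0\">\n<tr>\n<td>\n"
      . "    SMS to email test\n</td>\n</tr>\n</body>\n</html>\n";

// strip_tags() removes tags but keeps their text, so drop <head>...</head>
// first; otherwise the <title> text would end up in the result.
$withoutHead = preg_replace('/<head\b[^>]*>.*?<\/head>/is', '', $html);

// Remove the remaining tags, then collapse runs of whitespace.
$text = trim(preg_replace('/\s+/', ' ', strip_tags($withoutHead)));

echo $text, "\n";
```

With this sample, the title text is gone and only the message text remains.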

[.\s] means either a literal . or a whitespace character. What you need is either (.|\s), or [\s\S], or you simply set the s modifier to have . also match line breaks.
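
To illustrate the difference, here is a small sketch that runs the question's character class and the three alternatives against a multi-line string. The $body value is a made-up sample, not the real message, and the array keys are just labels chosen for this example.

```php
<?php
// Hypothetical multi-line sample; the real body comes from imap_fetchbody().
$body = "<html>\n<body>\nSMS to email test\n</body>\n</html>";

$patterns = [
    // Inside a character class the dot is literal: dot or whitespace only.
    'charclass'      => '/<html[^>]*>([.\s]*?)<\/html>/i',
    // Any non-newline character, or any whitespace character.
    'alternation'    => '/<html[^>]*>((?:.|\s)*?)<\/html>/i',
    // Whitespace or non-whitespace, i.e. any character at all.
    'space_nonspace' => '/<html[^>]*>([\s\S]*?)<\/html>/i',
    // The s modifier makes the dot match newlines as well.
    's_modifier'     => '/<html[^>]*>(.*?)<\/html>/si',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_match() returns 1 on a match and 0 on no match.
    $results[$label] = preg_match($pattern, $body, $matches);
    echo $label, ': ', $results[$label] ? 'match' : 'no match', "\n";
}
```

Only the first pattern fails to match here. Of the three that work, the alternation form can be noticeably slower on large inputs, so the s modifier or [\s\S] is usually the safer pick.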

But besides that, you should not use regular expressions to match HTML. Parts of HTML are not regular and thus you cannot use regular expressions to describe it.
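
One parser-based option, not named in this answer and offered here only as an assumption about what would fit, is PHP's bundled DOM extension. A minimal sketch with a made-up $html sample:

```php
<?php
// Hypothetical sample; in practice this would be the decoded text/html part.
$html = '<html><head><title>Multimedia Message</title></head>'
      . '<body><p>SMS to <b>email</b> test</p></body></html>';

// Collect libxml warnings about sloppy markup instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// textContent concatenates the text of all descendants of <body>.
$bodyNode = $doc->getElementsByTagName('body')->item(0);
$text = $bodyNode !== null ? trim($bodyNode->textContent) : '';

echo $text, "\n";
```

Since only the body node is read, the title text never reaches the result, and no pattern has to guess where tags begin or end.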

But besides that, you should not try to guess the range of a multipart content when you have distinct delimiters, and those delimiters aren't <html>…</html>. What if the tags are missing? Then your attempt will fail. Use the delimiters defined by the message itself: the boundary value. So use the boundary to get the parts, and split each part at the first CRLF+CRLF sequence to separate the header from the body.
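
A rough sketch of that idea, assuming the boundary string has already been read from the message's Content-Type header (here it is hard-coded). split_multipart is a hypothetical helper name, the sample message is made up, and the code handles only a single, non-nested multipart level in which every part has at least one header line.

```php
<?php
/**
 * Split a multipart body into parts, each with 'headers' and 'body'.
 * Hypothetical helper; handles one multipart level only.
 */
function split_multipart(string $raw, string $boundary): array
{
    // Everything after the closing delimiter "--boundary--" is epilogue.
    $raw = explode('--' . $boundary . '--', $raw, 2)[0];

    // The first chunk (before the first delimiter) is the preamble; drop it.
    $chunks = explode('--' . $boundary, $raw);
    array_shift($chunks);

    $parts = [];
    foreach ($chunks as $chunk) {
        // Headers and body are separated by the first empty line
        // (CRLF CRLF; bare LF is tolerated here too).
        $pieces = preg_split('/\r?\n\r?\n/', ltrim($chunk, "\r\n"), 2);
        $parts[] = [
            'headers' => $pieces[0],
            'body'    => isset($pieces[1]) ? rtrim($pieces[1], "\r\n") : '',
        ];
    }
    return $parts;
}

// Made-up two-part message using the boundary from the question.
$boundary = '----=_Part_110734_170079945.1283532109852';
$raw = "--$boundary\r\n"
     . "Content-Type: text/plain; charset=UTF-8\r\n\r\n"
     . "SMS to email test\r\n"
     . "--$boundary\r\n"
     . "Content-Type: text/html; charset=UTF-8\r\n\r\n"
     . "<html><body>SMS to email test</body></html>\r\n"
     . "--$boundary--\r\n";

$parts = split_multipart($raw, $boundary);
echo count($parts), " parts\n";
```

Each entry can then be inspected by its Content-Type header, so the plain-text part can be preferred and the HTML part used only as a fallback.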

But besides that, why don't you use the IMAP functions to get the body? I'm not familiar with PHP's IMAP API, but there is probably a function that does exactly what you're looking for.
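
PHP's IMAP extension does provide imap_fetchstructure(), which returns an object describing the MIME tree, and the section number found in that tree can be passed to imap_fetchbody(). The sketch below shows only the tree walk, run on a hand-built stand-in object so that it works without a mailbox. find_section is a hypothetical helper name; the property names (type, subtype, parts) are the ones the structure object uses.

```php
<?php
/**
 * Return the section number (e.g. "1" or "1.2") of the first text part
 * with the given subtype, or null if none is found.
 * Hypothetical helper; $structure mimics what imap_fetchstructure() returns.
 */
function find_section(object $structure, string $subtype, string $prefix = ''): ?string
{
    if (empty($structure->parts)) {
        // Leaf part: type 0 means "text". A non-multipart message is section "1".
        $isWanted = $structure->type === 0
            && strcasecmp($structure->subtype ?? '', $subtype) === 0;
        return $isWanted ? ($prefix === '' ? '1' : $prefix) : null;
    }

    // Multipart: children are numbered 1, 2, ... below the current prefix.
    foreach ($structure->parts as $index => $part) {
        $section = ($prefix === '' ? '' : $prefix . '.') . ($index + 1);
        $found = find_section($part, $subtype, $section);
        if ($found !== null) {
            return $found;
        }
    }
    return null;
}

// Hand-built stand-in for a multipart/alternative message.
$structure = (object) [
    'type' => 1, 'subtype' => 'ALTERNATIVE',
    'parts' => [
        (object) ['type' => 0, 'subtype' => 'PLAIN'],
        (object) ['type' => 0, 'subtype' => 'HTML'],
    ],
];

$section = find_section($structure, 'PLAIN');
echo $section, "\n";
```

With a real connection, the result would be used as imap_fetchbody($imap_stream, $msg_number, $section), and the returned text may still need decoding according to the part's encoding property.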

You just need to add the s modifier to allow . to match newlines:

$pattern = '/<html[^>]*>(.*?)<\/html>/si';
preg_match($pattern, $body, $matches);
