Convert HTML Page to Plain Text using PHP

To be more precise, I need to be able to strip HTML tags as cleanly as this script does: zubrag.com/tools/html-tags-stripper.php

I need to be able to do this on my localhost (XAMPP server) with any URL, but for now I would like to strip the tags from this URL, since it is about as messy as it can get: http://static.anaf.ro/static/10/Timis/Timis.htm

What I have doesn't work, and I don't know why or how to fix it. Here's where the code comes from: nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page

I've added this line to the code, but it still won't work:

$text = file_get_contents('http://static.anaf.ro/static/10/Timis/Timis.htm');

Below is the raw code (note that the original code doesn't have the line above; I added it):

/**
 * Copyright (c) 2008, David R. Nadeau, NadeauSoftware.com.
 * All rights reserved.
 * See:
 *  http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page
 */


$text = file_get_contents('http://static.anaf.ro/static/10/Timis/Timis.htm');

function strip_html_tags( $text )
{
    // PHP's strip_tags() function will remove tags, but it
    // doesn't remove scripts, styles, and other unwanted
    // invisible text between tags.  Also, as a prelude to
    // tokenizing the text, we need to insure that when
    // block-level tags (such as <p> or <div>) are removed,
    // neighboring words aren't joined.
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',

            // Add line breaks before & after blocks
            '@<((br)|(hr))@iu',
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            "\n\$0", "\n\$0",
        ),
        $text );

    // Remove all remaining tags and comments and return.
    echo strip_tags( $text );
}

The tag-stripping function itself works fine, but the charset-detection regex from the article linked in your post doesn't work: it doesn't return the right charset. Try this:

function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
            // Add line breaks before and after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            "\n\$0", "\n\$0",
        ),
        $text );
    return strip_tags( $text );
}

/* Read an HTML file */
$raw_text = file_get_contents('http://static.anaf.ro/static/10/Timis/Timis.htm');

/* Get the file's character encoding from a <meta> tag */
preg_match("/<meta[^>]+charset=['\"]?(.*?)['\"]?[\/\s>]/i", $raw_text, $matches );
$encoding = $matches[1];

/* Convert to UTF-8 before doing anything else */
$utf8_text = iconv( $encoding, "utf-8", $raw_text );

/* Strip HTML tags and invisible text */
$utf8_text = strip_html_tags( $utf8_text );

/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "UTF-8" );
echo $utf8_text;
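
The order of the steps matters. The patterns in strip_html_tags() use the `u` modifier, and with that modifier preg_replace() returns NULL instead of a string when the subject is not valid UTF-8. Here is a minimal sketch of that failure mode, using a made-up Latin-1 string rather than the real page (the sample string and the ISO-8859-1 source charset are assumptions for illustration):

```php
<?php
// "\xE9" is "é" in ISO-8859-1; as a lone byte it is not valid UTF-8.
$latin1 = "<p>caf\xE9</p>";

// With the /u modifier, PCRE refuses the invalid subject.
$broken = preg_replace('@</?p@iu', "\n\$0", $latin1);
$error  = preg_last_error();
var_dump($broken);                        // NULL
var_dump($error === PREG_BAD_UTF8_ERROR); // bool(true)

// After converting to UTF-8, the same call succeeds and
// puts a line break in front of each <p / </p.
$utf8  = iconv('ISO-8859-1', 'UTF-8', $latin1);
$fixed = preg_replace('@</?p@iu', "\n\$0", $utf8);
var_dump($fixed);
```

If strip_html_tags() ever prints nothing at all, checking preg_last_error() right after the preg_replace() call is a quick way to see whether this is the cause.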

What I've changed:

To get the right charset, I simply replaced this

/* Get the file's character encoding from a <meta> tag */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i', $raw_text, $matches );
$encoding = $matches[3];

with this

preg_match("/<meta[^>]+charset=['\"]?(.*?)['\"]?[\/\s>]/i", $raw_text, $matches );
$encoding = $matches[1];
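
Both versions assume the page actually declares a charset. If it doesn't, `$matches[1]` is undefined and iconv() gets an empty encoding. Below is a rough sketch of a more defensive variant. The helper names `detect_charset()` and `to_utf8()`, the slightly tighter regex, and the ISO-8859-1 fallback are my own assumptions, not part of the original script:

```php
<?php
// Hypothetical helper: return the charset declared in a <meta> tag,
// or $default when the page declares none.
// No /u modifier here, because the input is not known to be UTF-8 yet.
function detect_charset(string $html, string $default = 'ISO-8859-1'): string
{
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return $m[1];
    }
    return $default;
}

// Hypothetical helper: convert to UTF-8, returning the input unchanged
// if iconv() cannot handle the detected charset.
function to_utf8(string $html): string
{
    $converted = @iconv(detect_charset($html), 'UTF-8', $html);
    return $converted === false ? $html : $converted;
}
```

With these in place, the conversion step in the script would become `$utf8_text = to_utf8( $raw_text );`. The fallback charset is a guess, so adjust it to whatever the pages you scrape typically use.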

EDIT 1: It seems the script from the website does have some problems stripping the tags from the URL you provided: the output shows a lot of Â's. I guess the best way to strip tags is simply to strip everything between an opening < and the first closing >. I don't have a regex for that at the moment, though; maybe Google could help :)
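
For what it's worth, the "everything between < and the next >" idea can be sketched as follows. The function name `strip_tags_simple()` is made up for illustration, and the input is assumed to be UTF-8 already. The non-breaking-space step is a guess at where the Â's come from: a decoded `&nbsp;` is the byte pair C2 A0 in UTF-8, and the C2 byte often shows up as Â when the output is displayed as Latin-1.

```php
<?php
// Hypothetical helper, not part of the original script.
// Expects HTML that has already been converted to UTF-8.
function strip_tags_simple(string $utf8_html): string
{
    // Replace everything from "<" up to the next ">" with a space.
    $text = preg_replace('/<[^>]*>/', ' ', $utf8_html);

    // Decode entities, then turn non-breaking spaces (U+00A0) into plain spaces.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = str_replace("\xC2\xA0", ' ', $text);

    // Collapse runs of spaces and tabs and trim the ends.
    return trim(preg_replace('/[ \t]+/', ' ', $text));
}

echo strip_tags_simple('<p>Hello&nbsp;<b>world</b></p>'), "\n";
```

Note that this naive pattern keeps the contents of script and style elements and trips over a literal > inside an attribute value, so it is a complement to the preg_replace() list above, not a replacement for it.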
