Using a regular expression to trim HTML

I have been trying to solve this for a while now.

I need a regex to strip the newlines, tabs and spaces between the HTML tags, as demonstrated in the example below:

Source:

<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>

Wanted result:

<html><head><title>Some title</title></head></html>

Trimming the whitespace before "Some title" is optional. I'd be grateful for any help.

If the HTML is strict (that is, well-formed XML), load it with an XML reader and write it back without formatting. That preserves the whitespace inside an element's text content, but not the whitespace between tags.
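As a rough illustration of this approach, here is a Python sketch using the standard-library `xml.etree.ElementTree` module. One assumption to note: ElementTree keeps whitespace-only text nodes when parsing, so the hypothetical helper `strip_between_tags` clears them explicitly before serializing. It only works on well-formed markup and is not the answerer's own code.

```python
import xml.etree.ElementTree as ET


def strip_between_tags(markup: str) -> str:
    """Parse well-formed markup and drop whitespace-only text between tags.

    Hypothetical helper for illustration; raises ET.ParseError on markup
    that is not well-formed XML.
    """
    root = ET.fromstring(markup)
    for element in root.iter():
        # .text is the text right after the opening tag,
        # .tail is the text right after the closing tag.
        if element.text is not None and not element.text.strip():
            element.text = None
        if element.tail is not None and not element.tail.strip():
            element.tail = None
    # Serialize without any pretty-printing.
    return ET.tostring(root, encoding="unicode")
```

Text that contains anything other than whitespace is written back untouched, so the indentation around "Some title" would survive while the whitespace between the tags disappears.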

\d does not match only [0-9] in Perl 5.8 and 5.10; it matches any Unicode character that has the digit attribute (including "\x{1815}" and "\x{FF15}"). If you mean [0-9], you must either write [0-9] or use the bytes pragma (but that treats every string as a sequence of 1-byte characters, which is normally not what you want).
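The same pitfall shows up outside Perl. As an illustration in Python's `re` module (an analogy, not Perl itself), `\d` in a `str` pattern also matches non-ASCII decimal digits unless the `re.ASCII` flag is set. The variable names below are made up for the example.

```python
import re

mongolian_five = "\u1815"  # MONGOLIAN DIGIT FIVE
fullwidth_five = "\uFF15"  # FULLWIDTH DIGIT FIVE

unicode_digit = re.compile(r"\d")          # any Unicode decimal digit
ascii_class = re.compile(r"[0-9]")         # explicit ASCII range
ascii_flag = re.compile(r"\d", re.ASCII)   # \d restricted to ASCII

for ch in (mongolian_five, fullwidth_five, "5"):
    print(
        f"U+{ord(ch):04X}",
        bool(unicode_digit.fullmatch(ch)),
        bool(ascii_class.fullmatch(ch)),
        bool(ascii_flag.fullmatch(ch)),
    )
```

If only ASCII digits are acceptable, spelling out `[0-9]` is the most portable choice across regex flavours.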

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.
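To make the parser suggestion concrete, here is a minimal sketch using Python's standard-library `html.parser.HTMLParser`. The class `WhitespaceStripper` and the function `strip_whitespace` are hypothetical names for this example. It is deliberately incomplete: comments, doctype declarations, and whitespace-sensitive elements such as `<pre>` or `<script>` are not handled.

```python
from html import escape
from html.parser import HTMLParser


class WhitespaceStripper(HTMLParser):
    """Re-emit tags as written and trim whitespace around text (sketch only)."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True: text arrives unescaped
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it appeared.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text:  # whitespace-only runs between tags are dropped
            self.parts.append(escape(text, quote=False))


def strip_whitespace(markup: str) -> str:
    parser = WhitespaceStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

Calling `strip_whitespace` on the markup from the question is intended to give the single-line form asked for, without the regex ever having to guess where a tag starts or ends.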

You may find the HTMLAgilityPack answer helpful.

A solution with XSLT would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">    
<xsl:output  method="xml" encoding="UTF-8" indent="no"/>

<!-- identity template: copy elements and attributes as they are -->
<xsl:template match="*|@*">
    <xsl:copy>
        <!-- select attributes explicitly; the default selects child nodes only -->
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<!-- Keep only ONE of the two text() templates below. -->

<!-- option 1: trim leading and trailing whitespace from all text content -->
<xsl:template match="text()">
    <xsl:variable name="trimmedHead" select="replace(.,'^\s+','')"/>
    <xsl:variable name="trimmed" select="replace($trimmedHead,'\s+$','')"/>
    <xsl:value-of select="$trimmed"/>
</xsl:template>

<!-- option 2: drop only text nodes that consist entirely of whitespace -->
<xsl:template match="text()">
    <xsl:if test="not(matches(.,'^\s+$'))">
        <xsl:value-of select="."/>
    </xsl:if>
</xsl:template>

</xsl:stylesheet>

You can choose whichever template you prefer. The first trims leading and trailing whitespace even where text content exists; the second removes a text node only when it consists solely of whitespace or newlines.
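The difference between the two text-node rules is easier to see on concrete strings. The following Python sketch mirrors them with the `re` module; `trim_text` and `drop_if_blank` are hypothetical names, and this is an analogy to the XSLT rules rather than a replacement for them.

```python
import re


def trim_text(text: str) -> str:
    """First rule: strip leading and trailing whitespace from any text node."""
    return re.sub(r"^\s+|\s+$", "", text)


def drop_if_blank(text: str) -> str:
    """Second rule: drop the node only if it is nothing but whitespace."""
    return "" if re.fullmatch(r"\s+", text) else text


title_text = "\n           Some title\n       "
gap_between_tags = "\n   "

print(repr(trim_text(title_text)))         # surrounding whitespace removed
print(repr(drop_if_blank(title_text)))     # left exactly as it was
print(repr(drop_if_blank(gap_between_tags)))  # removed entirely
```

The first rule gives the fully compact output from the question; the second is the safer choice when whitespace inside text content may matter.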

Regex.Replace(input, "<[^>]*>", String.Empty);

Try this:

s/[^\w\/\d<>]+//gs

s/>\s+</></gs

s/\s*(<[^>]+>)\s*/\1/gs

Or, in C#:

Regex.Replace(html, "\\s*(<[^>]+>)\\s*", "$1", RegexOptions.Singleline);

This removes the whitespace between tags as well as the space between tags and text.
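For comparison, here is a hedged Python sketch of the two whitespace-stripping substitutions, applied to the markup from the question. The function names are invented for the example. Note that `\s` already matches newlines, and the `s` / Singleline modifier only changes what `.` matches, so it has no effect on these particular patterns.

```python
import re

SOURCE = """<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>"""


def collapse_between_tags(markup: str) -> str:
    """Remove whitespace only where one tag directly follows another."""
    return re.sub(r">\s+<", "><", markup)


def collapse_around_tags(markup: str) -> str:
    """Remove whitespace on both sides of every tag, including next to text."""
    return re.sub(r"\s*(<[^>]+>)\s*", r"\1", markup)


print(collapse_between_tags(SOURCE))
print(collapse_around_tags(SOURCE))
```

The first function leaves the indentation around "Some title" in place; the second also trims it, which matches the wanted result. Both still inherit the usual regex limitations on HTML, for example a `>` inside an attribute value.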

s/(\s*(<))|((>)\s*)/\2\4/g

I wanted to preserve the newlines, since removing them was messing up my HTML. So I went with the following.

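// requires: using System.Text.RegularExpressions;
private static string ProcessHTMLFile(string input)
{
    // Remove runs of double spaces (indentation); a single leftover space is kept.
    string opt = Regex.Replace(input, @"(  )+", "");
    // Remove tabs; newlines are left untouched.
    opt = Regex.Replace(opt, @"\t+", "");
    return opt;
}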
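A variation on the same idea, sketched in Python: strip spaces and tabs only at the start and end of each line, so newlines and the spacing inside text stay intact. This is a different technique from the one above (it anchors on line boundaries instead of deleting every double space), and `strip_line_edges` is a hypothetical name.

```python
import re


def strip_line_edges(markup: str) -> str:
    """Remove leading and trailing spaces/tabs on every line, keep newlines."""
    # re.MULTILINE makes ^ and $ match at each line boundary.
    return re.sub(r"^[ \t]+|[ \t]+$", "", markup, flags=re.MULTILINE)
```

Because the pattern never matches `\n`, the line structure of the file is preserved, and a double space in the middle of a sentence is not swallowed.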
