
Remove whitespace inside XML tags with Java

I am receiving XML that contains the tags shown below. I read the XML file in Java with a SAX parser and save the values to a database. However, there appears to be whitespace after the opening p tag, as shown below.

     <Inclusions><![CDATA[<p>                                               </p><ul> <li>Small group walking tour</li> <li>Entrance fees</li> <li>Professional guide </li> <li>Guaranteed to skip the long lines</li> <li>Headsets to hear the guide clearly</li> </ul>
                <p></p>]]></Inclusions>

But when we insert the string we read into the database (PostgreSQL 8), it prints bad characters like the following in place of that whitespace.

\\011\\011\\011\\011\\011\\011\\011\\011\\011\\011\\011\\011

  • Small group walking tour
  • Entrance fees
  • Professional guide
  • Guaranteed to skip the long lines
  • Headsets to hear the guide clearly
\\012\\011\\011\\011\\011\\011

  1. Why is it printing bad characters (\\011\\011) like that?

  2. What is the best way to remove whitespace inside XML tags with Java? (Or how can I prevent those bad characters?)

I have looked at samples, but most of them are in Python.

This is how my program reads the XML with SAX:

Method 1

    // ResultHandler is the class used to read the XML.
    ResultHandler handler = new ResultHandler();
    // Use the default parser
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    // Open the XML file (inputFile is the path to the XML) and parse it;
    // try-with-resources closes the stream when parsing is done.
    try (FileInputStream in = new FileInputStream(new File(inputFile))) {
        saxParser.parse(in, handler);
    }

This is the ResultHandler class that Method 1 uses as the SAX handler to read the XML:

import org.apache.log4j.Logger;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// other imports

    class ResultHandler extends DefaultHandler {

        private static final Logger logger = Logger.getLogger(ResultHandler.class);

        // Text collected for the element currently being read
        private String strValue = "";

        @Override
        public void startDocument() {
            logger.debug("Start document");
        }

        @Override
        public void endDocument() {
            logger.debug("End document");
        }

        @Override
        public void startElement(String namespaceURI, String localName, String qName, Attributes attribs)
                throws SAXException {
            // Reset the collected text at the start of every element
            strValue = "";
            // add logic for the start of the tag.
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            // The parser may deliver one element's text in several chunks, so append
            strValue += new String(ch, start, length);
        }

        @Override
        public void endElement(String namespaceURI, String localName, String qName)
                throws SAXException {
            // add logic for the end of the tag.
        }
    }

So I need to know how to set setIgnoringElementContentWhitespace(true), or something similar, with the SAX parser.

You can try setting the following on your DocumentBuilderFactory:

setIgnoringElementContentWhitespace(true)

Because of this:

Due to reliance on the content model this setting requires the parser to be in validating mode

you also need to set:

setValidating(true)
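A minimal sketch of those two settings together, assuming a small inline DTD so the parser has a content model to validate against. The class name IgnoreWhitespaceDemo, the SAMPLE document and the Tour wrapper element are made up for illustration and are not from the question. As far as I understand it, the setting only drops whitespace that sits between child elements, so it would probably not touch whitespace inside text or CDATA content such as the p block in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original program.
class IgnoreWhitespaceDemo {

    // Assumed sample: an inline DTD declares that Tour contains only an Inclusions element.
    static final String SAMPLE =
          "<!DOCTYPE Tour [\n"
        + "  <!ELEMENT Tour (Inclusions)>\n"
        + "  <!ELEMENT Inclusions (#PCDATA)>\n"
        + "]>\n"
        + "<Tour>\n"
        + "    <Inclusions>Entrance fees</Inclusions>\n"
        + "</Tour>";

    // Returns how many child nodes the root element has after parsing.
    static int childCount(String xml, boolean ignoreWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The whitespace setting relies on the content model, so validation goes with it.
            factory.setValidating(ignoreWhitespace);
            factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("with settings:    " + childCount(SAMPLE, true));
        System.out.println("without settings: " + childCount(SAMPLE, false));
    }
}
```

With both settings on, the root should report only the Inclusions element as a child; with them off, the indentation around it shows up as extra text nodes.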

Alternatively, str = str.replaceAll("\\s+", ""); might work as well, but note that it removes every whitespace character, including the spaces between words.
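Since the question uses SAX rather than a DocumentBuilderFactory, here is a rough sketch of doing the cleanup in the handler itself. It collapses each run of whitespace to one space instead of deleting it, so the words stay separated. CleaningHandler, its values map and the parse helper are hypothetical names, not part of the asker's ResultHandler.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each element and normalizes its whitespace.
class CleaningHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    // Cleaned text keyed by element name (qName), in document order.
    final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start a fresh buffer for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Collapse tabs, newlines and repeated spaces to one space, then trim the ends.
        values.put(qName, text.toString().replaceAll("\\s+", " ").trim());
        text.setLength(0);
    }

    // Convenience helper for the sketch: parses an XML string and returns the cleaned values.
    static Map<String, String> parse(String xml) {
        try {
            CleaningHandler handler = new CleaningHandler();
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }
}
```

The value stored for Inclusions can then be written to the database without the tab and newline runs. Cleaning in endElement rather than in characters matters because the parser is free to split one element's text across several characters calls.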

  1. I am also looking for an exact answer, but I think this will help you.
    The C/Modula-3 octal notation and its meaning are described in this link.
    It says:

    • \\011 is the horizontal tab (ASCII HT)
    • \\012 is the line feed (ASCII NL, newline)
  2. You can replace multiple whitespace characters with a single space as follows:

    str = str.replaceAll("\\s([\\s])+", " ");
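To make both points concrete, here is a small sketch. The first method prints control characters in backslash-plus-octal form, which only mimics the notation and is not how PostgreSQL produces its output. The second wraps the replaceAll call from point 2. OctalDemo and both method names are invented for this example.

```java
// Hypothetical helper class for illustration only.
class OctalDemo {

    // Renders every control character below 0x20 as a backslash plus three octal digits.
    static String showControlChars(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x20) {
                out.append(String.format("\\%03o", (int) c)); // tab (decimal 9) becomes \011
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // The replacement from point 2: a run of two or more whitespace characters becomes one space.
    static String collapseSpaces(String s) {
        return s.replaceAll("\\s([\\s])+", " ");
    }

    public static void main(String[] args) {
        System.out.println(showControlChars("\t\n"));         // tab, then line feed
        System.out.println(collapseSpaces("Small\t\t\tgroup"));
    }
}
```

Because the pattern needs at least two whitespace characters, a lone tab or newline is left as it is; "\\s+" as the pattern would replace those single characters with a space as well.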
