Remove spaces inside XML tags with Java

Question

I am getting XML with the following tags. I read the XML file in Java with a SAX parser and save the values to a database. However, there appear to be spaces after the p tag, as shown below.

     <Inclusions><![CDATA[<p>                                               </p><ul> <li>Small group walking tour</li> <li>Entrance fees</li> <li>Professional guide </li> <li>Guaranteed to skip the long lines</li> <li>Headsets to hear the guide clearly</li> </ul>
                <p></p>]]></Inclusions>

But when we insert the string we read into the database (PostgreSQL 8), it prints bad characters like the ones below in place of those spaces.

\\011\\011\\011\\011\\011\\011\\011\\011\\011\\011\\011\\011

Small group walking tour

Entrance fees

Professional guide

Guaranteed to skip the long lines

Headsets to hear the guide clearly

\\012\\011\\011\\011\\011\\011

I want to know why it prints bad characters (\\011\\011) like that.
What is the best way to remove spaces inside XML tags with Java? (Or how can I prevent those bad characters?)

I have checked samples, but most of them are in Python.

This is how the XML is read with SAX in my program:

Method 1

    // ResultHandler is the class used to read the XML.
    ResultHandler handler = new ResultHandler();
    // Use the default parser
    SAXParserFactory factory = SAXParserFactory.newInstance();
    // Retrieve the XML file
    FileInputStream in = new FileInputStream(new File(inputFile)); // inputFile is the XML file.
    // Parse the XML input
    SAXParser saxParser = factory.newSAXParser();
    saxParser.parse(in, handler);

This is the ResultHandler class that Method 1 uses to read the XML with the SAX parser:

import org.apache.log4j.Logger;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// other imports

class ResultHandler extends DefaultHandler {

    private static final Logger logger = Logger.getLogger(ResultHandler.class);

    // Text collected for the element currently being read.
    private String strValue = "";

    @Override
    public void startDocument() {
        logger.debug("Start document");
    }

    @Override
    public void endDocument() {
        logger.debug("End document");
    }

    @Override
    public void startElement(String namespaceURI, String localName, String qName, Attributes attribs)
            throws SAXException {
        // Reset the collected text at the start of every tag.
        strValue = "";
        // add logic with start of tag.
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // SAX may deliver the text of one element in several chunks, so append.
        strValue += new String(ch, start, length);
    }

    @Override
    public void endElement(String namespaceURI, String localName, String qName)
            throws SAXException {
        // add logic to end of tag.
    }
}

So I need to know how to set setIgnoringElementContentWhitespace(true), or something similar, with the SAX parser.

Answer 1

You can try setting this on your DocumentBuilderFactory:

setIgnoringElementContentWhitespace(true)

Because of this note in its documentation:

Due to reliance on the content model this setting requires the parser to be in validating mode

you also need to set

setValidating(true)

Alternatively, str = str.replaceAll("\\s+", ""); might work as well, although it removes all whitespace, including the spaces between words.
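To make the two factory settings concrete, here is a minimal sketch of how they behave with a DOM parser. The class name, the sample document, and its internal DTD are all hypothetical and only serve the illustration. Note that the setting only drops whitespace between child elements of an element whose DTD content model allows no text; it would not touch the tabs inside the CDATA section shown in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class IgnoreWhitespaceDemo {

    // Hypothetical sample: the internal DTD declares that <list> contains only <item> elements.
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE list [\n"
        + "  <!ELEMENT list (item+)>\n"
        + "  <!ELEMENT item (#PCDATA)>\n"
        + "]>\n"
        + "<list>\n\t<item>a</item>\n\t<item>b</item>\n</list>";

    // Parses XML and returns how many child nodes <list> ends up with.
    static int childCount(boolean ignoreWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Both switches are needed together: the parser must validate
            // against the DTD to know which whitespace is ignorable.
            factory.setValidating(ignoreWhitespace);
            factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("child nodes with defaults:      " + childCount(false));
        System.out.println("child nodes ignoring whitespace: " + childCount(true));
    }
}
```

With the defaults, the line breaks and tabs between the item elements are kept as text nodes; with both settings enabled only the element children remain.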

Answer 2

I'm also looking for an exact answer, but I think this will help you.
See the C/Modula-3 octal notation and its meaning in this link.
It says:
- \\011 is the horizontal tab (ASCII HT)
- \\012 is the line feed (ASCII NL, newline)
You can replace multiple whitespace characters with a single space as follows:
str = str.replaceAll("\\s([\\s])+", " ");
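One possible way to apply this idea in a SAX handler is to collect the text of each element and normalize it once in endElement, before it is saved. The sketch below assumes a hypothetical handler class named NormalizingHandler and uses the simpler pattern "\\s+" (one or more whitespace characters) followed by trim(); it is not the questioner's ResultHandler, only an illustration of the approach.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class NormalizingHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    // Normalized text of every element, in the order the elements are closed.
    final List<String> values = new ArrayList<>();

    // Collapses each run of whitespace (tab, newline, space, ...) into one space
    // and strips leading and trailing whitespace.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // text may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        values.add(normalize(text.toString())); // this is the value to store
    }

    // Convenience helper for the example: parses a string and returns the collected values.
    static List<String> parse(String xml) {
        try {
            NormalizingHandler handler = new NormalizingHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.values;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Inclusions><![CDATA[<p>\t\t\t</p><ul> <li>Entrance fees</li> </ul>"
                + "\n\t\t<p></p>]]></Inclusions>";
        System.out.println(parse(xml).get(0));
    }
}
```

Normalizing in endElement rather than in characters avoids problems when the parser splits a run of whitespace across two characters calls.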

Answer 1 (4, accepted): 2012-04-23 08:43:29

Answer 2 (1): 2012-04-23 08:30:37
