
Conversion from HTML to XHTML changes euro symbol, preventing correct XML parsing

I am extracting information from an HTML file by parsing it with SAX in Java. The parsing program was given to me and already uses SAX, so I would like to keep it that way. Here is what I do:

  • I get the HTML file from a website
  • I transform it into valid XML using the JTidy library. However, this library turns every € symbol into "â??¬", and the result is my XHTML file
  • I feed the XHTML file to the parsing library so I can extract the data I want (I wrote the handlers: the startElement(), characters() and endElement() functions).

Problem: with that new string in place of the euro sign, the parser won't run. I get the message: "the entity acirc was referenced but not declared"
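For reference, the error can be reproduced with the JDK's built-in SAX parser alone. HTML defines the named entity &acirc;, but XML predefines only lt, gt, amp, apos and quot, so a document without a DTD that references &acirc; is not well-formed. The following is a minimal sketch; the class name UndeclaredEntityDemo and the helper parseError are made up for illustration and are not part of the original program.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not taken from the original parsing program.
class UndeclaredEntityDemo {

    // Returns null if the document parses, otherwise the parser's error message.
    static String parseError(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, so we end up here.
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // &acirc; is an HTML entity; nothing in this document declares it for XML.
        System.out.println(parseError("<p>&acirc;</p>"));
        // A document without entity references parses cleanly.
        System.out.println(parseError("<p>ok</p>"));
    }
}
```

The exact wording of the message depends on the parser implementation, so the sketch only reports it rather than matching on it.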

I just want the euro sign to stop being a problem. How do I sort this out?

Thanks everyone,

The issue you are having is one of encoding.

Some tool, somewhere in your pipeline, is mucking up the encoding, and then that error is carried forwards, creating an â in your output.

From the looks of it, the web site is using UTF-8 (as well it should), but the encoding is either misdeclared, or the declaration is ignored.

Whether one of the tools in your toolchain causes this problem, or the tools are being misused, is not entirely clear.
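The mechanism described above can be shown in a few lines. This is only a sketch of one plausible way the damage happens, assuming the page is UTF-8 and some step reads it as ISO-8859-1; the real pipeline may be using a different wrong charset. The class name MojibakeDemo and the method misdecode are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how a UTF-8 euro sign turns into "â" plus debris.
class MojibakeDemo {

    // Encode the text as UTF-8, then decode those bytes with the wrong charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+20AC is the euro sign; written as an escape so the source file
        // encoding cannot interfere with the demonstration.
        String garbled = misdecode("\u20AC");
        // Print code points rather than glyphs so the console encoding does not matter.
        garbled.chars().forEach(c -> System.out.printf("U+%04X%n", c));
    }
}
```

The euro sign is three bytes in UTF-8 (E2 82 AC). Read one byte per character, they become U+00E2 (â), a control character, and U+00AC (¬). A tool that then writes non-ASCII characters as named HTML entities would emit &acirc; for the first one, which is the entity the SAX parser complains about. Decoding the input with the charset the site really uses avoids the problem at its source.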

Use the HTML numeric code instead of the actual euro symbol €.
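As a sketch of why this works: a numeric character reference such as &#8364; needs no entity declaration, so a SAX parser resolves it to the euro sign even when the document has no DTD. The example below uses only the JDK's parser; NumericEntityDemo and textOf are hypothetical names, and the handler is far simpler than the startElement()/characters()/endElement() handlers in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: collects all character data from a small document.
class NumericEntityDemo {

    static String textOf(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per text node, so append each chunk.
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String price = textOf("<p>12 &#8364;</p>");
        // Compare against the escape for the euro sign instead of printing the glyph.
        System.out.println(price.equals("12 \u20AC")); // prints "true"
    }
}
```

This only sidesteps the symptom. If the input is still decoded with the wrong charset earlier in the pipeline, other non-ASCII characters will be damaged in the same way.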
