
Conversion from HTML to XHTML changes euro symbol, preventing correct XML parsing

Original post 2013-10-21 11:08:38 4 2 html/ xml/ parsing/ sax/ euro

I am extracting information from an HTML file by parsing it with SAX in Java. The parsing program was given to me and already uses SAX, so I would like to keep it that way. What I do is the following:

I get the HTML file from a website.
I transform it into valid XML using the JTidy library. However, this library turns every € symbol into "â??¬". The result is the file I call fileXHTML.
I feed fileXHTML to the parsing library so I can extract the data I want (I wrote the handlers: the functions startElement(), characters() and endElement()).

Problem: with that new string in place of the euro sign, the parsing library won't run. I get the message: "the entity acirc was referenced but not declared".

I just want my euro sign to not be a problem. How do I sort this out?

Thanks everyone,

2 answers

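The issue you are having is one of encoding.

Some tool in your pipeline is mangling the encoding, and that error is carried forward, producing the â in your output. Judging by the error message, that â is then written out as the HTML entity &acirc;. XML predefines only lt, gt, amp, apos and quot, so the parser rejects it as undeclared.

From the looks of it, the web site is using UTF-8 (as well it should), but the encoding is either misdeclared or the declaration is ignored. In UTF-8 the euro sign is the three bytes E2 82 AC; read as a single-byte encoding such as windows-1252, the first of those bytes becomes â.

It is not entirely clear whether one of the tools in your toolchain causes this problem or whether the tools are being misused.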
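The parser error itself can be reproduced without JTidy. The sketch below is an illustration only: the class and method names are made up for this example, and the sample documents are assumptions, not the asker's actual file. It feeds a tiny document containing &acirc; to the JDK's SAX parser and reports the failure, next to a document that uses only a predefined entity.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the asker's program.
class UndeclaredEntityDemo {

    // Returns null if the document parses, otherwise the parser's error message.
    static String parseError(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // Well-formedness errors such as an undeclared entity end up here.
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // &acirc; is an HTML entity; without a DTD that declares it, XML does not know it.
        System.out.println("acirc: " + parseError("<p>&acirc;</p>"));
        // &amp; is one of the five entities every XML parser predefines.
        System.out.println("amp:   " + parseError("<p>&amp;</p>"));
    }
}
```

The exact wording of the message depends on the parser implementation, so the sketch only distinguishes between "parsed" and "failed".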
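To see how a misread encoding produces an â, here is a minimal sketch. It assumes the wrong encoding is windows-1252, which is a guess; the source does not say which encoding was actually applied. The class and method names are hypothetical. Code points are printed in hex so that the console's own encoding does not interfere.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; windows-1252 is an assumed stand-in for the wrong encoding.
class MojibakeDemo {

    private static final Charset WRONG = Charset.forName("windows-1252");

    // Encode text as UTF-8, then (wrongly) decode those bytes as windows-1252.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, WRONG);
    }

    // Undo the mistake: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(WRONG);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // the euro sign, written as an escape
        String garbled = misdecode(euro);
        // One character has become three; the first is U+00E2 (a with circumflex).
        garbled.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        System.out.println(repair(garbled).equals(euro));
    }
}
```

If this is what is happening, the fix is to decode the page as UTF-8 at whichever step reads it. JTidy exposes setInputEncoding and setOutputEncoding for that purpose, though whether JTidy is the step at fault here is not established.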

Use the HTML number instead of the actual euro symbol.
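As a rough sketch of that idea: a numeric character reference such as &#8364; (8364 is the decimal value of U+20AC) needs no entity declaration, so a SAX parser accepts it and hands the real character to characters(). The helper names below are hypothetical, and where such a replacement would fit in the asker's pipeline is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class illustrating numeric character references.
class NumericReferenceDemo {

    // Replace every non-ASCII code point with a decimal numeric character reference.
    static String toNumericReferences(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    // Parse the document with SAX and return all character data it reports.
    static String textContent(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String escaped = toNumericReferences("10 \u20AC");
        System.out.println(escaped);
        // The parser resolves the reference back to the euro sign.
        System.out.println(textContent("<p>" + escaped + "</p>").equals("10 \u20AC"));
    }
}
```

This only sidesteps the symptom for text that is already decoded correctly; if the bytes were misread earlier in the pipeline, the references would encode the garbled characters instead.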


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
