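How to parse XMP in Java when it is not valid XML?
I am using javax.imageio to extract metadata from a PNG image. This works fine. However, the getAsTree method, which returns the actual metadata, appears to give me invalid XML. As a result, I don't know how to parse this XML to get at certain pieces of metadata: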
run:
Format name: javax_imageio_png_1.0
<javax_imageio_png_1.0>
<IHDR width="256" height="256" bitDepth="8" colorType="RGBAlpha" compressionMethod="deflate" filterMethod="adaptive" interlaceMethod="none"/>
<cHRM whitePointX="31269" whitePointY="32899" redX="63999" redY="33001" greenX="30000" greenY="60000" blueX="15000" blueY="5999"/>
<gAMA value="45454"/>
<iTXt>
<iTXtEntry keyword="XML:com.adobe.xmp" compressionFlag="FALSE" compressionMethod="0" languageTag="" translatedKeyword="" text="<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.0-c061 64.140949, 2010/12/07-10:57:01 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
xmlns:lr="http://ns.adobe.com/lightroom/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmp:MetadataDate="2012-12-05T21:36:19+01:00"
xmpMM:InstanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
xmpMM:DocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE"
xmpMM:OriginalDocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE">
<xmpMM:History>
<rdf:Seq>
<rdf:li
stEvt:action="saved"
stEvt:instanceID="xmp.iid:FC7F11740720681192B0AE5890E66CAE"
stEvt:when="2012-12-04T00:23:34+01:00"
stEvt:changed="/metadata"/>
<rdf:li
stEvt:action="saved"
stEvt:instanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
stEvt:when="2012-12-05T21:36:19+01:00"
stEvt:changed="/metadata"/>
</rdf:Seq>
</xmpMM:History>
<lr:hierarchicalSubject>
<rdf:Bag>
<rdf:li>Component|Software</rdf:li>
<rdf:li>Places|Paris</rdf:li>
<rdf:li>Product|Christensen</rdf:li>
<rdf:li>Product|Simba</rdf:li>
</rdf:Bag>
</lr:hierarchicalSubject>
<dc:subject>
<rdf:Bag>
<rdf:li>Christensen</rdf:li>
<rdf:li>Paris</rdf:li>
<rdf:li>Simba</rdf:li>
<rdf:li>Software</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="r"?>"/>
</iTXt>
<pHYs pixelsPerUnitXAxis="2835" pixelsPerUnitYAxis="2835" unitSpecifier="meter"/>
</javax_imageio_png_1.0>
Format name: javax_imageio_1.0
<javax_imageio_1.0>
<Chroma>
<ColorSpaceType name="RGB"/>
<NumChannels value="4"/>
<Gamma value="0.45453998"/>
<BlackIsZero value="TRUE"/>
</Chroma>
<Compression>
<CompressionTypeName value="deflate"/>
<Lossless value="TRUE"/>
<NumProgressiveScans value="1"/>
</Compression>
<Data>
<PlanarConfiguration value="PixelInterleaved"/>
<SampleFormat value="UnsignedIntegral"/>
<BitsPerSample value="8 8 8 8"/>
</Data>
<Dimension>
<PixelAspectRatio value="1.0"/>
<ImageOrientation value="Normal"/>
<HorizontalPixelSize value="0.35273367"/>
<VerticalPixelSize value="0.35273367"/>
</Dimension>
<Text>
<TextEntry keyword="XML:com.adobe.xmp" value="<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.0-c061 64.140949, 2010/12/07-10:57:01 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
xmlns:lr="http://ns.adobe.com/lightroom/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmp:MetadataDate="2012-12-05T21:36:19+01:00"
xmpMM:InstanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
xmpMM:DocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE"
xmpMM:OriginalDocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE">
<xmpMM:History>
<rdf:Seq>
<rdf:li
stEvt:action="saved"
stEvt:instanceID="xmp.iid:FC7F11740720681192B0AE5890E66CAE"
stEvt:when="2012-12-04T00:23:34+01:00"
stEvt:changed="/metadata"/>
<rdf:li
stEvt:action="saved"
stEvt:instanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
stEvt:when="2012-12-05T21:36:19+01:00"
stEvt:changed="/metadata"/>
</rdf:Seq>
</xmpMM:History>
<lr:hierarchicalSubject>
<rdf:Bag>
<rdf:li>Component|Software</rdf:li>
<rdf:li>Places|Paris</rdf:li>
<rdf:li>Product|Christensen</rdf:li>
<rdf:li>Product|Simba</rdf:li>
</rdf:Bag>
</lr:hierarchicalSubject>
<dc:subject>
<rdf:Bag>
<rdf:li>Christensen</rdf:li>
<rdf:li>Paris</rdf:li>
<rdf:li>Simba</rdf:li>
<rdf:li>Software</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="r"?>" language="" compression="none"/>
</Text>
<Transparency>
<Alpha value="nonpremultipled"/>
</Transparency>
</javax_imageio_1.0>
BUILD SUCCESSFUL (total time: 3 seconds)
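The invalid XML starts at the iTXtEntry element, which carries the xpacket bit and contains child elements, even though it is written as a self-closing tag rather than with an end tag. So when I try to parse this with a DOM Document and XPath, I get an error saying that the element cannot have ">" in its content.
I disabled DTD validation on the DocumentBuilderFactory. That did not help. I feel like I am going to end up using regular expressions, but that doesn't seem right. Why am I getting invalid XML from the getAsTree method in imageio in the first place, and what should I do about it?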
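Your question is nonsensical as posed, because IIOMetadata.getAsTree() returns a DOM Node object, which is the root of a tree of Nodes. This is an in-memory representation of XML. It is not parsed from anything, so it cannot be invalid. An XML document string may be invalid, but there is no string being parsed here. The getAsTree method creates the XML directly in memory.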
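The problem is that your output code is producing invalid XML. Whatever method you are using to serialize the Node from getAsTree() is wrong. Namely, it does not properly escape the value of the text attribute, which is itself an XML document string.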
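To make the escaping point concrete, here is a minimal sketch that serializes the same one-element DOM tree two ways. The class EscapeDemo and its methods naive, viaTransformer and sample are hypothetical names invented for this illustration; the naive printer is only a guess at the kind of hand-rolled output code that produces the broken dump in the question.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EscapeDemo {

    // Hypothetical naive printer: writes the attribute value verbatim,
    // so any markup inside the value leaks into the output unescaped.
    static String naive(Element e) {
        return "<" + e.getTagName() + " text=\"" + e.getAttribute("text") + "\"/>";
    }

    // Serializes through javax.xml.transform, which escapes markup
    // characters that appear inside attribute values.
    static String viaTransformer(Element e) {
        try {
            StringWriter out = new StringWriter();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.transform(new DOMSource(e), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            // Lazy error handling, acceptable for a demo only
            throw new IllegalStateException(ex);
        }
    }

    // Builds a one-element tree whose attribute value is itself an XML string,
    // mimicking the iTXtEntry element from the metadata tree.
    static Element sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element entry = doc.createElement("iTXtEntry");
            entry.setAttribute("text", "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"/>");
            doc.appendChild(entry);
            return entry;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Element entry = sample();
        System.out.println(naive(entry));          // not well-formed XML
        System.out.println(viaTransformer(entry)); // markup in the value is escaped
    }
}
```

The in-memory tree is identical in both cases; only the way it is written out differs.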
Below is a complete example demonstrating how to get the image metadata and serialize it to a (valid) XML string.
import java.io.*;
import java.util.*;
// for imageio metadata
import javax.imageio.*;
import javax.imageio.stream.*;
import javax.imageio.metadata.*;
// for xml handling
import org.w3c.dom.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
public class imgmeta {
    // Very lazy exception handling
    // This is just a quick example
    public static void main(String[] args) throws Exception {
        String filename = args[0];
        File file = new File(filename);
        ImageInputStream imagestream = ImageIO.createImageInputStream(file);

        // get a reader which is able to read this file
        Iterator<ImageReader> readers = ImageIO.getImageReaders(imagestream);
        ImageReader reader = readers.next();

        // feed image to reader
        reader.setInput(imagestream, true);

        // get metadata of first image
        IIOMetadata metadata = reader.getImageMetadata(0);

        // get any metadata format name
        // (you should prefer the native one, but not all images have one)
        // String mdataname = metadata.getNativeMetadataFormatName(); // might be null
        String[] mdatanames = metadata.getMetadataFormatNames();
        String mdataname = mdatanames[0];

        Node metadatadom = metadata.getAsTree(mdataname);
        // metadatadom is now a DOM Node root of a DOM tree
        // representing metadata in the image
        // Since it's in-memory, it can't be "invalid"
        // because it's already been parsed

        // now let's serialize to an XML string
        // javax.xml.transform.Transformer takes xml sources
        // in one representation and transforms them to xml
        // in another representation
        // Representations include: DOM, JAXB, SAX, stream, etc
        DOMSource source = new DOMSource(metadatadom);
        StringWriter writer = new StringWriter();
        StreamResult result = new StreamResult(writer);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(source, result);

        // THIS is what you want:
        String metadata_in_xml = writer.toString();

        // now print it:
        System.out.print(metadata_in_xml);
    }
}
Here is the test output from running it with an image I had lying around:
$ java imgmeta testimage.png | xmllint --format -
<?xml version="1.0" encoding="UTF-8"?>
<javax_imageio_png_1.0>
<IHDR width="149" height="237" bitDepth="8" colorType="RGBAlpha" compressionMethod="deflate" filterMethod="adaptive" interlaceMethod="none"/>
<iTXt>
<iTXtEntry keyword="XML:com.adobe.xmp" compressionFlag="0" compressionMethod="0" languageTag="" translatedKeyword="" text="&lt;?xpacket begin=&quot;?&quot; id=&quot;W5M0MpCehiHzreSzNTczkc9d&quot;?&gt; &lt;x:xmpmeta xmlns:x=&quot;adobe:ns:meta/&quot; x:xmptk=&quot;Adobe XMP Core 5.0-c061 64.140949, 2010/12/07-10:57:01 &quot;&gt; &lt;rdf:RDF xmlns:rdf=&quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&quot;&gt; &lt;rdf:Description rdf:about=&quot;&quot; xmlns:xmp=&quot;http://ns.adobe.com/xap/1.0/&quot; xmlns:xmpMM=&quot;http://ns.adobe.com/xap/1.0/mm/&quot; xmlns:stRef=&quot;http://ns.adobe.com/xap/1.0/sType/ResourceRef#&quot; xmp:CreatorTool=&quot;Adobe Photoshop CS5.1 Macintosh&quot; xmpMM:InstanceID=&quot;xmp.iid:D281E43D34DC11E2BFE69DA1E5D17E5F&quot; xmpMM:DocumentID=&quot;xmp.did:D281E43E34DC11E2BFE69DA1E5D17E5F&quot;&gt; &lt;xmpMM:DerivedFrom stRef:instanceID=&quot;xmp.iid:D281E43B34DC11E2BFE69DA1E5D17E5F&quot; stRef:documentID=&quot;xmp.did:D281E43C34DC11E2BFE69DA1E5D17E5F&quot;/&gt; &lt;/rdf:Description&gt; &lt;/rdf:RDF&gt; &lt;/x:xmpmeta&gt; &lt;?xpacket end=&quot;r&quot;?&gt;"/>
</iTXt>
<tEXt>
<tEXtEntry keyword="Software" value="Adobe ImageReady"/>
</tEXt>
</javax_imageio_png_1.0>
The resulting XML is valid:
$ java imgmeta testimage.png | xmllint --noout -
$
(No output means it is valid.)
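If xmllint is not at hand, roughly the same check can be done from Java by feeding the serialized string to a parser and seeing whether it throws. This is only a sketch: WellFormedCheck and isWellFormed are hypothetical names, and, like xmllint without a schema or DTD option, it tests well-formedness rather than validity against any schema.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    // Returns true if the string parses as an XML document, false if the
    // parser reports a syntax error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them itself
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not a problem with the input; surface it as a failure of the check itself
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Escaped markup inside an attribute value is fine...
        System.out.println(isWellFormed("<iTXtEntry text=\"&lt;x/&gt;\"/>"));
        // ...but a raw '<' inside an attribute value is a syntax error.
        System.out.println(isWellFormed("<iTXtEntry text=\"<x/>\"/>"));
    }
}
```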
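Notice how the value of the iTXtEntry element's text attribute is escaped. If you want to retrieve the data inside this attribute, you need to retrieve the string and then parse it as its own XML document to get another DOM (or whatever other form of) tree. The attribute keyword="XML:com.adobe.xmp" indicates that the value of the text attribute is an XML document containing XMP data.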
Here is some sample code demonstrating how to extract the attribute value and convert it between an XML string and a DOM tree.
import java.io.*;
// for xml handling
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import org.w3c.dom.*;

public class XMPExample {
    // Serialize a DOM tree to an XML string
    public static String transformXML(Node xml) throws Exception {
        StringWriter writer = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(xml), new StreamResult(writer));
        return writer.toString();
    }

    // Parse an XML string into a DOM Document
    public static Document transformXML(String xml) throws Exception {
        StringReader reader = new StringReader(xml);
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new StreamSource(reader), new DOMResult(doc));
        return doc;
    }

    // Returns the XMP packet as an XML string, or null if there is none
    public static String getXMP(Element metadata_dom) throws Exception {
        // (Element) type because getElementsByTagName() method is required
        // There are many more robust ways of selecting nodes
        // (e.g. javax.xml.xpath), but this is for a simple example
        // that only uses the native DOM methods

        // This is very brittle because we're making assumptions about
        // the metadata_dom structure. There are two sources of brittleness:
        // 1. The metadata format from `metadata.getMetadataFormatNames()`.
        //    You should probably settle on a standard one you know will
        //    exist, like 'javax_imageio_1.0'
        // 2. How the image stores the metadata. Usually XMP data will
        //    be in a text field with keyword 'XML:com.adobe.xmp', but
        //    I don't know that this is *always* the case.

        // the code below assumes "javax_imageio_png_1.0" format
        NodeList iTXtEntries = metadata_dom.getElementsByTagName("iTXtEntry");
        Element iTXtEntry = null;
        for (int i = 0; i < iTXtEntries.getLength(); i++) {
            Element entry = (Element) iTXtEntries.item(i);
            if (entry.getAttribute("keyword").equals("XML:com.adobe.xmp")) {
                iTXtEntry = entry;
                break;
            }
        }
        if (iTXtEntry == null) {
            return null;
        }
        // getAttribute() returns the unescaped value, i.e. the XMP document itself
        return iTXtEntry.getAttribute("text");
    }
}
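// Use like so, where metadata is the IIOMetadata from the first example and
// metadataname is "javax_imageio_png_1.0" (the format getXMP() assumes):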
Node metadatanode = metadata.getAsTree(metadataname);
String xmp_xml = XMPExample.getXMP((Element) metadatanode);
// xmp_xml is now an xml document STRING
System.out.print(xmp_xml);
// If you want to parse it as an XML document, use an XML parser.
Document xmp_dom = XMPExample.transformXML(xmp_xml);
// ...and you can serialize it again when you are done.
String xmp_xml_roundtripped = XMPExample.transformXML(xmp_dom);
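Once the XMP string is in hand, getting "certain metadata" out of it is ordinary namespace-aware DOM work. The sketch below pulls the dc:subject keywords out of a packet shaped like the one in the question. XmpSubjects, subjects and SAMPLE are hypothetical names for this illustration, and the code assumes the keywords are stored as rdf:li items under dc:subject, as in that dump; other XMP producers may lay the data out differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmpSubjects {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Cut-down packet modeled on the question's output (assumed shape)
    static final String SAMPLE =
          "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n"
        + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
        + " <rdf:RDF xmlns:rdf=\"" + RDF + "\">\n"
        + "  <rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC + "\">\n"
        + "   <dc:subject><rdf:Bag>\n"
        + "    <rdf:li>Paris</rdf:li>\n"
        + "    <rdf:li>Simba</rdf:li>\n"
        + "   </rdf:Bag></dc:subject>\n"
        + "  </rdf:Description>\n"
        + " </rdf:RDF>\n"
        + "</x:xmpmeta>\n"
        + "<?xpacket end=\"r\"?>";

    // Parses the XMP string and returns the text of every rdf:li under dc:subject.
    static List<String> subjects(String xmpXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Required so that lookups by namespace URI work
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmpXml)));

            List<String> result = new ArrayList<>();
            NodeList subjectNodes = doc.getElementsByTagNameNS(DC, "subject");
            for (int i = 0; i < subjectNodes.getLength(); i++) {
                Element subject = (Element) subjectNodes.item(i);
                NodeList items = subject.getElementsByTagNameNS(RDF, "li");
                for (int j = 0; j < items.getLength(); j++) {
                    result.add(items.item(j).getTextContent().trim());
                }
            }
            return result;
        } catch (Exception e) {
            // Lazy error handling, acceptable for a demo only
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(subjects(SAMPLE));
    }
}
```

Matching on namespace URIs rather than on the dc: and rdf: prefixes keeps the lookup working even if a file happens to bind those namespaces to different prefixes.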