Canonicalize XML in Java

XML canonicalization is often used when there is need to create digital signature to be sent to peers for verification. Since digital signature is created based on XML data, the XML data has to be canonicalized before its signature value can be calculated. Even an extra space may affect the signature value calculated, hence it must follow some rules to canonicalize the XML data so that it has a standard format. This is why W3C created specification Canonical XML Version 1.1.

This specification provides the rules to format element nodes, attribute nodes and namespace nodes etc. Different programming languages have implemented this specification so that we can use them to canonicalize XML data easily without knowing details about the specification.

In this tutorial, we will not go into the details into the specification, we will only focus on how to call Java API to canonicalize XML data. Based on XPath data model, a XML Document is represented by a set of nodes -- node set. While the XML API in Java receives a Data object which is to be canonicalized. To demonstrate the canonicalization, we will create a customized NodeSetDataImpl which implements NodeSetData and Iterator interfaces.

import java.util.Iterator;
import javax.xml.crypto.NodeSetData;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;

public class NodeSetDataImpl implements NodeSetData, Iterator {
	private Node ivNode;
	private NodeFilter ivNodeFilter;
	private Document ivDocument;
	private DocumentTraversal ivDocumentTraversal;
	private NodeIterator ivNodeIterator;
	private Node ivNextNode;

	public NodeSetDataImpl(Node pNode, NodeFilter pNodeFilter) throws Exception {
		ivNode = pNode;
		ivNodeFilter = pNodeFilter;

		if (ivNode instanceof Document) {
			ivDocument = (Document) ivNode;
		} else {
			ivDocument = ivNode.getOwnerDocument();
		}

		ivDocumentTraversal = (DocumentTraversal) ivDocument;
	}

	private NodeSetDataImpl(NodeIterator pNodeIterator) {
		ivNodeIterator = pNodeIterator;
	}

	public Iterator iterator() {
		NodeIterator nodeIterator = ivDocumentTraversal.createNodeIterator(ivNode, NodeFilter.SHOW_ALL, ivNodeFilter, false);
		return new NodeSetDataImpl(nodeIterator);
	}

	private Node checkNextNode() {
		if (ivNextNode == null && ivNodeIterator != null) {
			ivNextNode = ivNodeIterator.nextNode();
			if (ivNextNode == null) {
				ivNodeIterator.detach();
				ivNodeIterator = null;
			}
		}
		return ivNextNode;
	}

	private Node consumeNextNode() {
		Node nextNode = checkNextNode();
		ivNextNode = null;
		return nextNode;
	}

	public boolean hasNext() {
		return checkNextNode() != null;
	}

	public Node next() {
		return consumeNextNode();
	}

	public void remove() {
		throw new UnsupportedOperationException("Removing nodes is not supported.");
	}

	public static NodeFilter getRootNodeFilter() {
		return new NodeFilter() {
			public short acceptNode(Node pNode) {
				if (pNode instanceof Element && pNode.getParentNode() instanceof Document) {
					return NodeFilter.FILTER_SKIP;
				}
				return NodeFilter.FILTER_ACCEPT;
			}
		};
	}
}

Then we need to create a CanonicalizationMethod to conduct the actual canonicalization. There are a few choices here:

CanonicalizationMethod.INCLUSIVE
CanonicalizationMethod.INCLUSIVE_WITH_COMMENTS
CanonicalizationMethod.EXCLUSIVE
CanonicalizationMethod.EXCLUSIVE_WITH_COMMENTS

In this tutorial, we will only demonstrate the inclusive canonicalization. Below is the code to do this.

import java.io.ByteArrayInputStream;
import javax.xml.crypto.Data;
import javax.xml.crypto.OctetStreamData;
import javax.xml.crypto.dsig.CanonicalizationMethod;
import javax.xml.crypto.dsig.XMLSignatureFactory;
import javax.xml.crypto.dsig.spec.C14NMethodParameterSpec;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CanonicalizationTest {
	private static String INPUT = 
			"" +
			"hello" +
			"world" +
			"";
	
	public static void main(String[] args){
		transform(INPUT);
	}
	
	/**
	 * Create new document with data
	 * 
	 * @param xml
	 * @return
	 */
	private static Document createNewDocument(String xml){
		try{
			byte[] bytes = xml.getBytes("UTF-8");
			ByteArrayInputStream bin = new ByteArrayInputStream(bytes);
			
			DocumentBuilderFactory fac = DocumentBuilderFactory.newInstance();
			fac.setNamespaceAware(true);
			
			DocumentBuilder docBuilder = fac.newDocumentBuilder();
			
			Document doc = docBuilder.parse(bin);

			return doc;
		} catch (Exception ex){
			ex.printStackTrace();
		}
		return null;
	}
	
	/**
	 * Transform a XML string
	 * 
	 * @param xml
	 */
	private static void transform(String xml){
		Document doc = createNewDocument(xml);
		
		try{
			Data data = new NodeSetDataImpl(doc, NodeSetDataImpl.getRootNodeFilter());
			XMLSignatureFactory fac = XMLSignatureFactory.getInstance("DOM");
			CanonicalizationMethod canonicalizationMethod = fac.newCanonicalizationMethod(CanonicalizationMethod.INCLUSIVE, (C14NMethodParameterSpec)null);
			// Doing the actual canonicalization
			OctetStreamData transformedData =(OctetStreamData) canonicalizationMethod.transform(data, null);
			byte[] bytes = tool.Util.readStream(transformedData.getOctetStream());
			String str = new String(bytes);
			System.out.println(str);
		} catch (Exception ex){
			ex.printStackTrace();
		}
	}
}

First let's look at the output :

<my:Node xmlns:my="http://example.com" Id="1">hello</my:Node><my:Node xmlns:my="http://example.com" Id="2">world</my:Node>

The test will first create a Document and then this Document will be passed to NodeSetDataImpl to create the node set to be canonicalized. From the output, we can see that the Root node is removed in the canonicalized data, this is because the NodeFilter in NodeSetDataImpl has filtered this node. Next, the second my:Node has the xmlns:my node before Id node in the canonicalized form. This is based on the Canonical XML specification where the nodes should be in lexical order.

For other canonicalization methods, the logic is the same here. The difference is that the output is not the same because different canonicalization methods have different rules.

Canonicalize XML in Java

RELATED

0 COMMENT

ABOUT

HOW IT WORKS

FOLLOW US

FEEDBACK