Monday 4 April 2011

What is the DOM Load and Save API (org.w3c.dom.ls) and what is it good for?

org.w3c.dom.ls is one of those enigmatic packages listed at the end of the standard Java 1.6 APIs.  The Javadoc has few clues as to what it is for, and no examples showing you how to use it.  This is a shame, because it is really useful if you want to use DOM to deal with very large XML documents.

The DOM Load and Save API provides filtering of DOM nodes on input and output. During loading, each completed node can be inspected before it is included in the final document, and vetoed if it is unwanted.  Similarly, nodes can be omitted when a document is output, but that is of less interest to us.


When we process large XML documents, we might use this mechanism to filter out most of the nodes and leave ourselves with a smaller document.  However, we can put it to much better use: processing every item in a large feed, one item at a time.



As each item element is completed, it is passed to our filter class to be inspected. The item element is a DOM fragment that can be processed using standard DOM tools, such as XSLT or XPath.  Afterwards the filter can reject the item, and the memory the DOM fragment used can be freed.  In this way we can use DOM processing on very large XML documents without requiring huge amounts of memory.  We can also begin processing as soon as the first item has arrived, instead of having to wait until the whole document has been read.

Let's write some code:

This API consists entirely of interfaces that no documented API class implements or returns, and the Javadoc doesn't explain where implementations come from.  The missing piece of the jigsaw is the DOMImplementationRegistry, together with the magic feature string you have to pass to its getDOMImplementation method:
DOMImplementationRegistry registry = DOMImplementationRegistry
        .newInstance();
DOMImplementationLS domImpl = (DOMImplementationLS) registry
        .getDOMImplementation("XML 1.0 LS 3.0");
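The registry lookup returns null when no registered implementation supports the requested features, so it may be worth guarding the cast. A minimal sketch, assuming a hypothetical helper class named LsBootstrap (not part of the original code):

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;

/** Hypothetical helper that looks up the Load and Save factory once. */
class LsBootstrap {

    static DOMImplementationLS loadAndSave() throws Exception {
        DOMImplementationRegistry registry =
                DOMImplementationRegistry.newInstance();

        // Ask for an implementation supporting XML 1.0 and LS 3.0.
        DOMImplementation impl =
                registry.getDOMImplementation("XML 1.0 LS 3.0");
        if (impl == null) {
            // No registered implementation offers the requested features.
            throw new IllegalStateException(
                    "No DOM implementation supports Load and Save 3.0");
        }
        return (DOMImplementationLS) impl;
    }

    public static void main(String[] args) throws Exception {
        DOMImplementationLS domImpl = loadAndSave();
        System.out.println(domImpl.getClass().getName());
    }
}
```

The class name printed by main depends on which parser the JVM ships with, so it is only useful as a quick sanity check.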

Now you have a factory object from which to get implementations of the API interfaces.  The only one you need to implement yourself is LSParserFilter.  Let's create a general-purpose implementation that passes target elements to another class to do whatever processing is required:
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.traversal.NodeFilter;

/**
 * An implementation of {@link LSParserFilter} to process
 * DOM element nodes.
 */
public class ProcessingFilter implements LSParserFilter {

    private final ElementProcessor elementProcessor;
    private final String targetElementName;
    private final String targetNamespace;

    private boolean withinTargetNode = false;
    private Exception processingException = null;

    /**
     * Creates a filter that processes target elements in the given
     * namespace.
     *
     * @param elementProcessor
     *            the component that will process each target element.
     * @param targetElementName
     *            the local name of the target elements to process.
     * @param targetNamespace
     *            the namespace URI of the target elements to process, or
     *            null for elements that are in no namespace.
     */
    public ProcessingFilter(ElementProcessor elementProcessor,
            String targetElementName, String targetNamespace) {
        this.elementProcessor = elementProcessor;
        this.targetElementName = targetElementName;
        this.targetNamespace = targetNamespace;
    }

    /**
     * Creates a filter that processes target elements that are in no
     * namespace.
     *
     * @param elementProcessor
     *            the component that will process each target element.
     * @param targetElementName
     *            the local name of the target elements (with a null
     *            namespace URI) to process.
     */
    public ProcessingFilter(ElementProcessor elementProcessor,
            String targetElementName) {
        this(elementProcessor, targetElementName, null);
    }

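    /* Ask the parser to pass every node type to acceptNode. */
    public int getWhatToShow() {
        return NodeFilter.SHOW_ALL;
    }

    /*
     * Called when a start tag has been read, before any content.
     * We only note that we have entered a target element.
     */
    public short startElement(Element element) {
        if (isTargetElement(element)) {
            withinTargetNode = true;
        }
        return FILTER_ACCEPT;
    }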

    public short acceptNode(Node node) {
        /* When we get a completed target element, we process it. */
        if (isTargetElement(node)) {
            try {
                elementProcessor.process((Element) node);
            } catch (Exception ex) {
                /* Remember the failure and stop parsing. */
                this.processingException = ex;
                return FILTER_INTERRUPT;
            }
            withinTargetNode = false;
            /* Discard the processed element so its memory can be freed. */
            return FILTER_REJECT;
        }
        /* Keep nodes inside a target element; discard everything else. */
        return withinTargetNode ? FILTER_ACCEPT : FILTER_REJECT;
    }

    public Exception getProcessingException() {
        return this.processingException;
    }

    private boolean isTargetElement(final Node node) {
        return Node.ELEMENT_NODE == node.getNodeType()
                && targetElementName.equals(node.getLocalName())
                && (targetNamespace == null ?
                    node.getNamespaceURI() == null :
                    targetNamespace.equals(node.getNamespaceURI()));
    }
}

Then we put our actual processing in a class that implements this interface:
public interface ElementProcessor {
    void process(Element element) throws Exception;
}

To put these ideas into practice, let's print out the titles in an RSS feed using an XPath expression:
    /*
     * During application startup ...
     */
    DOMImplementationRegistry registry = DOMImplementationRegistry
            .newInstance();
    DOMImplementationLS domImpl = (DOMImplementationLS) registry
            .getDOMImplementation("XML 1.0 LS 3.0");

    /*
     *  The processing for each target element ...
     */
    ElementProcessor processor = new ElementProcessor(){
       
        private XPath xpath = XPathFactory.newInstance().newXPath();

        public void process(Element element) throws Exception {
            String title = (String) xpath.evaluate(
                    "./title", element, XPathConstants.STRING);
            System.out.println(title);
        }
    };
    ProcessingFilter filter = new ProcessingFilter(processor, "item");

    /*
     *  Create the parser and process the feed ...
     */
    LSParser parser = domImpl.createLSParser(
            DOMImplementationLS.MODE_SYNCHRONOUS, null);
    parser.setFilter(filter);
    LSInput input = domImpl.createLSInput();

    URL url = new URL("http://news.google.com/?output=rss");
    InputStream inputStream = new BufferedInputStream(url.openStream());
    try {
        input.setByteStream(inputStream);

        parser.parse(input);  // Returns almost empty document DOM

        Exception ex = filter.getProcessingException();
        if (ex != null){
            throw ex;
        }
    }
    finally {
        inputStream.close();
    }
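To try the technique without a network connection, the same approach can be run against an in-memory string. The following is a minimal, self-contained sketch rather than the code above: the class InMemoryFeedDemo, the method itemTitles and the sample XML are all made up for illustration, the filter is a cut-down inline version of ProcessingFilter that only matches elements whose local name is "item", and it reads the title with plain DOM calls instead of XPath.

```java
import java.util.ArrayList;
import java.util.List;

import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.traversal.NodeFilter;

/** Hypothetical demo: collects item titles from an XML string. */
class InMemoryFeedDemo {

    static List<String> itemTitles(String xml) throws Exception {
        final List<String> titles = new ArrayList<String>();

        LSParserFilter filter = new LSParserFilter() {
            private boolean withinItem = false;

            public int getWhatToShow() {
                return NodeFilter.SHOW_ALL;
            }

            public short startElement(Element element) {
                if ("item".equals(element.getLocalName())) {
                    withinItem = true;
                }
                return FILTER_ACCEPT;
            }

            public short acceptNode(Node node) {
                if (node.getNodeType() == Node.ELEMENT_NODE
                        && "item".equals(node.getLocalName())) {
                    // The item is complete: read its first title, if any.
                    Node title = ((Element) node)
                            .getElementsByTagName("title").item(0);
                    if (title != null) {
                        titles.add(title.getTextContent());
                    }
                    withinItem = false;
                    return FILTER_REJECT; // drop the processed item
                }
                return withinItem ? FILTER_ACCEPT : FILTER_REJECT;
            }
        };

        DOMImplementationRegistry registry =
                DOMImplementationRegistry.newInstance();
        DOMImplementationLS domImpl = (DOMImplementationLS) registry
                .getDOMImplementation("XML 1.0 LS 3.0");
        LSParser parser = domImpl.createLSParser(
                DOMImplementationLS.MODE_SYNCHRONOUS, null);
        parser.setFilter(filter);

        LSInput input = domImpl.createLSInput();
        input.setStringData(xml); // parse from a String, not a stream
        parser.parse(input);
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rss><channel><title>Feed</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        for (String title : itemTitles(xml)) {
            System.out.println(title);
        }
    }
}
```

The channel's own title is never collected, because it is completed while the filter is outside any item element and so is simply rejected.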

