Xml and XPath in Scala

Scala has built-in language support for XML literals, which lets you mix xml and Scala code. You can do so without a lot of escaping: when you concatenate some xml into a String in Java code, you have to escape the quotation marks around xml attribute values, but that is not necessary in Scala code.
For example, you can write Scala code like this:

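  import scala.xml.PrettyPrinter

  def helloXml(): Unit = {
    val myString = "Hello World"
    // an xml literal; the Scala expression inside { } is embedded as element content
    val someXml =
      <myRoot someAttribute="foo">
        <mySubElement anotherAttribute="bar">{myString}</mySubElement>
      </myRoot>

    println("Xml class: " + someXml.getClass().getName())
    // PrettyPrinter(width, step): maximum line width 50, indentation step 2
    println(new PrettyPrinter(50, 2).format(someXml)) // prints the xml shown in the example output below
  }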

The line above that begins with:
val someXml = ...
could optionally have been written as:
val someXml: scala.xml.Elem = ...
but the type annotation is optional (as it usually is in Scala, since in most cases the type can be inferred automatically).
Either way, with or without the explicit type, invoking the above method produces output like the following:

	Xml class: scala.xml.Elem
	<myRoot someAttribute="foo">
	  <mySubElement anotherAttribute="bar">
	    Hello World
	  </mySubElement>
	</myRoot>
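
To make the point about escaping more concrete, here is a minimal sketch that builds the same kind of element in two ways. The object and method names are made up for this illustration, and the sketch assumes that the scala.xml classes are on the classpath (from Scala 2.11 on they live in the separate scala-xml module).

```scala
import scala.xml.Elem

// Hypothetical names, used only for this illustration.
object XmlLiteralEscapingSketch {

  // Java-style string concatenation: the quotation marks around the
  // attribute value must be escaped with backslashes.
  def asConcatenatedString(text: String): String =
    "<mySubElement anotherAttribute=\"bar\">" + text + "</mySubElement>"

  // Scala xml literal: the attribute quotes need no escaping, and the value
  // of the Scala expression inside { } becomes the element's text content.
  def asXmlLiteral(text: String): Elem =
    <mySubElement anotherAttribute="bar">{text}</mySubElement>

  def main(args: Array[String]): Unit = {
    println(asConcatenatedString("Hello World"))
    println(asXmlLiteral("Hello World"))
  }
}
```

A side effect worth knowing about: text embedded through { } is escaped when the element is serialized, so a value such as "Tom & Jerry" should come out as "Tom &amp;amp; Jerry" in the printed xml, while the plain string concatenation passes the ampersand through unchanged.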

Now you might think that this is no big deal and nothing to be excited about, since you very rarely hardcode xml as in the example above.

Well, the native xml support in Scala does actually play quite a big role in the web application framework Lift.

Regarding Lift, it is probably fair to say that Lift is the same kind of killer app for Scala as Ruby on Rails is for Ruby.

The rest of this page will focus on how to use XPath with Scala.
Three kinds of things will be illustrated:

The example code below will fetch book titles from a "bookstore" xml document (by the way, the xml structure used is based on the one defined at http://www.w3schools.com/xpath/xpath_examples.asp).

The book titles will be fetched in two ways, with two implementations of the same abstract trait (which in the context of this example can be seen as the equivalent of a Java interface).

One of the implementations will fetch the book titles by using a native method available on the class 'scala.xml.Elem', namely the method named \

The other implementation will fetch the book titles by "extending" (kind of) the above class (scala.xml.Elem) with a new method (receiving an XPath expression as String parameter).

It will do so through an implicit conversion into a class that implements XPath support by delegating to an Xml Facade, which in turn uses some standard Java interfaces/classes to implement the XPath support.

The xml will look something like this:

    <bookstore>
      <book category="COOKING">
        <title lang="en">Everyday Italian</title>
        ...
      </book>
      <book category="CHILDREN">
        <title lang="en">Harry Potter</title>
        ...
      </book>
      ...
    </bookstore>

Then there will be two 'BookstoreInformationRetriever' implementations (as mentioned above, one using native Scala code and the other using an Xml Facade built on Java XPath), which will be used in assertions like the following in the test code:

    assertEquals("Everyday Italian", bookstoreInformationRetriever.getBookTitleFromBookstoreXml(bookstoreXml, 1))
    assertEquals("Harry Potter", bookstoreInformationRetriever.getBookTitleFromBookstoreXml(bookstoreXml, 2))

Now it is time to show some more code details. First, the JUnit test code below illustrates what the client code looks like.
When executed, it also verifies that the code works and that both 'BookstoreInformationRetriever' implementations behave in the same way, as documented in the abstract trait, e.g. that they return null if a title cannot be found.

import org.junit.Assert.assertEquals
import org.junit.Test
import xml.PrettyPrinter
class BookstoreInformationRetrieverTest {

  private val bookstoreXml: scala.xml.Elem =
    // The type declaration (i.e. ": scala.xml.Elem" above) is usually optional in Scala and could indeed be inferred here,
    // but it is used here to illustrate what kind of object you get when writing Xml in code as below.
    // Please note the special support for Xml in Scala: if you tried to do something similar
    // in Java, i.e. providing an Xml string to some constructor, you would have to escape
    // the quotation marks with backslashes, but that is not needed in the Scala code below.
    <bookstore>

      <book category="COOKING">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
      </book>

      <book category="CHILDREN">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
      </book>

      <book category="WEB">
        <title lang="en">XQuery Kick Start</title>
        <author>James McGovern</author>
        <author>Per Bothner</author>
        <author>Kurt Cagle</author>
        <author>James Linn</author>
        <author>Vaidyanathan Nagarajan</author>
        <year>2003</year>
        <price>49.99</price>
      </book>

      <book category="WEB">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
      </book>

    </bookstore>

  @Test
  def verifyThatBookTitlesCanBeRetrievedFromBookstoreXml(): Unit = {
    // verifies that both implementations behave in the same way
    verifyThatBookTitlesCanBeRetrievedFromBookstoreXml(new BookstoreInformationRetrieverImplementedWithNativeScala)
    verifyThatBookTitlesCanBeRetrievedFromBookstoreXml(new BookstoreInformationRetrieverImplementedByUsingJavaXPathCode)
  }

  private def verifyThatBookTitlesCanBeRetrievedFromBookstoreXml(
    bookstoreInformationRetriever: BookstoreInformationRetriever // the abstract trait with two implementations
  ): Unit = {
    verifyBookTitle("Everyday Italian",  1, bookstoreInformationRetriever)
    verifyBookTitle("Harry Potter",      2, bookstoreInformationRetriever)
    verifyBookTitle("XQuery Kick Start", 3, bookstoreInformationRetriever)
    verifyBookTitle("Learning XML",      4, bookstoreInformationRetriever)

    verifyBookTitle(null,                5, bookstoreInformationRetriever)
    verifyBookTitle(null,                0, bookstoreInformationRetriever)
    verifyBookTitle(null,               -1, bookstoreInformationRetriever)
  }
  
  private def verifyBookTitle(
    expectedTitle: String,
    bookItemNumber: Int,
    bookstoreInformationRetriever: BookstoreInformationRetriever
  ): Unit = {
    assertEquals(
      "Failing bookItemNumber: " + bookItemNumber + " for implementation: " + bookstoreInformationRetriever.getClass,
      expectedTitle,
      bookstoreInformationRetriever.getBookTitleFromBookstoreXml(bookstoreXml, bookItemNumber)
    )
  }
}

Below is the definition of the abstract trait (which in this example you can think of as a Java interface):

	import scala.xml.Elem

	/**
	 * In the context of this example, this Scala trait can be considered as equivalent to a Java interface.
	 * In other words, you can think of the below trait as the following java interface:
	 * interface BookstoreInformationRetriever {
	 *   String getBookTitleFromBookstoreXml(Elem xmlWithBooks, int bookItemNumber);
	 * }
	 */
	trait BookstoreInformationRetriever {

	  /**
	   * @param xmlWithBooks an xml structure with a root element that should contain "book" sub-elements,
	   *  and each such "book" element should contain "title" sub-elements
	   * @param bookItemNumber For example, if the integer 2 is used as value here,
	   *  then the "title" of the second "book" should be returned.
	   *  In other words, this index parameter is "one-based" (as in XPath expressions)
	   *  and _not_ zero-based as lists often are in many programming languages (including Java and Scala)   
	   * @return the content of the "title" element within the "book" element (within the xml 'xmlWithBooks')
	   *  which is specified by the bookItemNumber parameter.
	   *  Null should be returned if the specified bookItem could not be found within the xml.  
	   */
	  def getBookTitleFromBookstoreXml(xmlWithBooks: Elem, bookItemNumber: Int): String
	}

Below is one of the implementations of the above trait, i.e. the implementation which only uses the Scala core libraries.
Note that the backslash used below is a method invocation written in operator notation (the dot and the parentheses can be left out), i.e. the code 'xmlWithBooks \ "book"' can be read as 'xmlWithBooks.\("book")'.
It returns a 'NodeSeq', and you can directly apply the method again to the result, i.e. invoke the method '.\("title")'.

In other words, the following code:
val nodeSequenceWithTitles = xmlWithBooks \ "book" \ "title"
might instead be written as:
val nodeSequenceWithTitles = xmlWithBooks.\("book").\("title")
(in the above rows, the optional type 'NodeSeq' is not used explicitly, but it is used in the code example below)

import xml.{Node, NodeSeq, Elem}
class BookstoreInformationRetrieverImplementedWithNativeScala extends BookstoreInformationRetriever {

  /**
   * @see the trait BookstoreInformationRetriever, which specifies the desired behaviour of this method
   */  
  def getBookTitleFromBookstoreXml(xmlWithBooks: Elem, bookItemNumber: Int): String = {
    // The type declarations below are optional (as they usually are with Scala)
    // but are used below with the purpose of making it more obvious for you (i.e. the code reader)  
    // in which classes to look, if you want to figure out where the methods are defined
    // in the Scala library API at http://www.scala-lang.org/docu/files/api/index.html  
    val nodeSequenceWithTitles: NodeSeq = xmlWithBooks \ "book" \ "title"
    val nodesWithTitles: List[Node] = nodeSequenceWithTitles.toList
    if( !( 1 <= bookItemNumber && bookItemNumber <= nodesWithTitles.size) ) return null // return null as documented in the trait
    val node: Node = nodesWithTitles(bookItemNumber-1) // the Scala List is zero-based while the item number is one-based as documented in the trait
    node.text
  }

  // below is an alternative implementation to the above method
//  def getBookTitleFromBookstoreXml(xmlWithBooks: Elem, bookItemNumber: Int): String = {
//    var count = 0
//    val nodeSequenceWithTitles = xmlWithBooks \ "book" \ "title" filter (
//      _ => {
//        count += 1
//        bookItemNumber == count // the filter will at most return one node, i.e. the node where this row evaluates to true
//      }
//    )
//    if (nodeSequenceWithTitles.size != 1) return null
//    nodeSequenceWithTitles.toList(0).text
//  }  
}
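
The backslash method described above can also be tried in isolation. The following is only a sketch: the object name and the helper 'titleAt' are hypothetical and not part of the original example code, and the scala.xml classes are assumed to be on the classpath. It shows that the operator notation and the ordinary method-call notation select the same nodes, and it outlines a shorter way to get the "one-based index, null when out of range" behaviour that the trait documents, by using the 'lift' method that a 'NodeSeq' inherits from 'Seq'.

```scala
import scala.xml.{Elem, NodeSeq}

// Hypothetical names, used only for this illustration.
object BackslashMethodSketch {

  val bookstore: Elem =
    <bookstore>
      <book category="COOKING"><title lang="en">Everyday Italian</title></book>
      <book category="CHILDREN"><title lang="en">Harry Potter</title></book>
    </bookstore>

  // Operator notation and ordinary method-call notation are the same invocation.
  val titlesWithOperator: NodeSeq = bookstore \ "book" \ "title"
  val titlesWithDots: NodeSeq = bookstore.\("book").\("title")

  // One-based lookup: 'lift' turns an index that is out of range into None,
  // and 'orNull' turns None into null, as the trait documents.
  def titleAt(xmlWithBooks: Elem, bookItemNumber: Int): String =
    (xmlWithBooks \ "book" \ "title").lift(bookItemNumber - 1).map(_.text).orNull

  def main(args: Array[String]): Unit = {
    println(titlesWithOperator.map(_.text).mkString(", "))
    println(titlesWithDots.map(_.text).mkString(", "))
    println(titleAt(bookstore, 2))
    println(titleAt(bookstore, 0))
  }
}
```

Whether the version with 'lift' reads better than the explicit range check above is a matter of taste; the explicit check makes the one-based versus zero-based conversion more visible.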

Below is the alternative implementation of the trait. It uses XPath through some Java interfaces/classes (which is not directly visible in the code below, but will be seen further down). It also does so in a way that makes it look as if the XPath method 'getNodeListMatchingXPathExpression' were defined in the core Scala class 'scala.xml.Elem'. It is _not_ defined there, but thanks to an implicit conversion the code can indeed be written as below.

// Important import below! Without it, the method 'getNodeListMatchingXPathExpression'
// will not be available on instances of the class 'scala.xml.Elem'.
// That method is not defined in the class, but it can be used as if it were,
// thanks to the implicit conversion defined in the object imported from below.
import xmlXpathExtension.XmlExtensionWithSomeXPathSupport.scalaXmlElementToXmlExtensionWithSomeXPathSupport

class BookstoreInformationRetrieverImplementedByUsingJavaXPathCode extends BookstoreInformationRetriever {
  /**
   * @see the trait BookstoreInformationRetriever, which specifies the desired behaviour of this method 
   */
  def getBookTitleFromBookstoreXml(
    // the fully qualified type name is intentionally used here (see the comment in the code below) to
    // emphasize that 'Elem' is an existing class in the core library (which is kind of 'extended' below)
    xmlWithBooks: scala.xml.Elem,
    bookItemNumber: Int
  ) : String = {
    // This implementation will use xpath expressions such as "/bookstore/book[3]/title" for retrieving a book title
    val xPathExpression = "/bookstore/book[" + bookItemNumber + "]/title"

    // Please note that the type of the below 'xmlWithBooks' is the core Scala class 'scala.xml.Elem'
    // which does not provide the method invoked below, but still this code will work, because
    // of the implicit conversion to 'XmlExtensionWithSomeXPathSupport' (in the above imports) which defines the method used below  
    val nodeList: org.w3c.dom.NodeList = xmlWithBooks.getNodeListMatchingXPathExpression(xPathExpression)
    // (the typing is optional, e.g. the Java interface NodeList above can be automatically inferred instead of the explicit typing)
    
    if(nodeList.getLength < 1) return null // as documented in the abstract trait defining this method
    nodeList.item(0).getFirstChild.getNodeValue
  }
}

When the above code invokes the method 'getNodeListMatchingXPathExpression' on an object typed as 'scala.xml.Elem', the Scala compiler will see that no such method exists. Before reporting a compile error, it will look for an implicit conversion in scope (for example one brought in by an import) to a type that does have the method.

One of the imports looked like this:
import xmlXpathExtension.XmlExtensionWithSomeXPathSupport.scalaXmlElementToXmlExtensionWithSomeXPathSupport
That method is indeed defined with an implicit keyword and an implementation which will convert an instance of 'scala.xml.Elem' into some other class which implements the invoked method. Therefore the above code will work. Below is the implementation of the mentioned implicit method.

package xmlXpathExtension
import org.w3c.dom.NodeList
import xml.PrettyPrinter
object XmlExtensionWithSomeXPathSupport {
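  // an explicit result type is recommended for implicit definitions
  implicit def scalaXmlElementToXmlExtensionWithSomeXPathSupport(xml: scala.xml.Elem): XmlExtensionWithSomeXPathSupport = {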
    val xmlString = prettyPrinterUsedForConvertingXmlElementToString.format(xml)
    new XmlExtensionWithSomeXPathSupport(xmlString)
  }
  private val prettyPrinterUsedForConvertingXmlElementToString = new PrettyPrinter(100, 1)
}
class XmlExtensionWithSomeXPathSupport(private val xml: String) {
  def getNodeListMatchingXPathExpression(xPathExpression: String) : NodeList = {
    // The code below returns (but note that 'return' is an optional keyword in Scala and not used below)
    // the result from an Xml Facade method, which is implemented by
    // using some code from Java XML libraries
    XmlFacade.getNodeListMatchingXPathExpression(xml, xPathExpression)
  }
}
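
The same "extend an existing class" idea can be sketched on a smaller scale. The example below is not the code used on this page: the names 'ElemWithTitleTexts' and 'titleTexts' are made up, and it assumes Scala 2.10 or later, where an 'implicit class' combines the wrapper class and the implicit conversion method into one definition. It also shows what the compiler effectively does with the call, by writing the wrapping out explicitly.

```scala
import scala.xml.Elem

// Hypothetical names, used only for this illustration.
object ImplicitClassSketch {

  // Declares both a wrapper class and an implicit conversion from Elem to it.
  implicit class ElemWithTitleTexts(private val elem: Elem) {
    // \\ selects matching descendants at any depth (compare with // in XPath)
    def titleTexts: List[String] = (elem \\ "title").map(_.text).toList
  }

  val bookstore: Elem =
    <bookstore>
      <book><title>Everyday Italian</title></book>
      <book><title>Harry Potter</title></book>
    </bookstore>

  // 'Elem' has no method 'titleTexts', so the compiler inserts the conversion ...
  val viaImplicitConversion: List[String] = bookstore.titleTexts

  // ... which is equivalent to wrapping the element explicitly:
  val viaExplicitWrapping: List[String] = new ElemWithTitleTexts(bookstore).titleTexts

  def main(args: Array[String]): Unit = {
    println(viaImplicitConversion)
    println(viaExplicitWrapping)
  }
}
```

Outside of the object 'ImplicitClassSketch', the conversion has to be imported before it applies, in the same way as the implicit method above has to be imported by the code that uses it.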

The above implementation of the implicit conversion forwards the real task (returning a node list, given an XPath expression and some xml content) to an Xml Facade, and the code for that object is shown below.

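package xmlXpathExtension // assumed: the same package as XmlExtensionWithSomeXPathSupport, which refers to XmlFacade without an import
import javax.xml.parsers.DocumentBuilderFactory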
import java.io.StringReader
import javax.xml.xpath.{XPathFactory, XPathConstants}
import org.w3c.dom.{Node, NodeList, Document}
import org.xml.sax.InputSource
/**
 * The purpose of this facade class is to provide a simplified API to the core Java classes/interfaces
 * with a public method that can return a NodeList with nodes that matches a specified
 * XPath expression in an XML structure provided as a String.
 *
 * While the singleton instance (defined as 'object' instead of 'class') is written in Scala,
 * it is using lots of Java classes/interfaces in the implementation, as you can see
 * in the above list of imports. 
 */
object XmlFacade {

  /**
   * This is the public facade method (and please note that the default access level in Scala is public)
   * and the other methods further below are only private helper methods.
   */
  def getNodeListMatchingXPathExpression(xml: String , xPathExpression: String) : NodeList = {
    val document = getInputSourceAsDocument(getXmlStringAsInputSource(xml))
    // The type above is inferred automatically but could optionally have been declared explicitly as below:
    //val document: org.w3c.dom.Document = getInputSourceAsDocument(getXmlStringAsInputSource(xml))
    getNodeListMatchingXPathExpression(document, xPathExpression)
  }

  private val xPath = XPathFactory.newInstance().newXPath()

  private def getNodeListMatchingXPathExpression(
    node: Node,
    xPathExpressionAsString: String // "AsString" suffix: the local variable 'xPathExpression' below is a 'javax.xml.xpath.XPathExpression'
  ) : NodeList = {
    val xPathExpression = xPath.compile(xPathExpressionAsString)
    val nodesFoundByXPathExpression = xPathExpression.evaluate(node, XPathConstants.NODESET)
    // the declared return type of 'evaluate' is Object,
    // but with 'XPathConstants.NODESET' as the requested type the concrete result is an 'org.w3c.dom.NodeList',
    // so the Object is cast to a NodeList before it is returned
    nodesFoundByXPathExpression.asInstanceOf[NodeList]
  }

  private val documentBuilderFactory = DocumentBuilderFactory.newInstance()
  documentBuilderFactory.setNamespaceAware(true)
  private val documentBuilder = documentBuilderFactory.newDocumentBuilder()

  private def getInputSourceAsDocument(xmlInputSource: InputSource) : Document = {
    documentBuilder.parse(xmlInputSource)
  }

  private def getXmlStringAsInputSource(xmlString: String): InputSource = {
    val reader = new StringReader(xmlString)
    new InputSource(reader)
  }
}
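
To see why the XPath based implementation can simply check for an empty node list, the following sketch evaluates the same kind of expression directly with the standard Java classes. The object name 'XPathPositionSketch' and the shortened bookstore xml are assumptions made for this illustration only. A position predicate such as book[0], book[-1] or book[5] is valid XPath but matches nothing in a document with fewer books, so the result is an empty NodeList rather than an exception.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.NodeList
import org.xml.sax.InputSource

// Hypothetical names, used only for this illustration.
object XPathPositionSketch {

  private val bookstoreXml =
    "<bookstore>" +
      "<book><title>Everyday Italian</title></book>" +
      "<book><title>Harry Potter</title></book>" +
    "</bookstore>"

  // Parses the xml string and returns the "title" nodes of the book at the
  // given one-based position (an empty NodeList if there is no such book).
  def titleNodes(bookItemNumber: Int): NodeList = {
    val document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new InputSource(new StringReader(bookstoreXml)))
    val xPath = XPathFactory.newInstance().newXPath()
    val expression = "/bookstore/book[" + bookItemNumber + "]/title"
    // 'evaluate' is declared to return Object; NODESET yields a NodeList
    xPath.evaluate(expression, document, XPathConstants.NODESET).asInstanceOf[NodeList]
  }

  def main(args: Array[String]): Unit = {
    println(titleNodes(2).item(0).getTextContent)
    println(titleNodes(0).getLength)
  }
}
```

This sketch creates a new parser and XPath object on every call to keep the example short, whereas the facade above creates them once and reuses them.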




/ Tomas Johansson, Stockholm, Sweden, 2010-02-12