Russian Digital Libraries Journal

2003 | Volume 6 | Issue 2

SXML: an XML document as an S-expression

Kirill Lisovskiy, Dmitry Lizorkin

Institute for System Programming RAS,
Moscow State University


This article is the first in the series of the articles dedicated to application of functional techniques for XML data management, which may be effectively used for implementation of XML-based digital libraries.

We consider SXML --- a representation of the XML Information Set in the form of S-expressions (nested lists) and Scheme functional programming language which naturally processes S-expressions and thus SXML. This approach avoids the impedance mismatch problem that currently affects many XML-based applications, including digital libraries.


1.  Introduction

Textual formats for representation of semistructured data, such as SGML and now XML, support the representation of a text as an "ordered hierarhy of content objects", and are very popular as a technology for digital libraries [1].

The "XML revolution" is a fait accompli now, and XML technologies are used in virtually every area of information technologies now. Digital libraries (DL) are not an exception here, and XML is generally considered as a key technology for modern digital libraries [2].

XML data processing languages suggested by the W3 Consortium, such as XPath, XSLT, XQuery etc., are not general-purpose programming languages and are thus not sufficient for implementing complete applications. That's why most XML applications are implemented by means of traditional programming languages, such as C and Java; or scripting languages, for example, Perl, JavaScript and Python.

An attempt to combine two different languages (for example, XPath and Java) leads to a problem known as impedance mismatch. Impedance mismatch problem consists of two aspects:
  • Different data models. E.g. XPath models an XML document as a tree, while most general purpose programming languages have no native data types for a tree.
  • Different computational paradigms. Say, XSLT is a functional language, while Java is object-oriented, and Perl is a procedural one.
Impedance mismatch requires complicated converters and APIs (for example, DOM) to be used for combining such two languages.

Impedance mismatch problem can be reduced and even eliminated using the Scheme functional programming language for XML data processing:
  • Nested lists (S-expressions in Scheme) provide a natural representation for nested XML documents. Scheme represents its code and data as nested lists of dynamically typed elements. XML document, being a hierarchical structure of nested XML elements, can be thought of as a hierarchical nested Scheme list (so-called S-expression).
  • Scheme is a functional language, as most XML-languages (e.g. XSLT and XQuery) are. Scheme processes (nested) lists in a recursive manner which can be thought of as traversing/transforming the document tree.
Scheme, being a Lisp dialect, is a widely recognized scripting language [3]. It is one of the most elegant and compact programming languages practically used: the description of Scheme standard [4] consists of 40 pages only. Scheme is a high-level programming language, suitable for fast prototyping. Moreover, Scheme programs are generally several times more compact than the equivalent C programs.

This paper introduces SXML --- a representation for an XML document in the form of an S-expression. XML and SXML textual notations are much alike: informally, SXML replaces XML start/end tags with opening/closing brackets. SXML, being an S-expression and thus the main data structure for Scheme programming language, is easily and naturally processed via Scheme.

Section 2 discusses the relationship between XML and SXML from the point of view of the XML Information Set W3C recommendation. Section 3 gives a simplified description of the SXML specification. Section 4 considers some important SXML features.

2.  XML, XML Information Set and SXML

An XML document [5] is essentially a tree structure. The start and the end tags of the root element enclose the whole content of the document, which may include other elements or arbitrary character data. Text with familiar angular brackets is an external representation of an XML document. Applications ought to deal with an internalized form: an XML Information Set, or its specializations (such as the DOM). This internalized form lets an application locate specific data or transform an XML tree into another tree.

The W3 Consortium defines the XML Information Set [6] (Infoset) as an abstract data set that describes information available in a well-formed XML document. An XML document's information set consists of a number of information items, which denote elements, attributes, character data, processing instructions, and other components of the document. Each information item has a number of associated properties, e.g., name, namespace, URI. Some properties --- for example, "children" and "attributes" --- are collections of other information items. Although technically Infoset is specified for XML it largely applies to other semi-structured data formats, in particular, HTML.

XML document parsing is just one of possible ways to create an instance of XML Infoset.

It's worth a note that XML Information Set recommendation does not attempt to be exhaustive, nor does it constitute a minimum set of information items and properties. Its purpose is to provide a consistent set of definitions for use in other specifications that need to refer to the information in a well-formed XML document.

The abstract data model defined in the XML Information Set Recommendation is applicable to every XML-related specification of the W3 Consortium. Namely, the Document Object Model [7] can be considered the application program interface (API) for dealing with information items; the XPath data model [8] uses the concept of nodes which can be derived from information items, etc. The DOM and the XPath data model are thus two instances of XML Information Set.

XML Information Set Recommendation itself imposes no restrictions on data structures or interfaces for accessing information items. Different interpretations are thus possible for the XML Information Set abstract data model. For example, it is convenient to consider an XML Information Set a tree structure, and the terms "information set" and "information item" are then similar in meaning to the generic terms "tree" and "node" respectively.

An information item may be also considered as a container for its properties, either text strings (e.g. name, namespace URI) or containers themselves (e.g. child elements for an XML element). The information set is thus a hierarchy of nested containers. As noted by Oleg Kiselyov [9], such a hierarchy of containers comprised of text strings and other containers greatly lends itself to be described by an S-expression, because the latter is recursively defined as a list whose members are either atomic values or S-expressions themselves. S-expressions [10] are easy to parse into an internal representation suitable for traversal; they also have a simple external notation, which is relatively easy to compose even by hand.

SXML is a concrete instance of the XML Infoset in the form of S-expressions. Infoset's goal is to present in some form all relevant pieces of data and their abstract, container-slot relationships with each other. SXML gives the nest of containers a concrete realization as S-expressions, and provides means of accessing items and their properties. SXML is a "relative" of XPath and the DOM, whose data models are two other instances of the XML Infoset. SXML is particularly suitable for Scheme-based XML/HTML authoring, XPath queries, and tree transformations.

XML and SXML can thus be considered two syntactically different representations for the XML Information Set.

3.  SXML specification

SXML is the concrete instance of the XML Infoset in the form of S-expressions. Further discussion on SXML in this section is based on the SXML specification [9].

A simplified SXML grammar in EBNF notation is presented on figure 1. An SXML <name> is a single Scheme symbol.

[1] <TOP> ::= ( *TOP* <PI>* <Element> )
[2] <Element> ::= ( <name> <attributes-list>? <child-of-element>* )
[3] <attributes-list> ::= ( @ <attribute>* )
[4] <attribute> ::= ( <name> "value"? )
[5] <child-of-element> ::= <Element> | "character data" | <PI>
[6] <PI> ::= ( *PI* pi-target "processing instruction content string" )

Table 1: A simplified SXML grammar


Since Infoset's information item is a sum of its properties, a list is a particularly suitable data structure to represent an item. The head of the list, a Scheme identifier, names the item. For many items this is their (expanded) item name. For an information item that denotes an XML element, the corresponding list starts with element's name, optionally followed by a collection of attributes. The rest of the element item list is an ordered sequence of element's children --- character data, processing instructions, and other elements in turn. Every child is unique; items never share their children even if the latter have the identical content.

Figure 2 illustrates the XML element and its SXML form (which satisfies the <Element> production from figure 1).

<WEIGHT unit="pound"> <NET certified="certified">67</NET> <GROSS>95</GROSS> </WEIGHT>

(WEIGHT (@ (unit "pound")) (NET (@ (certified)) "67") (GROSS "95") )


Table 2: An XML element (left) and its SXML representation (right)


The value of an attribute is normally a string; it may be omitted (in case of HTML) for a boolean attribute, e.g., a "certified" attribute on figure 2.

A collection of attributes is considered an information item in its own right, tagged with a special name @ . The character @ may not occur in a valid XML name; therefore an <attributes-list> cannot be mistaken for a list that represents an element. An XML document renders attributes, processing instructions and other meta-data differently from the element markup. In contrast, SXML represents element content and meta-data uniformly --- as tagged lists. RELAX NG [11] -- a schema language for XML -- also aims to treat attributes as uniformly as possible with elements. This uniform treatment, argues James Clark [12], is a significant factor in simplifying the language. SXML takes advantage of the fact that every XML name is also a valid Scheme identifier, but not every Scheme identifier is a valid XML name. This observation lets us introduce administrative names such as @, *PI*, *TOP* without worrying about potential name clashes. The observation also makes the relationship between XML and SXML well-defined. An XML document converted to SXML can be reconstructed into an equivalent (in terms of the Infoset) XML document. Moreover, due to the implementation freedom given by the Infoset specification, SXML itself is an instance of the Infoset.

The XML Recommendation specifies that processing instructions (PI) are distinct from elements and character data; processing instructions must be passed through to applications. In SXML, PIs are therefore represented by nodes of a dedicated type *PI*. XPath and the DOM Level 2 treat processing instructions in a similar way.

A sample XML document and its SXML representation are both shown on figure 3, thus providing an illustrative comparison between nested XML tags and nested SXML lists. Note that the SXML document is a bit more compact than its XML counterpart.

<?xml version='1.0'> <di contract="728g"> <wt refnum="345"> <delivery> <date month="6" day="01" year"2001"/> <weight>783</weight> </delivery> <vehicle type="lorry" number="A567TP99"/> </wt> <wt refnum="459"> <vehicle type="car" number="25676043"/> </wt> </di>

(*TOP* (*PI* xml "version='1.0'") (di (@ (contract "728g")) (wt (@ (refnum "345")) (delivery (date (@ (month "6") (day "1") (year "2001"))) (weight "783")) (vehicle (@ (type "lorry") (number "A567TP99")))) (wt (@ (refnum "459")) (vehicle (@ (type "car") (number "25676043")))) ) )


Table 3: An XML document (left) and its SXML representation (right)


SXML can also be considered an abstract syntax tree for a parsed XML document. An XML document or a well-formed part of it can automatically be converted into the corresponding SXML form via a functional Scheme XML parsing framework SSAX [13].

It's worth a note that SXML represents all the information contained in XML documents, including comments, namespaces and external entities. These are omitted in this article for the sake of simplicity, but they are considered in the SXML specification [9].

4.  SXML features

This section considers some important features of SXML [3], which are deductable from its grammar and properties of S-expressions.

4.1.  An SXML document as a tree of uniform nodes

Since an SXML document is essentially a tree structure, it can be described in a more uniform way by introducing the term of an SXML node for nodes of this tree.

An SXML node can be defined on the basis of SXML grammar (figure 1) as a single production --- [N] on figure 4. Alternatively, an SXML node can be defined as a set two mutually-recursive datatypes --- [N1], [N2] and [N3] on figure 4. In the latter case, a Node is constructed by adding a name to the Nodelist as its leftmost member; a Nodelist is itself a (sorted) list of Nodes.

[N] <Node> ::= <Element> | <attributes-list> | <attribute> | "character data: text string" | <TOP> | <PI>
[N1] <Node> ::= ( <name1> . <Nodelist> ) | "text string"
[N2] <Nodelist> ::= ( <Node> <Node>* )
[N3] <name1> ::= <name> | @ | *TOP* | *PI*

Table 4: An SXML document as a tree of nodes


Such a consideration emphasizes SXML tree structure and the uniform representation for Infoset's information items as Scheme S-expressions.

4.2.  SXML elements and attributes

The uniformity of the SXML representation for elements, attributes, and processing instructions simplifies queries and transformations. For the SXML data model, attributes and processing instructions look like regular elements with a distinguished name. Therefore, query and transformation functions dedicated to attributes become redundant, because ordinary functions with distinquished names can be used.

The uniform representation for SXML elements and attributes is especially convenient for practical tasks [14]. Differences between elements and attributes in XML are blurred. Choosing either an element or an attribute for representing concrete practical information is often a question of style [15], and such a choice can later be changed. Such a change in a data structure is expressed in SXML as simply an addition/removal of one hierarchy level, namely an attribute-list. This requires the minimal modification of an SXML application. For the SXML notation, the only difference between an attribute and an element is that the former is contained within the attribute-list (which is a special SXML node) and cannot have nested elements.

For example, if data restructuring requires that the weight of the delivered load, initially represented as a nested element, to be represented as an attribute, the SXML element

(delivery ... (weight "789"))) will change to:

(delivery (@ (weight "789")) ...) Such a notation for elements and attributes simplifies SXML data restructuring and allows uniform queries to be used for data processing.

4.3.  SXML as a Scheme program

The syntax of LISP family programming languages, in particular, Scheme, is based on S-expressions used for both data and code representation. This makes it possible and convenient for Scheme programs to be treated as a semi-structured data [16] and vice versa.

Since an SXML document and its nodes are S-expressions, they can be used for representing a Scheme program. For making this possible, it is required (and sufficient) that the first member of every list contained in the SXML tree is a function. The rest members of the list are than the arguments, which are passed to that function. In accordance with SXML grammar, attribute and element names and special names must be bound to functions.

An SXML document or an SXML node that fulfils these requirements may be considered a Scheme program which can be evaluated, for example, by means of eval function.

For example, if para and bold are defined as functions:

(define (para . x) (cons 'p x)) (define (bold . x) (cons 'b x)) the SXML element

(para "plain" (bold "highlighted") "plain") can be treated as a program, and the result of its evaluation is the SXML element:

(p "plain" (b "highlighted") "plain") Note that the result of evaluating such a program is not necessary an SXML element. Namely, a program may return a textual representation for the source data in XML or HTML [17]; or even have a side effect, such as saving the SXML data in a relational database.

5.  Conclusions

Traditional XML applications involve XML-language and general-purpose programming languages, which leads to an impedance mismatch problem. Impedance mismatch can be successfully overcome by using the Scheme functional programming language which is widely approved as a scripting language. Scheme operates S-expressions which are particularly suitable for representing hierarchical data structures.

Information available in a well-formed XML document is described by the XML Information Set. XML, the DOM, XPath data model, etc. are different instances of the XML Infoset. This paper suggested the other concrete instance of the XML Infoset --- SXML --- that has the form of an S-expression and can thus be easily and naturally processed via Scheme.

The technology considered may be employed for implementation of XML-based digital libraries, and is especially suitable for implementation of light-weight digital libraries [18].

References

[1]
Ioannis Papadakis, Vasillios Chrissikopoulos. A Digital Library Framework based on XML.
http://citeseer.nj.nec.com/543571.html

[2]
Cuneiform Digital Library Initiative to Use XML Encoding for Third Millennium Texts.
http://xml.coverpages.org/ni2001-11-06-c.html

[3]
K. Yu. Lisovsky. XML Applications Development in Scheme. Programming and Computer Software, Vol. 28, No. 4, 2002.
http://www.maik.rssi.ru/journals/procom.htm

[4]
Revised5 Report on the Algorithmic Language Scheme. Kelsey R., Clinger W., Rees J. (Editors). Higher-Order and Symbolic Computation, Vol. 11, No. 1, September, 1998 and ACM SIGPLAN Notices, Vol. 33, No. 9, October, 1998.
http://www.schemers.org/Documents/Standards/R5RS/

[5]
Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation 6 October 2000.
http://www.w3.org/TR/REC-xml

[6]
XML Information Set. W3C Recommendation 24 October 2001.
http://www.w3.org/TR/xml-infoset/

[7]
Document Object Model (DOM) Level 2 Core Specification Version 1.0. W3C Recommendation, November 13, 2000.
http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113

[8]
XML Path Language (XPath) Version 1.0. W3C Recommendation 16 November 1999.
http://www.w3.org/TR/xpath

[9]
Oleg Kiselyov. SXML, Revision 2.5. August 9, 2002.
http://okmij.org/ftp/Scheme/SXML.html

[10]
John McCarthy. Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I. Comm. ACM, 3(4):184-195, April 1960.
http://www-formal.stanford.edu/jmc/recursive/recursive.html

[11]
James Clark, Makoto Murata, editors. RELAX NG Specification. OASIS, 2001.
http://www.oasis-open.org/committees/relax-ng/spec-20011203.html

[12]
James Clark. The Design of RELAX NG. December 6, 2001.
http://www.thaiopensource.com/relaxng/design.html

[13]
Oleg Kiselyov. Functional XML parsing framework: SAX/DOM and SXML parsers with support for XML Namespaces and validation. September 5, 2001.
http://www.okmij.org/ftp/Scheme/SSAX.scm

[14]
A Triumph of Simplicity: James Clark on Markup Languages and XML. Dr. Dobbs Journal, July, 2001.
http://www.ddj.com/articles/2001/0107/0107e/0107e.htm

[15]
St.Laurent S. XML Elements of Style. McGraw-Hill, 2000.

[16]
Lisovsky K. Scheme program source code as a semistructured data. 2nd Workshop on Scheme and Functional Programming. Florence, September 2001.
http://kaolin.unice.fr/Scheme2001/article/lisovsky.ps

[17]
Kiselyov O. XML and Scheme. Workshop on Scheme and Functional Programming 2000, Montreal, 2000.
http://www.okmij.org/ftp/Scheme/SXML-short-paper.html

[18]
O.Juravleva, K.Lisovsky, E.Tomusjak, G.Tomusjak. Functional approach to semistructured data management and its application in implementation of digital libraries. RCDL'2001
http://pair.com/lisovsky/misc/biblio/index.html
 

About Authors

Dmitry Lizorkin - a Ph.D. student in the Moscow State University. His M.Sc. thesis, defended in 2002, was dedicated to implementation of XML Linking Language (XLink) using functional methods.
e-mail: lizorkin@hotbox.ru

Kirill Lisovskiy - PhD, IT Consultant and Senior Researcher Institute for System Programming Russian Academy of Science. His primary area of research interests is functional and logic techniques for semistructured data management. Since 1999 he had participated in a number of research and development projects related to implementation and application of XML data management techniques based on the Scheme programming language.
e-mail: lisovsky@acm.org
http://pair.com/lisovsky/


© Kirill Lisovskiy, Dmitry Lizorkin, 2003
Last update - : 2003-12-09

Please address your comments and suggestions to rdlp@iis.ru