XML Canonicalization
by Bilal Siddiqui
September 18, 2002
This two-part series discusses the W3C Recommendations Canonical
XML and Exclusive XML Canonicalization. In this first part I
describe the process of XML canonicalization, that is, of finding the
simplified form of an XML document, as defined by the Canonical XML
specification. We'll start by illustrating when and why we would need to
canonicalize an XML document.
Introduction
XML defines a format for structuring data so that information can be
meaningfully interchanged between communicating parties. The rules for XML
authoring are flexible in the sense that the same document structure and
the same piece of information can be represented by different XML
documents. Consider Listings 1 and 2, which are
logically equivalent, i.e. they follow the same document structure (the
same XML Schema) and are meant to convey the same information. In spite of
being logically equivalent, the XML files of Listings 1 and 2 do not
contain the same sequence of characters (or sequence of bytes or octets).
In this case the character and octet sequences of the two XML files
differ due to the order of attributes appearing in the room
element. There can be other reasons for having different octet streams for
logically equivalent XML documents. The purpose of finding the canonical
(or simplified) form of an XML document is to determine logical
equivalence between XML documents. W3C has defined canonicalization rules
such that the canonical form of two XML documents will be the same if they
are logically equivalent.
Whenever we are required to determine whether two XML documents are
logically equivalent, we will canonicalize each of them and compare the
canonical forms octet-by-octet. If the two canonical forms contain the
same sequence of octets, we will conclude that the two XML files are
logically equivalent.
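As a minimal sketch of this comparison, the Python snippet below canonicalizes two documents and compares the resulting octets. The two room documents are hypothetical stand-ins for Listings 1 and 2. Note also that `xml.etree.ElementTree.canonicalize` (Python 3.8 and later) implements the later C14N 2.0 rules rather than the Canonical XML algorithm described in this article, so it is used here only to illustrate the idea.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical stand-ins for Listings 1 and 2:
# the same information, with the attributes in a different order.
doc_a = '<room id="r1" type="double"/>'
doc_b = '<room type="double" id="r1"/>'


def logically_equivalent(xml_1: str, xml_2: str) -> bool:
    """Canonicalize both documents and compare the results octet by octet."""
    canon_1 = canonicalize(xml_1).encode("utf-8")
    canon_2 = canonicalize(xml_2).encode("utf-8")
    return canon_1 == canon_2


print(doc_a == doc_b)                      # compares the raw character sequences
print(logically_equivalent(doc_a, doc_b))  # compares the canonical forms
```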
Before we start exploring the technical details of the canonicalization
process, let's see when and why you would need to test logical equivalence
between XML documents.
The Need to Find Logical Equivalence between XML Documents
XML Digital Signatures is a W3C recommendation defining an XML format
for signing XML (or non-XML) documents. Signing XML documents can be
considered analogous to signing paper documents in ink. If you ever need
to sign paper documents (e.g. bank checks) while doing business in the
conventional way, you'll likely need to sign digital documents while doing
business electronically over the Internet.
Let's consider the process of signing a bank check. A bank check is
part of a system that is designed to handle the following two critical
requirements:
- Integrity: A check, once signed and issued, cannot be edited by
an unauthorized party. If someone tries to modify the amount written on a
check, he will damage the background imprint on the check. The background
imprint on the check, therefore, helps in maintaining confidence in the
integrity of bank checks.
- Non-repudiation: The signing authority cannot deny the act of
issuing the check. The signature on the check is verifiable against
the sample signatures held by the bank.
The two requirements are exactly the same in XML Digital
Signatures. The background imprint on a bank check is analogous to a
Message Digest algorithm, while the sample signatures held by the bank
are analogous to a private-public key pair in a PKI trust service.
Message Digests and PKI are very common terms used in literature
covering security over the Internet. However, for those XML programmers
who are new to security issues, the following is a brief description of
these terms. For more details, I've included a reference at the end of
the article.
Message Digests
Digest algorithms act upon and digest (consume) message octets to
calculate a digest value. The digest value depends on both the message
digest algorithm and the message itself. If an intruder modifies the
message, the digest computed over the modified octets will no longer match
the original digest value, so the modification can be detected. Message
digests are therefore the electronic way of enforcing integrity.
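The following sketch shows how a digest value exposes a modification. It assumes SHA-256 as the digest algorithm (the article does not name one) and uses a made-up check message.

```python
import hashlib


def digest(message: bytes) -> str:
    """Return the SHA-256 digest of the message octets as a hex string."""
    return hashlib.sha256(message).hexdigest()


# Hypothetical message octets, before and after tampering with the amount.
original = b"Pay to the order of Alice: 100"
tampered = b"Pay to the order of Alice: 900"

stored_digest = digest(original)

print(digest(original) == stored_digest)  # digest of the unchanged message
print(digest(tampered) == stored_digest)  # digest after one octet was changed
```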
Public Key Infrastructure (PKI) Trust Services
Trust services produce and maintain private-public key pairs. Trust
services are the custodians of private-public key pairs, just like your
bank is the custodian of your money as well as your account details.
Private keys are kept confidential, i.e. only the trust service and the
user (signing authority) know the value of the private key. Public keys,
on the other hand, are open to everybody. A signer will use his private key
to sign a message, along with its message digest value,
electronically. Anyone can use the corresponding public key to verify the signature
and the message digest. Public keys can only be used to verify signatures;
they cannot be used to produce signatures. Therefore, as long as your
private key is confidential, unauthorized signatures cannot be produced.
Signing Digital Data
When a user wants to sign an XML document, there is a two-step
procedure. First, the user digests the XML file to be signed and produces
a digest value. Second, the user signs a reference to the XML message,
together with the digest value, using the user's private key.
Verifying Signatures
An application receiving the signed message will validate the
signature using the signer's public key and will verify the integrity of the
message by checking the message digest. The process of digesting an XML
message relies on the sequence of bytes that represent an XML message. It
is actually the sequence of bytes that produces a message digest value and
the signatures. This means that Listings 1 and 2, although representing
the same information, will not produce the same digest value. If you sign
Listing 1 and the receiving party tries to verify Listing 2 against your
signature (which makes perfect sense, since the two documents are carrying
the same information), the verification process will fail.
Canonicalizing Before Signing and Verifying
Canonical XML finds its application in this scenario. The signing
authority will digest and sign the canonical form of the document instead
of the original form. Similarly, at the time of verification, the
canonical form will be verified instead of the original XML. Thus all
logically equivalent versions of the signed XML document will result in
successful verification (validation) of XML digital signatures.
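The effect of canonicalizing before digesting can be sketched as follows. This is an illustration under assumptions rather than an XML Digital Signatures implementation: the two documents are hypothetical, SHA-256 is an assumed digest algorithm, the signing step itself is omitted, and Python's `canonicalize` follows C14N 2.0 rather than the Canonical XML algorithm discussed in this article.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def raw_digest(xml_text: str) -> str:
    """Digest the document octets exactly as they were written."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()


def canonical_digest(xml_text: str) -> str:
    """Digest the canonical form of the document instead of the original."""
    canonical_form = canonicalize(xml_text)
    return hashlib.sha256(canonical_form.encode("utf-8")).hexdigest()


# Hypothetical documents: what the signer digested and what the verifier received.
signed_doc = '<room id="r1" type="double"/>'
received_doc = '<room type="double" id="r1"/>'

print(raw_digest(signed_doc) == raw_digest(received_doc))              # original forms
print(canonical_digest(signed_doc) == canonical_digest(received_doc))  # canonical forms
```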
Canonicalizing an XML Document
The Canonical XML specification defines an algorithm for producing the
canonical form of XML documents. You will need to perform the following
steps in order to canonicalize an XML document:
1. Encoding Scheme
All XML documents are composed of human readable text, which is a
sequence of characters. Encoding schemes are meant to represent characters
by octets. Therefore, the same XML file can be represented by an entirely
different octet stream just by changing the character encoding of the XML
file.
The Canonical XML specification dictates that the canonical form of an XML
document be encoded in UTF-8. Therefore, if the XML file
to be canonicalized uses any other encoding, it must be converted to UTF-8.
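A short sketch of the encoding step: the same characters yield different octets under different encodings, and transcoding brings them to the UTF-8 form. The sample text and the helper name `to_utf8` are illustrative assumptions, and the sketch ignores any encoding named in an XML declaration.

```python
def to_utf8(octets: bytes, source_encoding: str) -> bytes:
    """Decode octets from their source encoding and re-encode them as UTF-8."""
    return octets.decode(source_encoding).encode("utf-8")


# Hypothetical fragment containing one non-ASCII character (e with diaeresis).
text = "<name>Zo\u00eb</name>"

latin1_octets = text.encode("iso-8859-1")  # the character takes one octet
utf8_octets = text.encode("utf-8")         # the character takes two octets

print(latin1_octets == utf8_octets)                         # same text, different octets
print(to_utf8(latin1_octets, "iso-8859-1") == utf8_octets)  # after conversion
```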
2. Line Breaks
Line breaks in text files are normally represented either by
hexadecimal A (decimal 10) or hexadecimal D (decimal 13) or a combination
of these two octets. XML files are plain text files, so #xA and #xD
serve as line breaks in XML files as well.
The canonical form of XML requires that every line break (a lone #xD or
the sequence #xD #xA) be replaced with a single #xA. This should be done
before starting to process the XML file.
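A minimal sketch of this replacement, assuming the input has already been converted to UTF-8 octets; the function name `normalize_line_breaks` is hypothetical.

```python
def normalize_line_breaks(octets: bytes) -> bytes:
    """Replace #xD #xA pairs and lone #xD octets with a single #xA."""
    # Handle the two-octet sequence first, so that #xD #xA becomes one #xA
    # rather than two.
    return octets.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical input mixing all three line-break conventions.
mixed = b"<a>\r\n<b/>\r<c/>\n</a>"
print(normalize_line_breaks(mixed))
```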
3. Attribute values are normalized
All attribute values are required to be normalized in canonical form, as if
by a validating XML parser. The process of attribute value normalization
is defined in the XML 1.0 Recommendation by W3C (see Resources).
A simple example of attribute value normalization is demonstrated by
Listings 3 and 4. Listing 3 is the original
XML file before attribute value normalization, while Listing 4 shows all
attribute values in normalized form.
All attributes in Listing 3 are of string type without any entity or
character references. In this case, attribute value normalization simply
means the normalization of white space: every white space character (tab,
line break or ordinary space) is converted to #x20, the ordinary space
character. All id attributes in Listing 3 have two tabs in
them. Each tab in the id attribute values of Listing 3 has been
changed to a space in Listing 4.
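For string-type attributes without references, this white space rule can be sketched in a few lines. The function name and the sample id value are hypothetical, and the sketch deliberately ignores entity and character references, which the full normalization rules in XML 1.0 also cover.

```python
# Map tab (#x9), line feed (#xA) and carriage return (#xD) to a space (#x20).
WHITE_SPACE_TO_SPACE = {0x9: " ", 0xA: " ", 0xD: " "}


def normalize_attribute_value(value: str) -> str:
    """Replace each white space character in the value with a single #x20."""
    return value.translate(WHITE_SPACE_TO_SPACE)


# Hypothetical id value containing two tabs, as in the case described above.
raw_id = "room\t\t101"
print(repr(normalize_attribute_value(raw_id)))
```

Each tab is replaced one for one, so two tabs become two spaces; runs of white space are not collapsed for string-type attributes.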