canon3_file_format: A canonical, N3-based file format

Author: Benja Fallenstein
Date: 2003-04-01
Revision: 1.6
Last-Modified: 2003-07-05
Type: Architecture
Scope: Major
Status: Irrelevant

We need a canonical file format for storing data in CVS. It must be canonical so that diffs show only differences in structure, not changes that arise because one RDF writer happened to order triples differently than another. This format is also a candidate for storing versions of RDF graphs in Storm.

This PEG specifies such a format.

Issues

Specification

The name of the format is Canon3. This version is identified by the URI <http://fenfire.org/2003/Canon3/1.0>. It is related to both Notation 3 and N-Triples. Canon3 files are encoded as UTF-8, normalized to Unicode Normalization Form C. They obey the following grammar:

document ::= header (triple)*
header ::= "# Canon3 <http://fenfire.org/2003/Canon3/1.0/>" NEWLINE
triple ::= subject " " property " " object "." NEWLINE
subject ::= URItoken | anonNode
property ::= URItoken
object ::= URItoken | anonNode | literal
URItoken ::= "<" URIref ">"
anonNode ::= "_:" [A-Za-z][A-Za-z0-9]*
literal ::= #x22 #x22 #x22 string #x22 #x22 #x22 qualifiers
qualifiers ::= ("@" language)? ("^^" type)?
type ::= URItoken

A conforming processor must not accept faulty Canon3 files.
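
As a rough illustration of this strictness, the following Python sketch checks two of the simpler productions from the grammar above: the header line and the anonNode token. The function names are hypothetical, and this is not a full Canon3 parser; it only shows that a check should match the whole token exactly rather than tolerate near misses.

```python
import re

# Header text from the `header` production, without the trailing NEWLINE.
HEADER = "# Canon3 <http://fenfire.org/2003/Canon3/1.0/>"

# anonNode ::= "_:" [A-Za-z][A-Za-z0-9]*
ANON_NODE = re.compile(r"_:[A-Za-z][A-Za-z0-9]*")


def is_header(line: str) -> bool:
    """True only for the exact header text; no variation is tolerated."""
    return line == HEADER


def is_anon_node(token: str) -> bool:
    """True if the whole token matches the anonNode production."""
    return ANON_NODE.fullmatch(token) is not None
```

For instance, is_anon_node("_:a1") holds, while "_:1a" is rejected because the first character after "_:" must be a letter.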

The NEWLINE token may be any of CR, LF, CRLF, or the Unicode LINE SEPARATOR (U+2028). This flexibility is necessary for the format to be useful with CVS across platforms. In contexts where the specific form matters, the newline character is LF (in particular, when computing a content hash, e.g., when creating a Canon3 Storm block). It would be nicer to use LINE SEPARATOR, but that would be incompatible with N3.
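
A minimal Python sketch of this normalization follows. The function names are hypothetical, and SHA-1 is used only as a stand-in hash, since this document does not say which hash Storm uses. The sketch also assumes that newlines inside string literals are normalized in the same way as the NEWLINE tokens between triples.

```python
import hashlib


def normalize_newlines(text: str) -> str:
    """Map CRLF, CR, and U+2028 to LF."""
    # CRLF is replaced first so that it does not turn into two LFs.
    return (text.replace("\r\n", "\n")
                .replace("\r", "\n")
                .replace("\u2028", "\n"))


def content_hash(text: str) -> str:
    """Hash the LF-normalized, UTF-8 encoded text (SHA-1 is an assumption)."""
    data = normalize_newlines(text).encode("utf-8")
    return hashlib.sha1(data).hexdigest()
```

With this, two checkouts of the same file that differ only in their platform's line endings produce the same hash.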

A string is any UTF-8 character sequence encoded in the following way:

For example, the string f\oo"""""ba"r becomes f\\oo\"\"\"""ba"r.

Strings may contain newlines. Like all of Canon3, they are encoded in Normalization Form C. They are enclosed in triple double quotes (see production literal).
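
The following Python sketch shows one way such an encoder might look. Its escaping rules are an assumption inferred from the example above and from the example file further down, not a restatement of the normative rules: backslashes are doubled, each run of three double quotes is written as \"\"\", and a double quote at the very end of the string is written as \". The function name encode_string is hypothetical.

```python
import unicodedata


def encode_string(value: str) -> str:
    """Return `value` as a triple-quoted Canon3 string (inferred rules)."""
    text = unicodedata.normalize("NFC", value)
    # Double backslashes first, before any escapes are introduced.
    text = text.replace("\\", "\\\\")
    # Set a trailing quote aside so it cannot merge with the closing quotes.
    ends_with_quote = text.endswith('"')
    if ends_with_quote:
        text = text[:-1]
    # Escape each (non-overlapping) run of three double quotes.
    text = text.replace('"""', '\\"\\"\\"')
    if ends_with_quote:
        text += '\\"'
    return '"""' + text + '"""'
```

Applied to the string f\oo"""""ba"r, this yields """f\\oo\"\"\"""ba"r""", which agrees with the example above once the enclosing quotes are added.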

The triples must be ordered. Two triples are compared by comparing their subjects, properties, and objects in this order. Each of these parts is compared as follows:

A triple may be listed only once: if the graph to be serialized contains two equal triples, that triple must occur only once in the serialization.
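
The ordering and uniqueness requirements can be sketched in Python as follows. Triples are represented as tuples of already serialized tokens. Because the comparison of individual parts is defined separately, the sketch takes it as a parameter, part_key, and uses plain string comparison only as a placeholder assumption. All names here are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# (subject, property, object) as serialized tokens
Triple = Tuple[str, str, str]


def canonical_order(
    triples: Iterable[Triple],
    part_key: Callable[[str], object] = lambda part: part,
) -> List[Triple]:
    """Drop duplicate triples, then sort by subject, property, object."""
    unique = set(triples)
    return sorted(
        unique,
        key=lambda t: (part_key(t[0]), part_key(t[1]), part_key(t[2])),
    )
```

Sorting on the tuple of keys compares subjects first, then properties, then objects, which matches the order of comparison given above.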

URIref is one of the following:

  1. An RDF URI reference encoded in UTF-8 (Normalization Form C), like the rest of Canon3.
  2. An RDF URI reference with everything before the fragment identifier (if any) omitted. This refers to the current document (in the case of the empty string) or to a fragment of it (e.g., #foo).
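
To illustrate the two forms, the following Python sketch resolves a URIref against the URI of the document it appears in, using urljoin from the standard library. The function name and the base URI http://example.org/doc are hypothetical; this document does not say how a processor learns the document's own URI.

```python
from urllib.parse import urljoin


def resolve_uriref(document_uri: str, uriref: str) -> str:
    """Resolve a Canon3 URIref relative to the current document."""
    # An empty reference yields the document URI itself, and a
    # fragment-only reference yields a fragment of that document.
    return urljoin(document_uri, uriref)


base = "http://example.org/doc"  # hypothetical document URI
same_document = resolve_uriref(base, "")
fragment = resolve_uriref(base, "#Foo")
absolute = resolve_uriref(base, "urn:x-foo:related")
```

Note that urljoin also accepts general relative references such as ../other, which Canon3 does not allow; a strict processor would have to reject those before resolving.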

language is a Language-Tag as defined by [RFC 3066]. If present, language and type indicate the language tag and data type of a literal.
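
As a small sketch of the qualifiers production, the following Python function builds the suffix that follows the closing quotes of a literal. The function and parameter names are hypothetical, and the language tag is not validated against RFC 3066 here.

```python
from typing import Optional


def qualifiers(language: Optional[str] = None,
               datatype: Optional[str] = None) -> str:
    """Build the optional "@language" and "^^<type>" suffix of a literal."""
    out = ""
    if language is not None:
        out += "@" + language
    if datatype is not None:
        # The type is a URItoken, so it is wrapped in angle brackets.
        out += "^^<" + datatype + ">"
    return out
```

For example, appending qualifiers(datatype="http://www.w3.org/2001/XMLSchema#int") to """7""" gives the typed literal used in the example file below.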

Here's an example Canon3 file:

# Canon3 <http://fenfire.org/2003/Canon3/1.0/>
<> <http://example.org/name> """Foobar
An example Canon3 "document\""""@en.
<> <http://example.org/name> """Foobar
Ein Beispiel eines Canon3-"Dokumentes\""""@de.
<> <http://example.org/isa> <http://example.org/document>.
<#Foo> <http://example.org/name> """Foo fragment identifier""".
<http://example.org> <urn:x-files:rating> """7"""^^<http://www.w3.org/2001/XMLSchema#int>.
<http://example.org> <urn:x-foo:related> <urn:x-foo:rittlefricks>.

We will register a MIME type for Canon3.

- Benja