simple_storm--benja: Simplify Storm by dropping headers

Author: Benja Fallenstein
Date: 2003-02-16
Revision: 1.5
Last-Modified: 2003-04-10
Type: Architecture
Scope: Major
Status: Implemented

Storm's MIME headers make it quite complex, and it is likely to become more complex still if we choose to hash headers and bodies separately (raw_blocks--benja). If we are going to break backward compatibility once, as Tuomas suggests, we should take the opportunity to get rid of our past mistakes and make the future simpler.

By analogy with the data URL scheme [RFC2397], this PEG proposes registering a URN namespace whose URIs contain a MIME type and the content hash of a block of data. "data" URLs contain a MIME type and a sequence of bytes, either literal or encoded as base64. The analogy runs deep: a "data" URL is a MIME type plus an immutable byte sequence, and so is a URI in this URN namespace. The MIME type is included in "data" URLs because it is considered the one absolutely essential piece of metadata needed to interpret the byte sequence; the same holds for this URN namespace.

Issues

Changes

Storm blocks do not have headers any more; the hash in their URN is only of the body. Block URNs have the following form:

blockurn   := namespace "1.0:" [ mediatype ] "," bitprint
mediatype  := [ type "/" subtype ] *( ";" parameter )
parameter  := attribute "=" value

namespace is an informal URN namespace to be registered, like urn:urn-5. Until it is registered, urn:storm: is used. bitprint is a Bitzi bitprint as defined by <http://bitzi.com/developer/bitprint>; that is, 32 characters (the base32-encoded SHA-1 hash), a dot, and 39 more characters (the base32-encoded Tiger tree hash).
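As an illustration of the grammar above, here is a minimal sketch of splitting a block URN into its mediatype and bitprint. It assumes the provisional urn:storm: prefix; the name parse_block_urn is hypothetical and not part of Storm. Since a bitprint never contains a comma, the sketch splits at the last comma.

```python
import re

# Assumption: the provisional namespace "urn:storm:" followed by "1.0:".
PREFIX = "urn:storm:1.0:"

# 32 base32 characters, a dot, then 39 base32 characters.
BITPRINT_RE = re.compile(r"[a-z2-7]{32}\.[a-z2-7]{39}", re.IGNORECASE)


def parse_block_urn(urn):
    """Return (mediatype, bitprint); mediatype may be the empty string."""
    if urn[:len(PREFIX)].lower() != PREFIX:
        raise ValueError("not a block URN: %r" % urn)
    rest = urn[len(PREFIX):]
    # The bitprint contains no comma, so the last comma is the separator.
    mediatype, sep, bitprint = rest.rpartition(",")
    if not sep:
        raise ValueError("missing ',' before the bitprint")
    if not BITPRINT_RE.fullmatch(bitprint):
        raise ValueError("malformed bitprint: %r" % bitprint)
    return mediatype, bitprint
```

The mediatype is returned still percent-escaped; unescaping and interpreting it is a separate step.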

The type, subtype, attribute and value tokens are specified by [RFC2045]. All characters not in <URN chars> as defined by [RFC2141] MUST be percent escaped [RFC1630], with one special exception: The slash separating type from subtype MUST NOT be escaped. This is for easier readability, and is consistent with the use in data URLs [RFC2397] (it's also the thing most likely to be struck down in the namespace application process... but we can see whether it gets through or not).

Block URNs are completely case-insensitive; they are canonicalized by lower-casing them, character by character. Two block URNs are thus considered equal when compared ignoring case.
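The canonicalization rule is small enough to sketch directly. The function names here are hypothetical; the sketch assumes the URN is already fully percent-escaped and therefore pure ASCII.

```python
def canonicalize(urn):
    """Canonical form of a block URN: lower-case every character."""
    return urn.lower()


def same_block(a, b):
    """Two block URNs are equal when they match ignoring case."""
    return canonicalize(a) == canonicalize(b)
```

Note that lower-casing also turns an escape such as %4A into %4a, which still decodes to the same byte, so escaped upper-case characters survive canonicalization.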

To make this work, in case-sensitive values, upper-case characters MUST be percent escaped, since they are not allowed in the canonical form. This is admittedly ugly, but case-sensitive values are rare. For parameters whose value is always a token as defined by [RFC2045] (for example charset), value SHOULD NOT be enclosed in quotation marks (prior to percent escaping). For parameters whose value may contain characters not allowed in token, value SHOULD be enclosed in quotation marks. Quoting [RFC2045],

token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
            or tspecials>
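The quoting and escaping rules for parameter values could be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, non-ASCII characters are encoded as UTF-8 before escaping, and the hex digits are written in lower case so that the result is already in canonical form.

```python
import string

# tspecials from RFC 2045; a token may not contain any of these.
TSPECIALS = set('()<>@,;:\\"/[]?=')

# Characters that may appear unescaped: lower-case letters, digits, and the
# <other> set of RFC 2141. Upper-case letters are deliberately left out,
# because they are not allowed in the canonical form.
URN_OTHER = "()+,-.:=@;$_!*'"
SAFE = set(string.ascii_lowercase + string.digits + URN_OTHER)


def is_token(value):
    """True if value is an RFC 2045 token (no SPACE, CTLs or tspecials)."""
    return bool(value) and all(
        0x20 < ord(c) < 0x7F and c not in TSPECIALS for c in value)


def percent_escape(text):
    """Percent-escape every byte that is not in SAFE."""
    out = []
    for byte in text.encode("utf-8"):  # assumption: UTF-8 for non-ASCII
        ch = chr(byte)
        out.append(ch if ch in SAFE else "%%%02x" % byte)
    return "".join(out)


def encode_value(value):
    """Quote value if it is not a token, then percent-escape it."""
    if not is_token(value):
        inner = value.replace("\\", "\\\\").replace('"', '\\"')
        value = '"' + inner + '"'
    return percent_escape(value)
```

For example, encode_value("utf-8") leaves the value unchanged, while encode_value("UTF-8") escapes the three upper-case letters.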

"X-" types aren't allowed, as they work against the persistence of Storm blocks; application/octet-stream or similar must be used instead. There is an internet-draft [draft-eastlake-cturi-04] on the use of URIs as MIME types; if this becomes standard, it should be used for extension.

Unlike in [RFC2397], if no <mediatype> is given, application/octet-stream is assumed (not text/plain).
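A small sketch of how a reader of block URNs might apply this default. The name split_mediatype is hypothetical. Applying the default when only parameters are present (a mediatype starting with ";") is an assumption made here by analogy with [RFC2397]; the text above only covers the case where the mediatype is missing altogether.

```python
DEFAULT_TYPE = "application/octet-stream"


def split_mediatype(mediatype):
    """Split a (still percent-escaped) mediatype into (type, parameters)."""
    head, *params = mediatype.split(";")
    # No type/subtype given: fall back to the default, not text/plain.
    content_type = head or DEFAULT_TYPE
    # Each parameter is attribute "=" value; values stay percent-escaped.
    parameters = dict(p.split("=", 1) for p in params)
    return content_type, parameters
```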

There is a public domain Java implementation of bitprints at <http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/bitcollider/jbitprint/>. Bitprints may be registered as a URN namespace in the future, according to Bitzi. However, they will not include a content type.
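For orientation only, the first half of a bitprint (the 32 characters before the dot) can be reproduced with standard-library calls, since it is the base32 encoding of the SHA-1 digest. The second half is a Tiger tree hash, which the Python standard library does not provide, so it is left out of this sketch; a full implementation such as the one above is needed for real bitprints. The function name is hypothetical.

```python
import base64
import hashlib


def sha1_base32(data):
    """Base32-encoded SHA-1 of data: the part of a bitprint before the dot."""
    digest = hashlib.sha1(data).digest()  # 20 bytes
    # 20 bytes encode to exactly 32 base32 characters, with no padding.
    return base64.b32encode(digest).decode("ascii")
```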

- Benja