KIBI Network Specifications :: Kixt

Kixt Transmissions

Abstract

The following specification defines a transmission format for documents. This format has been designed for use with Unicode and Kixt charsets, although it may conceivably be adopted by other character sets as well.

1. Introduction

1.1 Purpose and Scope

This specification defines the Kixt Transmission Format, a means of encoding texts for transmission and storage. It is a text-based format: it is designed for sequences of codepoints which represent characters according to some character set. It is not suitable for transmitting binary information.

In essence, the Kixt Transmission Format is a container format (like OGG or MP4), but for plain text rather than multimedia.

This specification does not define the means of transmitting the codepoints of a Kixt transmission, only the semantics that those codepoints entail. It may conceivably be used in conjunction with networking protocols, or with files being read from or saved to disk.

The file extension .kixt or .kxt is suggested for saved documents which are encoded according to the Kixt Transmission Format.

1.2 Relationship to Other Specifications

This document is part of the Kixt family of specifications. It is also built upon RDF technologies.

In this document, the following prefixes are used to represent the following strings:

Prefix Expansion
kixt: https://vocab.KIBI.network/Kixt/#

2. Character Sets

A character set defined by a Kixt Charset Definition is transmission compatible if and only if:

The Unicode character set, as well as the ASCII subset thereof, is assumed to be transmission compatible. Whether other character sets are transmission compatible is left undefined by this specification.

Outside of a page, the character set of a transmission is https://charset.KIBI.network/Kixt/Controls.

In this specification, characters within a character set are identified by codepoint (always listed in hexadecimal). For purposes of readability, the name used in https://charset.KIBI.network/Kixt/Controls is also provided. However, there is no requirement that all transmission compatible character sets necessarily use these same names.

A character set is variable-width compatible if it only defines codepoints in the range 0000–FFFF, and the sets of bytes used for the most significant and least significant places of characters are disjoint for all codepoints greater than 0000. A character set defined by a Kixt Charset Definition is additionally only variable-width compatible if the value of its kixt:supportsVariableEncoding property is true.

The requirement of disjointness for variable-width compatible character sets means that determining the codepoint a byte belongs to only requires looking at the previous or next byte.
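The disjointness condition above can be sketched as a small check over the codepoints a character set assigns. This is a minimal illustration, not part of the specification: the function name and the `supports_variable_encoding` parameter (standing in for the kixt:supportsVariableEncoding property) are hypothetical, and treating 00 as an ordinary byte value when building the two sets is an assumption.

```python
def is_variable_width_compatible(codepoints, supports_variable_encoding=True):
    """Return True if the assigned codepoints satisfy the conditions sketched above.

    `supports_variable_encoding` is a hypothetical stand-in for the
    kixt:supportsVariableEncoding property of a Kixt Charset Definition.
    """
    most_significant = set()
    least_significant = set()
    for cp in codepoints:
        if not 0x0000 <= cp <= 0xFFFF:
            return False  # only codepoints in the range 0000-FFFF are permitted
        if cp > 0x0000:
            most_significant.add(cp >> 8)     # byte in the most significant place
            least_significant.add(cp & 0xFF)  # byte in the least significant place
    # Assumption: 00 is treated like any other byte value in both sets.
    return supports_variable_encoding and most_significant.isdisjoint(least_significant)
```

Under these assumptions, a set containing 0041 and 80A0 passes the check, while a set containing 0041 and 4141 does not, because the byte 41 would appear in both places.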

2.1 Control Characters

The control characters are, for any Kixt charset, any character with a kixt:basicType of kixt:CONTROL. The control characters of other character sets are not defined by this specification. Many (although not all) transmission characters are control characters.

All control characters which are not transmission characters are invalid within documents and should be replaced with 1A INVALID on reading.

The following control characters in a transmission compatible character set are transmission characters:

2.2 Overview of Characters

The broad meanings and interpretation of the characters whose semantics are defined by this specification are given below. Note that the specific usage of these characters will be clarified in later sections.

00 NULL

A meaningless format character. This character may be used for byte-padding within a source text, and does not represent anything in itself. All programs should ignore 00 NULL characters whenever they appear, treating them as though they were not present.

Of particular importance, programs or algorithms which perform searching or sanitization functions must strip all 00 NULL characters from their input before processing. Failure to do so may allow unsafe syntactic constructs to propagate to places where they should otherwise not be allowed.
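As a rough illustration of the stripping requirement, the following sketch removes 00 NULL from both the input and the search pattern before comparing them. The function names are hypothetical, and codepoints are assumed to be represented as plain integers.

```python
NULL = 0x00

def strip_nulls(codepoints):
    """Drop every 00 NULL codepoint, as though it were never present."""
    return [cp for cp in codepoints if cp != NULL]

def contains_sequence(haystack, needle):
    """Search for `needle` in `haystack` after stripping 00 NULL from both."""
    haystack = strip_nulls(haystack)
    needle = strip_nulls(needle)
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))
```

Without the stripping step, a sequence such as 0E 00 41 would not match a search for 0E 41, even though readers treat the two as identical.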

01 HEAD

Begins a new header. This character is only valid at the beginning of a page; in all other situations it should be replaced on reading with 1A INVALID.

02 BEGIN

Begins a new text. This character is only valid at the beginning of a page or to close a header; in all other situations it should be replaced on reading with 1A INVALID.

03 FINISH

Ends a page. Implies a closing of any open header or text.

04 DONE

Ends a transmission. Only for use in networked contexts; should not be saved to disk.

0E LEAVE

Opens a data block which effectively switches the character set to Unicode until 0F RETURN is encountered, which closes the block. The characters in the data contents of this block must not be in the range 01–1F or 7F–9F. This character is only allowed in headers; elsewhere, or if no valid data block can be created, this character should be replaced on reading with 1A INVALID.

0F RETURN

Closes the data block begun by 0E LEAVE. In all other situations, it should be replaced on reading with 1A INVALID.

16 IDLE

A meaningless transmission character. Unlike 00 NULL, this character should not be ignored, and is not valid in every location (for example, inside of headers or texts). However, it is given no particular meaning by this specification.

17 BREAK

Ends a transmission block. Only for use in networked contexts; should not be saved to disk.

18 CANCEL

Ends a transmission by indicating that it was made in error. Only for use in networked contexts; should not be saved to disk.

19 END

Closes a document, signalling that it is complete.

1A INVALID

Replaces an invalid character in a transmission on reading.

1C VOLUME SEPARATOR
1D PART SEPARATOR
1E CHAPTER SEPARATOR
1F SECTION SEPARATOR

Format characters for subdividing a sequence of characters into progressively finer-grained divisions. These have special semantics inside of headers; their usage in texts is not defined (but not prohibited) by this specification.

7F NOTHING

A meaningless control character, used to signal an empty message. Unlike 00 NULL, this character should not be ignored, and is not valid in every location. Furthermore, unlike 16 IDLE, this is not a transmission character and is invalid inside of documents. However, it is given no particular meaning by this specification.

2.3 Data Blocks

A data block is a sequence of codepoints which do not necessarily belong to the character set of the surrounding text. Every data block begins with one or more opening data characters, and ends with zero or more closing data characters. The data contents of a data block are the non–00 NULL codepoints between these two sequences. The only data block defined by this specification is the sequence of codepoints beginning with an opening data character of a valid 0E LEAVE and ending with a closing data character of a valid 0F RETURN. Other specifications may define other data blocks.

For Kixt charsets, characters with a kixt:basicType of kixt:DATA are intended for sole use in data blocks. These are the data characters. The data characters of other character sets are not defined by this specification.

Although the character sets of data blocks are not necessarily known, data contents are nevertheless encoded the same as any other characters.

2.4 Messages

A message is a sequence of codepoints which performs a similar function to a control character. Each message begins with a single messaging character. For any Kixt charset, the messaging characters are those characters with a kixt:basicType of kixt:MESSAGING. The messaging characters of other character sets are not defined by this specification.

A messaging character may be followed by any of the following, which comprises part of the message and determines its message contents:

Otherwise, the message is invalid and the messaging character should be replaced with a 1A INVALID on reading.

All messages are invalid within documents and the entirety of any such message should be replaced with a single 1A INVALID on reading. The meaning of messages outside of documents is not defined by this specification.

2.5 Unicode Mappings

For any Kixt charset, the Unicode mapping of a sequence of codepoints is the sequence of Unicode characters which results from:

  1. For each data character which is not part of a data block, FFFD.

  2. For 1A INVALID, FFFD if it is inside of a text or header, and 1A otherwise.

  3. For each character other than 1A INVALID which is not part of a data block, replacing the corresponding assigned Kixt character with the codepoints, in order, indicated by the kixt:unicode property, or with FFFD if no character has been assigned.

  4. For each data block which begins with 0E LEAVE and ends with 0F RETURN, replacing the data block with the codepoints which result from interpreting its data contents as a sequence of UTF-16 code units (as defined by Unicode), interpreting any ill-formed code unit subsequence as FFFD.

    For clarity: The code units of a data block may in fact be encoded using UTF-8, depending on the encoding form of the transmission. The data contents of a data block are a sequence of 16-bit codepoints, not the sequence of bytes which represent them in any given encoding form.

  5. For other data blocks, the single character FFFC, unless specified otherwise by a relevant specification.

For a sequence of Unicode characters, the Unicode mapping is the sequence itself. For characters in other character sets, the Unicode mapping is not defined by this specification.
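The UTF-16 interpretation of a 0E LEAVE data block (step 4 above) can be sketched as follows. This is an illustrative reading only: the function name is hypothetical, the data contents are assumed to arrive as a list of 16-bit integers with the 0E LEAVE and 0F RETURN characters already removed, and emitting one FFFD per unpaired surrogate is an assumption about how ill-formed subsequences are counted.

```python
REPLACEMENT = "\uFFFD"

def decode_leave_block(data_contents):
    """Map the 16-bit data contents of a 0E LEAVE block to a Unicode string."""
    units = [u for u in data_contents if u != 0x0000]  # data contents exclude 00 NULL
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Well-formed surrogate pair: combine into one supplementary codepoint.
            out.append(chr(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)))
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            # Unpaired surrogate: ill-formed, so substitute FFFD.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(chr(unit))
            i += 1
    return "".join(out)
```

The sketch works on code units rather than bytes, matching the clarification that data contents are a sequence of 16-bit codepoints regardless of the encoding form of the transmission.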

3. Kixt Transmission Format

3.1 Transmissions

A Kixt transmission is a sequence of bytes conforming to the semantics of this specification. Transmissions may be read from a file or transmitted over a network.

3.1.1 Transmission encoding

An encoding scheme is a means of representing a sequence of codepoints as a sequence of bytes. Five possible encoding schemes are defined:

Generalized UTF-8
As defined by the WTF-8 specification
Fullwidth-BE
For Unicode, UTF-16BE. For Kixt charsets, each codepoint is represented as a sequence of two bytes, with the most significant byte first.
Fullwidth-LE
For Unicode, UTF-16LE. For Kixt charsets, each codepoint is represented as a sequence of two bytes, with the least significant byte first.
Variable-BE
For variable-width compatible character sets, each codepoint within the text of a document is represented as a sequence of either:
  1. For characters for which one byte is 00, the other byte.

  2. For all other characters, the codepoint as a sequence of two bytes, with the most significant byte first.

For all other character sets, and outside of texts, the same as Fullwidth-BE.

Variable-LE
For variable-width compatible character sets, each codepoint within the text of a document is represented as a sequence of either:
  1. For characters for which one byte is 00, the other byte.

  2. For all other characters, the codepoint as a sequence of two bytes, with the least significant byte first.

For all other character sets, and outside of texts, the same as Fullwidth-LE.

For transmission compatible character sets, the 00 byte will always be the most significant byte. The primary advantage to variable-width encodings is that they allow ASCII texts to be interpreted without modification, while otherwise maintaining a 16-bit encoding scheme.
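A minimal sketch of the two variable-width schemes, assuming the input is the text of a document in a variable-width compatible character set, given as a list of integer codepoints. The function name and the `big_endian` flag are hypothetical; outside of texts, or for other character sets, the fullwidth rules apply instead and are not shown here.

```python
def encode_variable(codepoints, big_endian=True):
    """Encode text codepoints as Variable-BE (default) or Variable-LE bytes."""
    out = bytearray()
    for cp in codepoints:
        if not 0x0000 <= cp <= 0xFFFF:
            raise ValueError(f"codepoint {cp:#x} is outside the range 0000-FFFF")
        high, low = cp >> 8, cp & 0xFF
        if high == 0x00:
            out.append(low)   # one byte is 00: emit only the other byte
        elif low == 0x00:
            out.append(high)  # same rule, with the roles of the bytes swapped
        else:
            out.extend((high, low) if big_endian else (low, high))
    return bytes(out)
```

For example, the codepoints 0048 and 0069 encode to the two ASCII bytes 48 69 under either scheme, which is the compatibility property described above.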

It is recommended that the bytes 80–9F and E0–FF be used as most-significant bytes, and that the bytes A0–DF be used as least-significant bytes.

The encoding scheme applies to all characters in a transmission, regardless of character set.

If the first two bytes of a transmission are FE followed by FF, then the encoding scheme for the transmission is Variable-BE. If the first two bytes of a transmission are FF followed by FE, then the encoding scheme for the transmission is Variable-LE. In either case, these first two bytes are otherwise ignored.

In transmissions, the 00 NULL character is ignored. This means that any Fullwidth-BE–encoded sequence of characters can safely be interpreted as Variable-BE, and any Fullwidth-LE–encoded sequence of characters can safely be interpreted as Variable-LE. Consequently, separate encoding detection for Fullwidth-BE or Fullwidth-LE is not required.

Otherwise, the encoding scheme is Generalized UTF-8.
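The detection rule can be sketched as a check on the first two bytes of a transmission. The function name and the string labels it returns are illustrative choices, not identifiers defined by this specification.

```python
def detect_encoding_scheme(data):
    """Return (scheme name, remaining bytes) for the raw bytes of a transmission."""
    if data[:2] == b"\xFE\xFF":
        return "Variable-BE", data[2:]  # the two marker bytes are otherwise ignored
    if data[:2] == b"\xFF\xFE":
        return "Variable-LE", data[2:]
    return "Generalized UTF-8", data
```

Fullwidth-BE and Fullwidth-LE need no branches of their own here, since fullwidth sequences can be read as the corresponding variable-width scheme once 00 NULL is ignored.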

3.1.2 Transmission blocks

In certain networked situations, a transmission might need to be broken into multiple parts, known as transmission blocks. The character 17 BREAK may be used to signal the end of a transmission block without implying that a transmission has terminated.

In non-network contexts, or in other contexts where transmission blocks are not used, 17 BREAK should be replaced with 1A INVALID during reading, unless it is terminal (optionally followed by 04 DONE or 18 CANCEL), in which case it should be ignored.

3.1.3 Transmission termination

The end of a transmission may be signalled by the use of 04 DONE or 18 CANCEL. The former of these characters signals a completed transmission, whereas the latter signals that a transmission was made in error and should be discarded.

The end of transmissions should generally only be signalled when communicating over a network. Non-terminal 04 DONE and 18 CANCEL characters which appear in non-network contexts should be replaced with 1A INVALID during reading. Terminal 04 DONE and 18 CANCEL characters which appear in non-network contexts should be ignored.
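For non-network contexts, the handling of 04 DONE and 18 CANCEL might look like the sketch below. It assumes that "terminal" means the last codepoint of the input once 00 NULL has been stripped, which is one reading of the text rather than something the specification states; the function name is hypothetical, and the related rules for 17 BREAK are not covered.

```python
DONE, CANCEL, INVALID = 0x04, 0x18, 0x1A

def normalize_termination(codepoints):
    """Apply the non-network rules for 04 DONE and 18 CANCEL to a codepoint list.

    Assumes 00 NULL has already been stripped from `codepoints`.
    """
    cps = list(codepoints)
    # Assumption: a terminal character is the final codepoint of the input.
    terminal = bool(cps) and cps[-1] in (DONE, CANCEL)
    body = cps[:-1] if terminal else cps  # a terminal DONE or CANCEL is ignored
    # Any remaining DONE or CANCEL is non-terminal and becomes 1A INVALID.
    return [INVALID if cp in (DONE, CANCEL) else cp for cp in body]
```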

3.2 Documents

A document is a sequence of bytes which forms a cohesive whole. Documents may consist of any number of pages. A transmission may consist of multiple documents, or only part of one.

A document is automatically opened within any transmission whenever no document is currently open and a character is encountered which, after reading, is not a control character or part of a message. If no document is presently open, a new document may also be explicitly opened by 01 HEAD or 02 BEGIN. Documents are closed by 19 END.

The “after reading” stipulation above is meant to indicate that characters which are ignored on reading (00 NULL), or which are replaced on reading with 1A INVALID, do not open a new document.

The ends of documents should be signalled via 19 END regardless of context (and even for saved files). A lack of a 19 END character indicates that a document is incomplete, perhaps because it was not fully saved, or because a transmission was ended before its end could be reached.

Files saved to disk should only contain a single document—and consequently, only a single, terminal 19 END character. Programs which concatenate documents should take care to only concatenate the pages of the documents, and not non-page content. When reading a file from disk, a non-terminal 19 END character should be replaced with 1A INVALID during reading.

3.3 Pages

A page is a section of a document consisting of an optional header, followed by a text. Documents may contain any number of pages. A new page is automatically opened in a document whenever no page is currently open and a character is encountered which, after reading, is not a transmission character. Pages can also explicitly be opened with a 01 HEAD or 02 BEGIN character.

Control characters and messages have already been replaced with 1A INVALID at this point, as we are inside of a document.

Pages extend until the end of their document, or can be explicitly closed with a 03 FINISH character (thus allowing the opening of a new page).

3.3.1 Page headers

A header provides metadata information about a page. It is opened by 01 HEAD and continues until the start of the text (signalled by 02 BEGIN) or the end of the page, whichever occurs first. Headers must precede texts; consequently, a page with a header necessarily must begin with 01 HEAD (optionally preceded by any number of 16 IDLE, 1A INVALID, or 7F NOTHING characters).

The contents of headers define RDF triples whose subject is a node representing the current document. The predicate and object of each triple is separated by 1D PART SEPARATOR, and the triples themselves are separated by 1C VOLUME SEPARATOR. If no triple can be formed from a sequence of characters so delineated (either because it does not consist of two components separated by a 1D PART SEPARATOR, contains invalid characters, or does not conform to appropriate RDF semantics), the entire sequence (up until the next 1C VOLUME SEPARATOR, or the end of the header) is ignored.
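The separator structure of a header can be sketched as a simple split into predicate and object sequences. This is an illustration under stated assumptions: the function name is hypothetical, the header contents are given as a list of integer codepoints without the surrounding 01 HEAD and 02 BEGIN, "contains invalid characters" is approximated by the presence of 1A INVALID, and neither IRI validity nor RDF semantics are checked.

```python
VOLUME_SEPARATOR, PART_SEPARATOR, INVALID = 0x1C, 0x1D, 0x1A

def split_header(codepoints):
    """Split header contents into (predicate, object) codepoint sequences."""
    pairs = []
    segment = []
    # Appending a final separator flushes the last segment of the header.
    for cp in list(codepoints) + [VOLUME_SEPARATOR]:
        if cp != VOLUME_SEPARATOR:
            segment.append(cp)
            continue
        # Keep only segments with exactly two components and no 1A INVALID.
        if segment.count(PART_SEPARATOR) == 1 and INVALID not in segment:
            cut = segment.index(PART_SEPARATOR)
            pairs.append((segment[:cut], segment[cut + 1:]))
        segment = []
    return pairs
```

Segments which fail the checks are dropped silently, mirroring the rule that an unusable sequence is ignored up to the next 1C VOLUME SEPARATOR or the end of the header.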

The sequence of characters representing the predicate of the RDF triple must have a Unicode mapping which is a valid IRI. The sequence of characters representing the object of the RDF triple must have a Unicode mapping which is one of the following:

character set

The character set of headers is initially https://charset.KIBI.network/Kixt/Transmission. The predicate kixt:charset can be used to define the character set for all subsequent RDF triples; the object of this predicate is valid if it is an IRI representing a transmission compatible character set. If multiple kixt:charset predicates with valid objects are declared in a header, all but the first are ignored.

Because no character set has yet been declared, the use of 0E LEAVE and 0F RETURN is required to define the kixt:charset of a header. This is intentional, as it explicitly forces the character set of a kixt:charset declaration to be Unicode.

The IRI http://www.unicode.org/versions/latest/ identifies the Unicode character set.

For closely-related character sets, multiple kixt:charset predicates may be used to indicate fallbacks should the preferred character set not be available.

media type

The kixt:mediaType predicate can be used to specify the media type of the text contents of the page. Its value should be an HTTP media type, as an xsd:string.

3.3.2 Page texts

A text provides the text contents of a page. A new text is automatically opened in a page whenever no header is currently open and a character is encountered which, after reading, is not a transmission character. Texts can also be opened explicitly with a 02 BEGIN character. Texts continue until the end of the page.

If a page ends before a text is opened, it has an empty text. The opening 02 BEGIN and closing 03 FINISH characters, if present, are not considered part of a text's contents.

Texts must not contain transmission characters other than 1A INVALID. Any other transmission characters within a text's contents should be replaced with 1A INVALID on reading.

If a kixt:charset with a valid object was declared in the header to the page containing a text, it provides the character set of the text. Otherwise, the character set of the text is left to programs to determine. In the absence of prior knowledge, programs should assume that texts are in Unicode.

As a consequence of the above rules, and in the absence of any special configuration or knowledge, a plain-text document which contains no transmission characters is assumed to consist of a single page with a single Unicode text.

4. Changelog

Clarified the distinction between transmission characters, control characters, and other characters defined in this specification; 7F NOTHING is no longer a transmission character proper.

The Unicode mapping of 1A INVALID is now FFFD inside of headers and texts.

Transmission characters now have a kixt:basicType of kixt:CONTROL and are a subset of control characters.

7F NOTHING is now a transmission character.

FFFE and FFFF are now allowed in data blocks.

Messages are now formally specified.

Data blocks which are not part of a message can now open a document.

Clarified that variable-width encodings are fixed-width outside of texts.

Clarified the UTF-16 interpretation of the data contents of a 0E LEAVE data block.

A Kixt charset is now only variable-width compatible if it has a true kixt:supportsVariableEncoding. This provides charsets with a mechanism for forward-compatibility guarantees (or not).

Initial specification.