Riding the Media Bits

Bits and bytes

Last update: 2005/03/08

Bytes are made of 8 bits, but chopping a bitstream into chunks of 8 bits does not necessarily make bytes.

 

A very general way of classifying information is based on whether it is structured or unstructured, i.e. whether it appears or is organised according to certain rules. Characters are a special form of structured information because they typically express information that is the result of mental processes. They were among the first types of structured information that computers were taught to deal with: represented as one byte (or more than one), characters can be easily handled by typically byte-oriented processors.

At the beginning, characters were confined within each machine, because communication between computers was still a remote need. In those days memory was at a premium, and the possibility of a 70% saving, thanks to the typical entropy of structured text (a value usually found when zipping a text file), should have suggested more efficient forms of character representation. This did not happen, because the processing required to convert text between uncompressed and compressed forms, and to manage the resulting variable-length coded (VLC) information, consumed an even scarcer resource: the CPU. Later the need to transmit characters over telecommunication links arose but, again, the management of VLC information was a deterrent. SGML, a standard developed in the mid 1980s, was designed as a strictly character-based standard. The same was done for XML more than 10 years later, even though the progress of technology could have allowed bolder design assumptions.
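The saving mentioned above is easy to verify with any deflate-style compressor. A minimal sketch using Python's zlib module (the sample text is an excerpt repeated 8 times, which flatters the compressor; on a typical plain English file the saving is around 60-70%):

```python
import zlib

# Compress a chunk of plain text and measure the saving.
text = ("A very general way of classifying information is based on whether "
        "it is structured or unstructured, i.e. whether it appears or is "
        "organised according to certain rules. " * 8).encode("ascii")

packed = zlib.compress(text, 9)          # level 9 = best compression
saving = 1 - len(packed) / len(text)
print(f"original: {len(text)} bytes, compressed: {len(packed)} bytes, "
      f"saving: {saving:.0%}")
```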

A picture of the Mato Grosso taken from an airplane, or the noise coming from a crowd of people, can be considered unstructured. Depending on the level at which information is assessed, however, it can become structured: the picture of an individual tree in the Mato Grosso, or an individual voice coming from a crowd, usually represents structured information. Humans are very good at understanding highly complex structured information, and one of the first dreams of Computer Science (CS) has been to endow computers with the ability to process information carried by picture or sound signals in a way comparable to human abilities. Teaching computers to understand, however, proved to be a hard task.

Signal Processing (SP) has traditionally dealt with pictures and audio, mostly in digital form, and has used CS tools to make progress in its field of endeavour. In contrast with CS, the SP community had to deal with huge amounts of information: about 4 Mbit for a digitised sheet of paper, at least 64 kbit/s for digitised speech, about 1.5 Mbit/s for digitised stereo sound, more than 200 Mbit/s for digital video, etc. To be practically usable, these large amounts of bits had to be reduced to more manageable values, and this required sophisticated compression coding algorithms that removed irrelevant or dispensable information.
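The figures quoted above follow from simple back-of-the-envelope arithmetic. The sampling parameters below are assumptions matching the usual sources of those numbers: standard-resolution fax, PCM telephone speech, CD audio and CCIR 601 studio video.

```python
# Raw (uncompressed) bit counts for common digitised media.
fax_page = 1728 * 2376       # fax: 1728 x 2376 black/white pixels per A4 page
speech = 8000 * 8            # telephone speech: 8 kHz sampling, 8 bits/sample
cd_audio = 44100 * 16 * 2    # CD: 44.1 kHz, 16 bits/sample, 2 channels
ccir601 = 13_500_000 * 16    # CCIR 601: 13.5 MHz sampling, 16 bits/pixel (4:2:2)

print(fax_page)   # 4105728   -> about 4 Mbit per page
print(speech)     # 64000     -> 64 kbit/s
print(cd_audio)   # 1411200   -> about 1.5 Mbit/s
print(ccir601)    # 216000000 -> more than 200 Mbit/s
```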

In the early days, transmission was the main target application, and these complex algorithms could only be implemented using special circuits. For the SP community the only thing that mattered was bits - and not just "static" bits, as in computer "files", but "dynamic" bits, as indicated by the word "bitstream".

It is worth revisiting the basics of picture and audio compression algorithms keeping both the SP and the CS perspectives in mind. Let us consider a set of video samples and apply an algorithm that converts them into one or more sets of variables, taking MPEG-1 Video as an example. We take a macroblock of 2x2 blocks of 8x8 pixels each, calculate one motion vector, add some other information such as the position of the block, etc., and then apply a DCT to each of the (differences of the) 8x8 pixel blocks. If we were to use the CS approach, all these variables would be tagged with long strings of characters and the result would probably be more bits than before. In the SP approach, instead, the syntax and semantics of the bitstream containing the VLC-coded motion vectors and DCT coefficients are standardised. This means that the mapping between the binary representation and the PCM values at the output is standardised, while the equivalent of the "tagged" representation that exists in all implementations of decoders is not.
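The DCT step above can be sketched in a few lines. This is a deliberately naive 2-D DCT-II (real codecs use fast factorisations), shown only to make the energy-compaction idea concrete: a flat block collapses into a single DC coefficient, leaving almost nothing for the VLC to transmit.

```python
import math

def dct2d(block):
    """Naive 8x8 2-D DCT-II, the transform used by MPEG-1 intra coding."""
    N = 8
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

flat = [[128] * 8 for _ in range(8)]   # a flat grey 8x8 block
coeffs = dct2d(flat)
# All the energy collapses into the DC coefficient (coeffs[0][0] == 1024);
# every AC coefficient is numerically zero - which is why DCT + VLC compresses.
```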

The CS community did not change its attitude when the WWW was invented. In spite of the low bitrate of the communication links used in the early years of the web, transmission of an HTML page was (and still is) done by transmitting the characters in the file. Pictures are sent using JPEG for photo-like pictures and GIF for graphic pictures (even though many authors use one format for the other purpose and vice versa). JPEG was created by the SP community and applies lossy compression; the Graphics Interchange Format (GIF) was created by the CS community and only applies mild lossless compression. Portable Network Graphics (PNG), a picture representation standard endorsed by the W3C, likewise applies only lossless compression, probably because users are expected not to be concerned with their telephone bills, PNG's main selling point being that it is said to be patent-free.

This means that for the past 10 years or so the world has been misusing the telecommunication infrastructure, sending about three times more information than strictly required, when a simple text compression algorithm could have been used. Not that the telco industry should complain about it - the importance of its role is still measured by how many bits it carries - but one can see how resources have been, and still are, misused because of a specific philosophical approach to problems. The transmission of web pages over the "wireless" Internet is of interest to the telcos, because bandwidth is really narrow there, and for some time there have been companies offering solutions for sending "compressed" web pages to mobile phones. Each of the solutions proposed was proprietary, without any move to create a single standard, so the hype subsided and those incompatible solutions went nowhere.

A similar case can be made for VRML worlds. Since these worlds are mostly made of synthetic pictures, the development of VRML was largely carried out by representatives of the CS community. Originally it was assumed that VRML worlds would be represented by "files" stored in computers, so the standard was designed in the typical CS fashion, i.e. by using text for the tags - and long text at that. The result is that VRML files are exceedingly large and their transmission over the Internet is not practical. The long downloading time of VRML files, even for simple worlds, is one of the reasons for the slow uptake of VRML. Instead of working on reducing the size of the file by abandoning the text-based representation, so unsuitable in a bandwidth-constrained network environment, the Web3D Consortium (as it is now called) has developed a new XML-based representation of its VRML 97 standard, which keeps on using long XML tags. I fail to understand why a long XML tag should be any better than a long non-XML tag.

One of the first things MPEG, with its strong SP background, did when it used VRML 97 as the basis of its scene description technology was to develop a binary representation of the VRML textual representation. The same was done for Face and Body Animation (FBA), and indeed this part of the MPEG-4 standard offers another good case for studying the different behaviour of the CS and SP communities. As an MPEG-4 synthetic face is animated by the values of 66 parameters, a transmission scheme invented in the CS world could code the Facial Animation Parameters (FAPs) somewhat like this:

expression (type joy, intensity 7), viseme (type o, intensity 5), open_jaw (intensity 10), ..., stretch_l_nose (intensity 1), ..., pull_r_ear (intensity 2)

where the semantics of expression, joy, etc. are normative. Transmission would simply be effected by sending the characters exactly as written above. If the coding were done in the SP world, the first concern would be to try and reduce the number of bits. One possibility could be to group the 66 FAPs into 10 groups, where the state of each group would be represented with 2 bits, e.g. with the following meaning:

00  no FAP transmitted
11  all FAPs in the group transmitted with an arithmetic coder
01  some FAPs, signalled with a submask, transmitted with an arithmetic coder
10  some FAPs, signalled with a submask, interpolated from actually transmitted FAPs
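The bookkeeping cost of such a scheme is tiny. The sketch below is illustrative, not the normative MPEG-4 syntax: it packs the 2-bit state of 10 FAP groups into a single 20-bit header, against which the hundreds of bits of a character-tagged representation can be compared.

```python
# 2-bit group states as defined in the table above.
NO_FAPS, ALL_FAPS, SOME_CODED, SOME_INTERP = 0b00, 0b11, 0b01, 0b10

def pack_group_states(states):
    """Pack ten 2-bit group states, first group in the most significant bits."""
    assert len(states) == 10
    word = 0
    for s in states:
        word = (word << 2) | (s & 0b11)
    return word  # fits in 20 bits

states = [NO_FAPS] * 10
states[0] = ALL_FAPS            # e.g. the first group fully transmitted
header = pack_group_states(states)
# 10 groups cost 20 bits; submasks and coded values follow only where needed.
```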

Another interesting example of the efforts made by MPEG to build bridges between the CS and SP worlds is given by XMT. Proposed by Michelle Kim of IBM Research, XMT provides an XML-based textual representation of the MPEG-4 binary composition technology. As described in the figure below

[Figure: the XMT framework]

this representation accommodates substantial portions of SMIL, Scalable Vector Graphics (SVG) of W3C and X3D (the new name of VRML). Such a representation can be directly played back by a SMIL or VRML player, but can also be binarised to become a native MPEG-4 representation that can be played by an MPEG-4 player.

As already reported in the MPEG-7 portion, another bridge has been created with BiM, contributed by Claude Seyrat of Expway. BiM has been designed as a binariser of XML files, to allow bit-efficient representation of Descriptors (Ds) and Description Schemes (DSs), but it can also be used to binarise any XML file.
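The core idea of XML binarisation can be shown with a toy sketch (this is not the actual BiM format, and the tag table and document below are made up for illustration): when the schema is known to both ends, each element name can be sent as an index into a table instead of as a string, so a long tag costs one byte however many times it recurs.

```python
import xml.etree.ElementTree as ET

def binarise(xml_text, tag_table):
    """Toy XML binariser: 1 byte per tag index, length-prefixed text (max 255 bytes)."""
    root = ET.fromstring(xml_text)
    out = bytearray()
    def walk(el):
        out.append(tag_table[el.tag])            # tag index instead of tag string
        text = (el.text or "").strip().encode()
        out.append(len(text))                    # length prefix
        out.extend(text)
        for child in el:
            walk(child)
    walk(root)
    return bytes(out)

# Hypothetical schema-derived tag table and sample document.
tags = {"Description": 0, "Creator": 1, "Title": 2}
doc = "<Description><Creator>Expway</Creator><Title>BiM</Title></Description>"
packed = binarise(doc, tags)
print(len(doc), "characters ->", len(packed), "bytes")   # 70 characters -> 15 bytes
```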

 

 

Copyright © 2003 chiariglione.org