TITLE: abstractions and data conversion

(Newsgroups: comp.lang.c++.moderated, 19 Jan 97)

[ I'm happy to see John Skaller posting again to the news
  groups. As always, his comments leave me scratching my
  head. I sometimes even read them twice. Enjoy. -adc ]


MALAK: malak@access.digex.net (Michael Malak)

>Where should one put functions which convert data from type A to
>type B?


SKALLER: skaller@maxtal.com.au (John (Max) Skaller)

	This is a good question. It shows IMHO that
OO methodology has serious faults because it cannot 
properly answer the question.

	I will show the correct, categorical, solution,
after discussing the "OO" ones, which are all inferior.
	
MALAK:

>Example 1: Color spaces.  We have classes RGB, HSV, and YIQ.  RGB
>can be considered the "baseline" format since it's closest to the
>engineering domain 

SKALLER:

	Good example.

MALAK:

>Example 2: Image file formats.  

SKALLER:

	Also a good example.

	There is a very important difference between these examples.
They are very good examples for this reason.

	Example 2 involves specific standardised file (data) formats.
There is no "abstraction" here except that these formats
represent images.

	Example 1 involves interfaces modelling
specific standardised _interfaces_.  Abstractions
are modelled (represented) too, even if by interfaces
rather than data. So the issues are the same
at a different level: an interface is no more or 
less abstract than a data structure, inherently:
it depends on your viewpoint.

	This is the problem with OO: it presents
a uniform monolithic observer independent
representation of an abstraction (a _specific_
interface for all clients). OTOH a pure data structure
is completely observer dependent, you cannot DO
anything with it, you have to write code to manipulate
it. You can write anything that the data structure supports.
The interface is almost entirely observer dependent,
and, it is also (consequently) bound to specific data.

	Pure data, being without structure, is the
_thesis_. OO is the reaction against it,
the _antithesis_. It fails for the same reasons,
inverted. Categories provide BOTH facilities
in a way that the programmer can engineer
the split between data and function as desired.
They're the _synthesis_.  The formula

	thesis -> antithesis
		|
		V
	       synthesis

is due to Hegel. It represents transcendence,
revolution, or paradigm shift. (Depending
on your religion :-)

-----------------------------

	I have done example 1 by using a single
abstract class with ALL the accessors to support 
RGB, HSV, YIQ etc. This method requires
reopening classes to support a new
colour metric as a native method. It is not so good.

How the data is represented is
irrelevant -- in fact one can derive a class for each
representation, and even some which _cache_
some computable values (which I have also done,
for floating point colour spaces, since floating point can be
expensive). But still, the interface is NOT entirely abstract:
it is concrete just like data, albeit one level up.
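
A minimal sketch of the single-abstract-class approach described above. It is not from the original post: the names Colour, RgbColour and CachedRgbColour are hypothetical, and only the RGB accessors plus the HSV "value" channel are shown.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical interface: one abstract class carrying the accessors of
// several colour models at once.
class Colour {
public:
    virtual ~Colour() = default;
    virtual double red() const = 0;
    virtual double green() const = 0;
    virtual double blue() const = 0;
    virtual double value() const = 0;   // the V of HSV
    // Supporting a new colour model natively (say YIQ) means reopening
    // this class to add accessors and revisiting every derived class.
};

// One representation: stores RGB and computes V on demand.
class RgbColour : public Colour {
    double r_, g_, b_;
public:
    RgbColour(double r, double g, double b) : r_(r), g_(g), b_(b) {}
    double red() const override { return r_; }
    double green() const override { return g_; }
    double blue() const override { return b_; }
    // HSV value is the largest of the three RGB components.
    double value() const override { return std::max({r_, g_, b_}); }
};

// A second representation that caches the computed value.
class CachedRgbColour : public RgbColour {
    double v_;
public:
    CachedRgbColour(double r, double g, double b)
        : RgbColour(r, g, b), v_(RgbColour::value()) {}
    double value() const override { return v_; }
};
```

Client code sees only Colour, so the representation can vary freely, but the set of accessors is fixed by the base class. That fixed set is the sense in which the interface is concrete one level up.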

MALAK:

>Here are the possible solutions:
>
>1) Establish a policy where all conversion functions go into the
>   "baseline" format class (such as RGB and Image).  (BTW, this
>   seems to me to be the least desirable of the three possible
>   solutions.)

SKALLER:

	This is a lousy solution. Why? It breaks the open/closed
principle. See Meyer's OOSC. [ "Object-Oriented Software Construction",
by Bertrand Meyer. -adc ]

MALAK:

>2) Establish a policy where no conversion functions go into the
>   "baseline" format class.  They all go into the "other"
>   classes (such as HSV, YIQ, and LegacyImage).

SKALLER:

	This is better. But it is still not good. It is not good
to standardise one particular interface like this, unless
there is very strong consensus. Choosing RGB makes sense
for luminous colour (displays) (today, anyhow) but not for 
reflected colour (printing).

MALAK:

>3) Implement a double-dispatch mechanism into free subprograms
>   (which could be all grouped in a namespace).

SKALLER:

	I do not understand. Double dispatch is a myth.
You cannot (in general) implement a matrix of methods using two
sequences. IF you can interconvert EVERYTHING to a common
type, this mechanism works.

	At least for image formats, this is not the case:
even the notion of a 2D array of pixels does NOT provide
the only description of an image. (Eg vector graphics,
palettes, etc etc and on and on).
So for images, in general, there is NO universal type.
(In general conversions will be "lossy", and many are
needed to minimise the loss).
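
A small sketch of why two sequences of conversions through a common type cannot stand in for the full matrix. It is not from the original post: the pixel types and the choice of 8-bit grey as the "common" type are assumptions, chosen so that the loss is easy to see.

```cpp
#include <cassert>
#include <cstdint>

struct Rgb8  { std::uint8_t r, g, b; };
struct Bgr8  { std::uint8_t b, g, r; };
struct Gray8 { std::uint8_t v; };   // hypothetical "common" type

// Sequence 1: every format into the common type.
Gray8 to_hub(Rgb8 c) {
    return { static_cast<std::uint8_t>((c.r + c.g + c.b) / 3) };
}

// Sequence 2: the common type out to every format.
Rgb8 rgb_from_hub(Gray8 h) { return { h.v, h.v, h.v }; }
Bgr8 bgr_from_hub(Gray8 h) { return { h.v, h.v, h.v }; }

// A direct entry in the conversion matrix: lossless, because it never
// passes through the common type.
Bgr8 rgb_to_bgr(Rgb8 c) { return { c.b, c.g, c.r }; }
```

For pure red, rgb_to_bgr keeps r == 255, while bgr_from_hub(to_hub(red)) yields 85 in every channel. Any hub that is not universal loses information somewhere, which is why direct conversions are still needed.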

MALAK:

>Option #3 seems to me to be the most OO, while #2 seems to be the
>more natural fit for C++ (less cumbersome).  And #1 would seem to
>lead to fat classes.

SKALLER:

	Let us take Example 2. Each file format
is a PUBLIC external data format. It is represented by 
one or more internal data structures. Each 
may have convenient methods but the data
must be PUBLIC. 

	Now write conversion routines.
As global functions. NOT members.
You can have any number of them. There is no
need to break encapsulation and no way to break
encapsulation because there isn't any.
There is no need to "coerce" the function inappropriately
into one or the other class.

Perhaps conversion is an isomorphism for some
formats, (preserves all the information) and perhaps not.
This is crucial structural information.
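
A hedged sketch of conversions written as global functions over public data, with a check for whether a pair of them forms an isomorphism. It is not from the original post: single pixels stand in for whole file formats, and Rgb, Cmy and round_trips are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct Rgb { std::uint8_t r, g, b; };   // all data public
struct Cmy { std::uint8_t c, m, y; };

bool operator==(Rgb a, Rgb b) {
    return a.r == b.r && a.g == b.g && a.b == b.b;
}

// Free functions: neither type "owns" the conversion.
Cmy rgb_to_cmy(Rgb p) {
    return { static_cast<std::uint8_t>(255 - p.r),
             static_cast<std::uint8_t>(255 - p.g),
             static_cast<std::uint8_t>(255 - p.b) };
}

Rgb cmy_to_rgb(Cmy p) {
    return { static_cast<std::uint8_t>(255 - p.c),
             static_cast<std::uint8_t>(255 - p.m),
             static_cast<std::uint8_t>(255 - p.y) };
}

// Hypothetical helper: checks, for one sample, that `back` undoes `forth`.
template <class A, class F, class G>
bool round_trips(A a, F forth, G back) {
    return back(forth(a)) == a;
}
```

Here round_trips(Rgb{10, 200, 30}, rgb_to_cmy, cmy_to_rgb) holds, since complementing each 8-bit channel twice is the identity. A lossy pair would fail the same check, and that difference is the structural information the conversions carry.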
	
	So now we have algorithms and data structures.
All exposed to public view. So we need to HIDE:

	1) the details of the internal representations
  	    of the external file formats

	2) the implementation details of the conversion
	   functions

Because C++ does not support the correct unit of modularity
directly, namely the category, we need to find a tricky
engineering solution built of the available tools.
Namespaces could be useful. Another method is to
use a dummy class. In Java, such modules ARE supported
directly by the compiler. (See Java protection system).
	
	namespace impl { 
		struct GIF { .. };
		struct JPG { .. };
		JPG gif_to_jpg(GIF) { .. }
		// etc
	}

	namespace ImageCategory {
		class GIF {  impl::GIF gif; ..... };
		class JPG {  impl::JPG jpg; ..... };
		// wraps the impl conversion; must be a friend of GIF
		// (or use an accessor) to read the private member gif
		JPG gif_to_jpg(GIF g) 
		{
			return JPG(impl::gif_to_jpg(g.gif));
		}
		..
	}

Here in the namespace ImageCategory all the implementation
details are hidden by wrapping. The CONTRACT is:

	1) IF the CLIENT uses ImageCategory and _not_ impl then 
	changes to implementation details in impl will be transparent
	to CLIENT code.

	2) The SERVER must maintain the implementation space
	and faithfully implement the conversions wrapped in the
	client interface. Changes will not impact any CLIENT code.

	3) The COMPILER will prevent the CLIENT modifying the 
	implementation accidentally. (Provided rule 1 is kept).
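
One way the two namespaces might be filled in so that they compile, as a sketch only. The byte-vector payloads, the size() accessors and the friend declaration are assumptions, not part of the original post, and the conversion body is a placeholder that copies bytes rather than transcoding.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

namespace impl {
    // Hypothetical stand-ins: the real structures are elided in the text.
    struct GIF { std::vector<unsigned char> bytes; };
    struct JPG { std::vector<unsigned char> bytes; };
    // Placeholder body: copies the bytes, no real transcoding.
    JPG gif_to_jpg(const GIF& g) { return JPG{g.bytes}; }
}

namespace ImageCategory {
    class JPG {
        impl::JPG jpg;                       // hidden by wrapping
    public:
        explicit JPG(impl::JPG j) : jpg(std::move(j)) {}
        std::size_t size() const { return jpg.bytes.size(); }
    };

    class GIF {
        impl::GIF gif;                       // hidden by wrapping
        friend JPG gif_to_jpg(const GIF&);   // lets the wrapper read gif
    public:
        explicit GIF(std::vector<unsigned char> b) : gif{std::move(b)} {}
        std::size_t size() const { return gif.bytes.size(); }
    };

    // Client-facing conversion: unwrap, convert in impl, wrap again.
    JPG gif_to_jpg(const GIF& g) { return JPG(impl::gif_to_jpg(g.gif)); }
}
```

A client that names only ImageCategory::GIF, ImageCategory::JPG and ImageCategory::gif_to_jpg keeps rule 1, so the layout of impl::GIF can change without touching that client.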

The SERVER (implementation) space is OPEN: new data structures
can be added, whether new representations of file 
formats already handled, or addition of new file formats. 
Similarly, new conversions can be implemented, whether better
versions of those already implemented, or new ones not before
implemented. It is also CLOSED in the sense
that existing formats and algorithms can be left alone and used.
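
The openness of the implementation space can be sketched with a reopened namespace. This is an illustration under assumptions, not from the original post: the PNG format, gif_to_png and the byte-vector payloads are hypothetical, and the conversion bodies are placeholders.

```cpp
#include <cassert>
#include <vector>

// Existing implementation space, left untouched below.
namespace impl {
    struct GIF { std::vector<unsigned char> bytes; };
    struct JPG { std::vector<unsigned char> bytes; };
    JPG gif_to_jpg(const GIF& g) { return JPG{g.bytes}; }  // placeholder body
}

// Later, possibly in a different header: reopen the namespace and add
// a new format and a new conversion without editing anything above.
namespace impl {
    struct PNG { std::vector<unsigned char> bytes; };      // new format
    PNG gif_to_png(const GIF& g) { return PNG{g.bytes}; }  // placeholder body
}
```

The first block is closed in the sense that it is used as it stands. The second block is the extension, and it requires no change to the first.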

The interface of the CLIENT space is OPEN and CLOSED:
it can be used now, it can be extended, and it can be
modified transparently to the client to hook better internal
representations.

This model is categorical, and it obeys the open/closed principle.
It hides information properly by separating the
interface space from the implementation space.
(Note that the interfaces chosen are themselves
implementation details at a higher level)

A weakness is that there is no compiler support for 
hiding the implementation space from the client. 
(I.e. no enforcement)

Using classes instead of namespaces
solves this problem at the expense of having to break
open classes to extend them (namespaces are extensible
by design). Many languages providing separate module constructions
provide the requisite enforcement of the access contract,
but at the expense of a separate construction from the class
(loss of unification and hence scalability and reusability),
and usually with loss of openness (you have to break
open the modules to extend them).
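
A sketch of the class-as-module alternative, showing both the enforcement it buys and the openness it costs. It is not from the original post: ImageModule, GifRep, JpgRep and the byte-vector payloads are hypothetical names and stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class ImageModule {
    // Private nested types: client code cannot name them at all, so
    // the compiler enforces the access contract.
    struct GifRep { std::vector<unsigned char> bytes; };
    struct JpgRep { std::vector<unsigned char> bytes; };
    // Placeholder body: copies the bytes, no real transcoding.
    static JpgRep convert(const GifRep& g) { return JpgRep{g.bytes}; }

public:
    class JPG {
        JpgRep rep;
        friend class ImageModule;            // only the module builds one
        explicit JPG(JpgRep r) : rep(std::move(r)) {}
    public:
        std::size_t size() const { return rep.bytes.size(); }
    };

    class GIF {
        GifRep rep;
        friend class ImageModule;            // lets gif_to_jpg read rep
    public:
        explicit GIF(std::vector<unsigned char> b) : rep{std::move(b)} {}
        std::size_t size() const { return rep.bytes.size(); }
    };

    static JPG gif_to_jpg(const GIF& g) { return JPG(convert(g.rep)); }
};
```

Unlike a namespace, ImageModule cannot be reopened: adding a new format or conversion means editing this class definition, which is the cost described above.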

The point of this design is that the correct level of modularity
is the CATEGORY and NOT THE CLASS.  Classes can be
used to model categories, but the results are not properly scalable.
The same applies to namespaces and most "module" constructions
in popular languages. (Possibly excepting SML??)

In the image format example, it is NOT the image which
is the unit. It is the _collection_ of images and the maps
between them TOGETHER which should exist at
multiple levels of abstraction. (Indeed, "levels" is the
wrong word because it smacks of hierarchy.)

It is this CATEGORY which defines the abstraction 
"Image". An image is NOT a single type. The types
of various images are distinct but related.
Categories represent images by the relationships
between them NOT just by attributes.

In particular, categorical structure defines what
an image is: images can be bitmapped, vectored,
or classified in various useful ways WHICH IS 
REFLECTED IN THE ABSTRACT RELATIONS 
BETWEEN THE (CONVERSION) FUNCTIONS.
(In reality, one would need to represent
output and input devices as well, to truly
distinguish an image from, say, a sound --
and possibly even model the eye, since a dog,
colour blind human, superman, weather satellite,
stellar interferometer, or Netscape may "see" something 
quite different :-)