Silicon Dale

The Man Who Wasn't There

Yesterday upon the stair,
    I met a man who wasn't there
He wasn't there again today:
    I wish that man would go away. -- Children's nonsense rhyme

In a drillhole log, there is quite often a 'comments' field. There will be entries in this for a few intervals, but for many logged intervals this may be blank. It does not necessarily mean that information for these fields is 'missing' - merely that there is nothing sufficiently interesting about the particular interval to merit a comment. Indeed, where something really is missing there will often be a comment to indicate it, such as "note to be added later", where the logged interval is so interesting that it requires further analysis. Where there is no comment, the storage representation is often an empty string. Unfortunately this is interpreted by many SQL systems as 'NULL' - or in other words 'missing'. It is of course no more missing data than the 'dog that didn't bark'. And genuinely missing information can be of several types, none of which would be interpreted as 'NULL' - for example, 'note to be added later' as above, or 'lost sample'.

I make no apologies for returning to the problem of how best to handle missing and partially missing data, which I discussed previously in ESCA (v.16,no.1). This is because, increasingly, mining software systems are becoming dependent upon SQL database management systems (through ODBC connectivity or otherwise) for their data management. Unfortunately, commercial DBMSs that are accessed through SQL are very bad at handling missing data. They use an incomplete 3-valued logic, and a 'null' whose meaning is not properly defined. In the geosciences there is actually a wide range of types of data which can be wholly or partially missing for various reasons, and as a result any data management process that relies on SQL cannot be relied upon to give correct results.

Two of the most prominent database gurus of the past 30 years, E.F. Codd who devised the relational model and C.J. Date who is perhaps its most vocal advocate, are in agreement that SQL is inadequate (indeed incorrect) especially in its treatment of missing data; however they disagree on how best the problem should be solved. The differences in their approaches to the problem of nulls are highlighted in a debate between them, published in Date (1995). Both use a variety of esoteric arguments, but their positions can be summarised quite simply:

Date believes that a relational database should contain no nulls at all. An operation on a data item (say, testing for equality to a given string) should produce just two possible answers - TRUE or FALSE, and all logic used should be restricted to these two values. He argues that in the real world there are only these two truth-values, and therefore a database which is supposed to be a representation of the real world cannot legitimately include any other truth-values. He further points out that the number of logical operators increases unacceptably fast with addition of further truth-values to the system of logic being used.

Codd takes a more pragmatic view that there really are situations where the value of a data item is unknown, and further that there are two sorts of 'unknown': missing and applicable and missing but inapplicable. 'Missing and applicable' refers to an item which exists but for some reason has not been recorded - for example the gold grade of a sample which hasn't yet been assayed. 'Missing but inapplicable' refers to an item which does not exist - for example the thickness of a stratigraphic interval cut out by an unconformity. Codd does not dispute that reality (if it could be known fully) contains only 'true and 'false' statements, but asserts that nevertheless in the real world as modelled by databases it is necessary to use 3- or 4-valued truth tables.

Date's purist view is that there really should be no such thing as a NULL anywhere in the database world. This could be interpreted as requiring a great deal of subdivision of tables. For example in a table of assays of many elements for many samples there will almost certainly be scattering of missing information: batches which were not assayed for one or another element, samples which were assayed only for one or two elements. In order to hold these data in tables without any NULLs it would be necessary to subdivide the data set into possibly a large number of smaller tables each recording data for only a different small subset of samples and elements. This is clearly not a practicable proposition - and the proliferation of small tables defined for no other reason than avoiding the use of NULLs is quite clearly against one of the goals of relational database management: simplification of navigation of the database.

Date's preferred solution is the 'default-value' approach, in which a special value is defined for each column of a table to represent missing values. The particular advantage of this approach was that it avoided the use of anything identified as a NULL. However, it merely shifts the burden of dealing with missing information from the database management system to the application or the user. Codd (1990) presents very convincing arguments against the use of default values.

The problem is more acute for observational sciences such as geology than for the commercial world, because there are many more ways in which data may be missing or incomplete, as I discussed in the previous column. Furthermore, SQL-based database management systems fail to meet even the requirements of either Date or Codd, and certainly fail to meet the more complex requirements of geoscience.

There is merit in the arguments of both Codd and Date. It is necessary, as Codd emphasises, for missing data to be treated properly within the database management system without relying on applications to implement their own methods - which are likely themselves to be inconsistent and incomplete. Yet, as Date points out, proliferation of multiple logic-states within the DBMS would produce an exceedingly complex system. Is there some compromise which satisfies both proponents ? Indeed, are they even in fundamental disagreement at all ? And do we, as practising geologists, need to worry ourselves about all this theorising ? The answers to these three questions, in order, are 'maybe', 'no', and a resounding 'yes'. It is crucial that we understand how our database engines work, and how to ensure the integrity of the data which we use as our raw material for deposit modelling and mine planning. Without panicking, we need to be aware that the standard ODBC database interface behind our mining software does not give us licence to forget about the database problems. There are ways to deal with the problems. From simpler to more complex:-

  1. attention to database design - for example a full understanding and proper use of normalisation
  2. work-arounds to handle the various missing-data and incomplete-data problems
  3. development of a more science-friendly database management environment

Item 1 is something that anyone should be able to do, and should be required to do, when setting up and operating a database.

Item 2 will certainly require some careful logical thinking (especially when it involves slippery questions such as geometric precision) and may require some applications programming

Item 3 involves the development of a whole new database management environment suitable for observational sciences. It may not have the multi-user and transaction-processing features required by an airline reservations system - but it would handle missing or incomplete data in a consistent and logical way.

References:
Codd, E.F. 1990: The relational model for database management: version 2, Addison?Wesley, p.203-204.
Date, C.J., 1995: Relational Database Writings, 1991-1994; Addison-Wesley, chapter 9, p.341-362.

Stephen Henley
Matlock, England
steve@silicondale.co.uk
www.SiliconDale.com



Copyright © 2002 Stephen Henley
The man who wasn't there: Earth Science Computer Applications, v.18,no.4,p.1-3