Silicon Dale

The Missing Data Problem

Every geologist knows the problem: an incomplete data set - a gold assay missed out by the laboratory, a strange-looking co-ordinate value, a stratigraphic interval cut out by an unconformity, or a rock-type description that has been forgotten. If the data must be stored and processed, something must be done to indicate that these values are missing (whether temporarily or for good). In the bad old (good old) days each program had its own way of dealing with missing data and its own requirement for coding it. Quite often the solution consisted of inserting -99 or some such value in place of the absent data item. There are two big problems with this kind of solution: first, the difficulty of ensuring that the missing-data code cannot be confused with a legitimate data value, and second, the certainty that different programs will require different missing-data codes.

With the development of database management systems, the handling of missing-data codes became more systematic. For example, in G-EXEC (see ESCA vol.15, no.11, July 2000) each data file (relational table) contained a data description in which a missing-data code was defined for each column. This missing-data code was carried with the data wherever it was copied, and was recognised and acted upon by all applications programs within G-EXEC. For new columns created by G-EXEC application programs, a missing-data code was set up that was very unlikely to be a valid data value - the lowest real number which could be represented in the computer concerned. This was perfectly adequate as long as the data were not transferred to a different computer which allowed a different range of real numbers. A similar, though simplified, approach was adopted in developing Datamine, in that -10^30 was adopted as a universal numeric missing-data value, while a blank string was used for a character data null value.
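The essence of this scheme can be sketched in a few lines of Python. This is purely illustrative - the names ColumnDescriptor, missing_code and present_values are my own inventions, not G-EXEC identifiers - but it shows the key point: the sentinel travels with the column description, and every program consults it.

```python
from dataclasses import dataclass

@dataclass
class ColumnDescriptor:
    """Hypothetical per-column data description, G-EXEC style."""
    name: str
    missing_code: float  # sentinel carried with the data wherever copied

def present_values(column, descriptor):
    """Yield only the values not flagged by the column's missing-data code."""
    for v in column:
        if v != descriptor.missing_code:
            yield v

# Usage: a gold-assay column with -99.0 as its declared missing-data code.
au = ColumnDescriptor(name="Au_ppm", missing_code=-99.0)
assays = [1.2, -99.0, 0.4, 3.7]
print(list(present_values(assays, au)))  # [1.2, 0.4, 3.7]
```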

With the development of commercial database management systems, the question of nulls (shorthand for missing data values) became a central issue. In most early systems (and in many database systems to the present day) a hard-coded null-value solution was adopted. As Ted Codd, the originator of the relational data model, pointed out, however, any column in any file is drawn from a 'domain' or extended data type, which can be defined to include any range of values, including any special value chosen to be the 'null'. Although his arguments were made in the context of the relational model, they apply equally to any other type of database management system.

A solution to this problem lies in attaching an extra one-byte-wide column to each data column. This extra column contains a flag or 'mark' for each missing data value, and the data item in the main column is simply ignored whenever a mark is encountered. This is the method adopted in DB2 and other IBM database management systems. Its merit is that it guarantees there can be no confusion between nulls and legitimate values.
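A minimal sketch of the mark-column idea, again in illustrative Python rather than anything resembling DB2's actual storage format:

```python
# One value array plus a parallel one-byte mark array: 0 = present, 1 = null.
values = [1.2, 0.0, 0.4, 3.7]   # the 0.0 slot is never read: its mark is set
marks  = [0,   1,   0,   0]

def present(values, marks):
    """Return only the values whose mark is clear (not null)."""
    return [v for v, m in zip(values, marks) if m == 0]

print(present(values, marks))  # [1.2, 0.4, 3.7]
```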

However nulls are coded, they imply a system of three-valued logic: when comparing values (for example in retrieval or table-join operations) there is either a match ('True') or a mismatch ('False') or a 'Maybe' when one of the operands is a missing value.
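A three-valued comparison is easy to sketch. Here Python's None stands in for the null and the string 'maybe' for the third truth value; both choices are mine, for illustration only:

```python
MAYBE = "maybe"  # third truth value: result of comparing against a null

def tri_equals(a, b):
    """Three-valued equality: True, False, or MAYBE if either side is null."""
    if a is None or b is None:
        return MAYBE
    return a == b

print(tri_equals(1.2, 1.2))   # True
print(tri_equals(1.2, 0.4))   # False
print(tri_equals(1.2, None))  # maybe
```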

In his second version of the relational database model, however, Codd (1990) took the argument one step further. He pointed out that there are actually two kinds of missing data. The ordinary 'missing but applicable' value - which might be supplied later (the missing gold assay) - is simply unknown, while the stronger 'missing and inapplicable' value (the stratigraphic interval cut out by an unconformity) is unknowable and can never be supplied. Codd proposed that these be represented by different marks, 'A' and 'I', which could then be handled separately (when required) in manipulation and interrogation of the database. This yields a four-valued logic system (T, F, A, I).
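Extending the sketch to A-marks and I-marks gives something like the following. In a real system the marks would live out-of-band (as in the DB2-style mark byte), not as in-band strings, and the precedence chosen here (I dominating A) is my own simplification rather than Codd's full truth tables:

```python
# Hypothetical marks for Codd's two kinds of missing data.
A = "A"  # missing but applicable: a value exists but is unknown
I = "I"  # missing and inapplicable: no value can ever exist

def quad_equals(a, b):
    """Four-valued equality over values and marks: True, False, A, or I."""
    if a == I or b == I:
        return I          # inapplicable dominates in this simplified reading
    if a == A or b == A:
        return A          # otherwise unknown-but-applicable
    return a == b

print(quad_equals(1.2, 1.2))  # True
print(quad_equals(1.2, A))    # A
print(quad_equals(1.2, I))    # I
```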

More recent work by Date and Darwen (1998) on the object-relational database model draws back from this multi-valued logic and attempts to proscribe the use of nulls altogether, proposing instead 'special values'. These seem to be just like the G-EXEC missing-data codes: either valid values drawn from the data type, or values excluded from a data type which then no longer includes the full range of representable values. As far as the geologist is concerned, this is a very retrograde step. Codd, in contrast, saw that extending the definition to handle partial data was going to be necessary. Although Date and Darwen's proposal allows multiple special values, it provides no help for the partial-data problem, and the treatment of missing data is clearly not a priority for them.

What the geologist needs is in fact quite a wide range of representations of missing and partially missing data. The mark concept can easily be extended to include at least the following (and I'm sure there are many more):

  • present but illegal (e.g. the co-ordinate value found to be bad)
  • present but suspect (a co-ordinate value thought to be bad)
  • below a threshold (e.g. below detection limit)
  • above an upper threshold
  • outside a defined range
  • not within a list of acceptable values
  • undefined as a result of a computation (e.g. 0/0)

In most of these cases, a combination of a mark and a data value will be needed. The result is a multi-valued logic with a number of different flavours of 'maybe', which applications programs, as well as the database manipulation language, could process in different ways.
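One way to picture this is a mark enumeration paired with a value, since most of these marks (below detection limit, for instance) still carry a useful number. The Mark and Datum names below are hypothetical, chosen just for this sketch:

```python
from enum import Enum

class Mark(Enum):
    """Hypothetical extended marks covering the cases listed above."""
    PRESENT = "present"
    ILLEGAL = "illegal"        # present but illegal
    SUSPECT = "suspect"        # present but suspect
    BELOW_LIMIT = "below"      # below a threshold, e.g. detection limit
    ABOVE_LIMIT = "above"      # above an upper threshold
    OUT_OF_RANGE = "range"     # outside a defined range
    NOT_IN_LIST = "list"       # not within the acceptable values
    UNDEFINED = "undefined"    # result of a computation such as 0/0

class Datum:
    """A mark paired with a value: most marks still carry a value."""
    def __init__(self, value, mark=Mark.PRESENT):
        self.value = value
        self.mark = mark

# A censored assay: the detection limit is kept alongside the mark.
au = Datum(0.05, Mark.BELOW_LIMIT)
print(au.mark.name, au.value)  # BELOW_LIMIT 0.05
```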

There are further complications. Most database management systems are designed for use in the commercial world, where exact equality can be assured - for example, part serial numbers, numbers of parts ordered, prices, and dates. In scientific data, the degree of certainty can be very variable. When comparing two data values, the normal numerical comparisons (=, ≠, >, <, ≥, ≤) are too rigid. What is needed is an additional set of operators such as 'approximately equal to' (within a defined tolerance) and 'closest match'. There also needs to be some mechanism for indicating the precision or reliability of stored data in a systematic way.
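Both operators are simple to define once a tolerance is available; the following sketch (function names mine) shows one plausible reading:

```python
def approx_equal(a, b, tol):
    """'Approximately equal to' within a defined tolerance."""
    return abs(a - b) <= tol

def closest_match(target, candidates):
    """'Closest match': the stored value nearest to the target."""
    return min(candidates, key=lambda v: abs(v - target))

print(approx_equal(10.02, 10.0, tol=0.05))      # True
print(closest_match(10.02, [9.5, 10.0, 10.6]))  # 10.0
```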

This is not merely a set of academic quibbles. With progressively more mining software systems adopting ODBC standards, and abdicating responsibility for their database management to commercial systems such as Oracle or Access, it is essential that the capabilities of those systems meet the requirements of the geologist and the engineer. As both Codd (1990) and Date and Darwen (1998) point out, the SQL language is grossly deficient even in its handling of commercial database management problems, let alone the more complex needs of scientific and engineering data, and DBMSs built around SQL inherit all of its defects.

References:

Codd, E.F. 1990: The relational model for database management: version 2, Addison-Wesley, 538pp.
Date, C.J. and Darwen, H. 1998: Foundation for object/relational databases: the third manifesto, Addison-Wesley, 496pp.

Stephen Henley
Matlock, England
steve@silicondale.co.uk
www.SiliconDale.com

Copyright © 2000 Stephen Henley