Danny Ayers points out a neat summary by Robin Good of ideas about "Data Emergence".
The phrase "data emergence" really only captures one aspect of the process. Something can emerge only after a critical mass of data has been collected. This creates a tricky catch 22; in order for a database to be "aided through normal, selfish use" (Dan Bricklin, cited by Jon Udell, cited in turn by Robin Good), it is necessary for a database to exist to be used in the first place. The rule is also harder to apply in situations where any "database" such as it is is distributed.
This is just the sort of situation that arises in supporting the development of (simulation) models of the physical environment and the associated data processing, which I was talking about this week in Bristol (abstract, slides [PDF, 320Kb], for what it's worth) and will talk about again next week in Delft (I'll post the slides as updated for that when I get back, and I need to start writing this up as a paper for Hydroinformatics 2004 soon).
In that talk I said, "Don't throw information away." Throwing information away is exactly what happens all the time in model development and application activities at the moment (it happens everywhere, but let's stick with my pet case study). Raw data (for example from remote sensing or ground survey) are processed, and the processed data are used for some purpose, but the processing steps, and the reference to the raw data from which the processed data were derived, are discarded (or "not kept", but in this case sins of omission and commission can, I think, be conflated without concern, since it is clear to everyone that this information is, or will be in the future, critical).
Often even the raw data are discarded. I think this is true of at least some of the weather radars operated by the Met. Office here in the UK. In this case it is a hangover from days of yore when storage on that scale was beyond the reach even of a national meteorological office, but it needs fixing fast.
Closer to home, in the last progress meeting of the Next Generation of Flood Inundation Models project, it was observed that some of the data acquired for the project are billed as "geo-referenced", but it is quite unclear what is meant by this, and quite unlikely that any reasonably strict definition of "geo-referenced" could be applied. This example draws attention to the fact that linguistic descriptions of processing steps are still not enough; where they are made at all, they are likely to be so minimal and questionable as to be effectively content-free.
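By way of illustration only (a Python sketch, with field names and values I have made up, not anything taken from the project data), compare the bare label with a record that could actually be checked:

```python
# A free-text label carries almost no information:
label = "geo-referenced"

# A structured record, by contrast, can be inspected and queried.
# Every field name and value here is invented for illustration.
georeferencing = {
    "crs": "EPSG:27700",                  # British National Grid
    "vertical_datum": "ODN",              # Ordnance Datum Newlyn
    "transformation": "2nd-order polynomial fit",
    "control_points": 12,                 # how the fit was established
    "residual_rms_m": 0.35,               # and how good it turned out to be
}
```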
The frustrating thing here is that these processing steps are almost invariably applied with the aid of software, and that software could, if the appropriate frameworks were in place, keep track of this information without placing additional demands on the user (and so without being ambushed by Doctorow's Metacrap straw men).
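As a sketch of what I have in mind (everything here is invented for illustration; it does not describe any existing framework), a processing function can be wrapped so that every application of it records what was done, with which parameters and when, without the user typing a word of metadata:

```python
import datetime
import functools

def record_provenance(func):
    """Wrap a processing step so that its output carries a record of how it
    was produced. Illustrative only: a real framework would persist these
    records alongside the data sets themselves."""
    @functools.wraps(func)
    def wrapper(dataset, **params):
        values = func(dataset["values"], **params)
        step = {
            "operation": func.__name__,
            "parameters": params,
            "applied_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
        # The derived data set inherits its input's history, plus this step.
        return {"values": values, "history": dataset["history"] + [step]}
    return wrapper

@record_provenance
def thin(values, factor=2):
    # Stand-in for a real processing step (resampling radar data, say).
    return values[::factor]

raw = {"values": list(range(100)), "history": []}
processed = thin(raw, factor=4)
# processed["history"] now records what was done, with which parameters, and when.
```

The point is not the particular mechanics, but that the record is produced as a side effect of doing the work at all.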
Think of a persistent undo facility, where each data set carries its own processing history with it. The undo analogy isn't perfect, since in many cases knowing a forward transformation does not imply an ability to reverse it, and (as was emphasised to me after I made over-optimistic claims regarding the rate of decrease of mass storage costs without allowing for increasing demand) keeping each intermediate step is still prohibitively expensive. It is however plausible that checkpoints could be kept, and intermediate stages could be recreated by following the processing sequence forward from the nearest checkpoint.
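A rough sketch of that idea (again with invented names, and trivial operations standing in for real ones): keep a full copy of the data every few steps, record every step, and rebuild anything in between by replaying the recorded steps forward from the nearest checkpoint:

```python
CHECKPOINT_EVERY = 3  # keep a full copy every N steps: a storage/recompute trade-off

# A registry of named, repeatable operations; real steps would be
# resampling, reprojection, filtering and so on.
OPERATIONS = {
    "scale": lambda values, factor: [v * factor for v in values],
    "clip": lambda values, limit: [min(v, limit) for v in values],
}

def new_state(initial_values):
    return {"initial": list(initial_values), "values": list(initial_values),
            "steps": [], "checkpoints": {}}

def apply_step(state, op, **params):
    """Apply one step, record it, and checkpoint periodically."""
    state["values"] = OPERATIONS[op](state["values"], **params)
    state["steps"].append({"op": op, "params": params})
    if len(state["steps"]) % CHECKPOINT_EVERY == 0:
        state["checkpoints"][len(state["steps"])] = list(state["values"])
    return state

def recreate(state, step_no):
    """Rebuild the data as it stood after `step_no` steps by replaying
    forward from the nearest earlier checkpoint (or from the raw data)."""
    start = max((n for n in state["checkpoints"] if n <= step_no), default=0)
    values = list(state["checkpoints"][start]) if start else list(state["initial"])
    for step in state["steps"][start:step_no]:
        values = OPERATIONS[step["op"]](values, **step["params"])
    return values

state = new_state(range(10))
for factor in (2, 3, 5, 7):
    apply_step(state, "scale", factor=factor)
assert recreate(state, 2) == [v * 6 for v in range(10)]  # steps 1-2 replayed from the raw data
```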
These trails should be firmly attached to the data set whose derivation they describe, so that when someone is handed that data set in twenty years' time the trail is still there. This might at first seem to be at odds with Earl Mardle's comment on an earlier metadata-related post of mine.
Precisely. And for it to be worth anything, it must also be held separately from the original data. That way, others can contribute to the development of the metadata or annotation, of the document. [Earl Mardle: Metadata As Web Service]
Of course it isn't at odds really. If (no small if, but let's not get stuck on that for now) I can refer to a given data set using a URI, then I can say things about it anywhere I want to, whether I own the data set or not, as can anyone else. I can decide which of the statements other people have made about that data set (those which are made visible to me) I want to make use of. But it is essential that if I have access to the data itself, I have access to information about its provenance, and it makes little sense to do other than trust the supplier of the data to supply that.
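For instance (a sketch only, using the rdflib library and URIs I have made up), anyone who can name a data set can publish statements about it, wherever they like, and a consumer can decide which of those statements to take on board:

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Every URI and vocabulary term below is invented for illustration.
EX = Namespace("http://example.org/terms/")
dataset = URIRef("http://example.org/data/lidar-survey-2003")

# I do not have to own the data set to say things about it;
# I only need to be able to name it.
g = Graph()
g.bind("ex", EX)
g.add((dataset, EX.derivedFrom, URIRef("http://example.org/data/raw-lidar-2003")))
g.add((dataset, EX.processingStep, Literal("resampled to a 10 m grid")))
g.add((dataset, EX.statedBy, URIRef("http://example.org/people/me")))

print(g.serialize(format="turtle"))
```

The statements themselves can live anywhere; the essential thing is that the data set's URI ties them back to it.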