Sunday, December 11, 2005

Data Version Control

This is just an idea I had:

People are always changing the way they organise their data. It is in the nature of knowledge that concepts are fluid and change over time. The same principle applies to IT projects. When a new project starts, the developers have immature ideas about the domain they are working in, so they create simple data models. As the project matures, so do their concepts of the domain. The resulting changes may be simple, like adding a property to a class of objects, or bigger, as when two distinct concepts need to merge or one concept splits in two. This can be a lot of work for a database developer. The database always reflects the latest view of the domain, in a very static way. Changing the database also means changing all the data that is inside it, or that is to be imported later. The main point here is that data is treated as static.

On the other hand, we have version control systems that work on documents, and on program code in particular. Code is allowed to change. Files may be added or deleted, and classes can be added and modified. All these changes can be reverted as well, because we have version control systems like Subversion. Documents in such a system are tagged with revision numbers, and the differences between successive revisions are stored. Let's say that code is flexible: code is allowed to be fluid.

This train of thought leads to a combination of these ideas. Would it not be possible to apply the concepts and techniques of version control systems to database management systems and the data that depends on them?

The first consequence could be that, whenever a data structure (one or more database tables, say) changes, the diff between the old and the new structure is stored as well. This diff should be seen as a procedure for changing data from one structure (or revision) to another.
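To make this a bit more concrete, here is a minimal sketch in Python of what such a stored diff might look like. Everything in it is an assumption made for illustration: the `StructureDiff` class, the `split_name` and `join_name` procedures, and the example in which revision 1 has a single `name` field that revision 2 splits into `first_name` and `last_name`. Records are plain dictionaries rather than real database rows.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Record = Dict[str, object]


@dataclass(frozen=True)
class StructureDiff:
    """Hypothetical stored diff between revision `source` and `source + 1`."""
    source: int
    forward: Callable[[Record], Record]   # revision source -> source + 1
    backward: Callable[[Record], Record]  # revision source + 1 -> source


def split_name(record: Record) -> Record:
    """Revision 1 -> 2: split `name` at the first space."""
    first, _, last = str(record["name"]).partition(" ")
    rest = {k: v for k, v in record.items() if k != "name"}
    return {**rest, "first_name": first, "last_name": last}


def join_name(record: Record) -> Record:
    """Revision 2 -> 1: merge the two name fields back into `name`."""
    rest = {k: v for k, v in record.items()
            if k not in ("first_name", "last_name")}
    name = f'{record["first_name"]} {record["last_name"]}'.strip()
    return {**rest, "name": name}


DIFF_1_TO_2 = StructureDiff(source=1, forward=split_name, backward=join_name)

old_row = {"id": 7, "name": "Ada Lovelace"}
new_row = DIFF_1_TO_2.forward(old_row)
```

The point of the sketch is only that the diff is executable in both directions. Whether a real system could derive such procedures automatically from two table definitions is an open question here.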

The second consequence could be that a piece of data is not only interpreted as a datatype like a word, string, object, or array, but is connected to a revision number as well. Old data can then be used by newer versions of an application by applying the change procedures to it. And, the other way around, newer versions of the data may be used by older applications.

Such a system would allow you to change your data structures without having to change legacy data. The old data would be upgraded automatically when it enters your application.
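A rough sketch of how data tagged with a revision number might be converted on the way in could look like the following. It is only an illustration under stated assumptions: the `FORWARD` and `BACKWARD` tables, the `convert` function, and the three example revisions (add an `email` field, then rename `name` to `full_name`) are all made up for this post.

```python
from typing import Callable, Dict

Record = Dict[str, object]
Step = Callable[[Record], Record]


# Hypothetical change procedures between three revisions of a "person" record.
def add_email(r: Record) -> Record:             # revision 1 -> 2
    return {**r, "email": None}


def drop_email(r: Record) -> Record:            # revision 2 -> 1
    return {k: v for k, v in r.items() if k != "email"}


def rename_to_full_name(r: Record) -> Record:   # revision 2 -> 3
    return {("full_name" if k == "name" else k): v for k, v in r.items()}


def rename_to_name(r: Record) -> Record:        # revision 3 -> 2
    return {("name" if k == "full_name" else k): v for k, v in r.items()}


# FORWARD[n] upgrades revision n to n + 1; BACKWARD[n] downgrades n to n - 1.
FORWARD: Dict[int, Step] = {1: add_email, 2: rename_to_full_name}
BACKWARD: Dict[int, Step] = {2: drop_email, 3: rename_to_name}


def convert(record: Record, revision: int, target: int) -> Record:
    """Apply the stored change procedures one revision at a time."""
    while revision < target:
        record = FORWARD[revision](record)
        revision += 1
    while revision > target:
        record = BACKWARD[revision](record)
        revision -= 1
    return record


# Legacy data written at revision 1, read by an application at revision 3.
legacy = {"id": 1, "name": "Ada"}
upgraded = convert(legacy, revision=1, target=3)

# Newer data read by an older application that still expects revision 1.
recent = {"id": 2, "full_name": "Grace", "email": "grace@example.org"}
downgraded = convert(recent, revision=3, target=1)
```

Note that the downgrade in this sketch is lossy: the `email` value has no place in revision 1 and is simply dropped. A real system would have to decide what to do with information that an older structure cannot hold.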

Concepts are fluid. It would be great if we could write the software to support it.

