[ Top ] [ Contents ] [ Updates ] [ Fields List ] [ Database tables ] [ Subjects ] [ UISIC home ] [ UIS ]
One of the basic aims of the UIS Informatics Commission has been to facilitate local and international exchange of data related to caves and karst. This project proposes standards which would achieve this. Most of the work which has been going on to date has been in preparation for opening the formal discussion. The project will be carried out in collaboration with the Karst Information Portal (KIP), and with the online Karst Index Database (KID) run by the Australian Speleological Federation (ASF). The International Geographical Union may also be interested.
In various places through this document and associated web pages you will see reference to this Australian open-source web-based KID database. This is because it is a real-world example of a national cave/karst database which for some years has already been using the standards proposed here, and so gives a degree of confidence that the proposed standards work in practice.
KID background: This national database was originally designed and implemented by myself for ASF, but always in mind was the idea to keep it suitable also for international usage. In 1985 it was used to produce ASF's 500-page book, the Australian Karst Index 1985, and in the late 90's I converted it to a much expanded relational PC-based database, including a data dictionary for the fields. Then in 2001 under the management of Mike Lake, ASF converted it to an open-source web-based database using a paid professional programmer. Mike continues to manage the KID and has been collaborating closely with UISIC in the design of new fields and tables. For more information see: Overview | Website.
The plan now is to make use of this preliminary repository of data fields and their definitions as a practical starting point for our discussions, and where necessary, modify them and add to them to ensure they are fully satisfactory for international use.
In this proposal the same database software or structure is not needed at each end of an exchange, so the standard would facilitate, for example:
Below is UISIC's current proposal. Each aspect in turn will be presented for discussion via our coming Internet forum, prior to formal comment and voting by the delegates from each interested country. In the meantime, your comments and suggestions to improve this proposal are very welcome.
[ Top ] [ Contents ] [ Updates ] [ Fields List ] [ Database tables ] [ Subjects ] [ UISIC home ] [ UIS ]
The following three requirements are needed to allow the valid transfer, comparison, analysis and/or consolidation of cave/karst data between independent databases:
It is not required that the same software or database structure be used at each end of the transfer.
We now look at these three requirements in more detail:
1. Record Identifiers
The record identifiers (database table keys) should be constructed as
follows to conveniently achieve uniqueness:
aabbbnnnnn For example: AUVSA00035
where:
Once created for a record, the identifier should never be
changed, regardless of where the record travels, or what
has happened to the original organisation, or which
organisation is currently looking after the master copy
of the record.
2. Field Definitions
When the field names and field values of international
definitions are actually being used, they will need to be
expressed in various human languages. Language-independent
numeric codes are therefore used wherever
possible as a common reference to the field name or
field value regardless of the language currently being
used.
Field names: Each field name is represented by a
simple numeric integer such that a given field with a
particular meaning has the same numeric code
regardless of the language in which its name and
definition are expressed. For example, a Field ID of "7"
has the name "Rock type" when expressed in
English. The field names themselves have been kept to a length of 25
characters.
Field values: Where field values can be chosen from
a fixed vocabulary, they are represented by a simple numeric integer code
such that a given field value with a particular meaning has the
same numeric code regardless of the language being
used. For example, a Field Value of "26" in Field 7
(Rock Type) translates to "sandstone" when
expressed in English, or "Sandstein" when expressed in
German.
Where commonly accepted local field values or codes
already exist for a field which has only local
significance, for example, "Geological Bed Names" or
"Parish", then these local codes should be used, but the
meanings will then need to be transferred, along with the
data, in any data exchange.
Field definitions: The formal definition for a field should normally
include where appropriate: the definition itself, any comments about the definition,
an example, any usage notes, and a Help text for use during data entry.
There should also be scope for the addition of locally based comments, examples and usage notes.
Each definition should be set first in English, then after its approval, translated into
as many other languages as possible.
To help find whether a particular field exists in the repository, they need to be classified
according to their subject. A preliminary Speleological Subject Classification
(SCC) scheme has been created, based on the scheme used by the UIS Bibliography Commission.
3. Transfer format
When transferring data between different databases,
UIS's standard transfer format should be used. This format will use only
standard ISO text characters, and will be independent of any
database software or structure. Therefore any database
system needs only to be able to translate to or from this
one common intermediate format in order to exchange
data with any other co-operating database system. This format or markup language
is likely to use XML (Extensible Markup Language) and its associated
standards.
To translate to or from the transfer format, the project will include an
open-source, multi-platform program to translate from
csv or fixed-width text files to the transfer format at the source end, and a
similar program to translate from the transfer format to csv or fixed-width text files
at the target end. It will also include text file templates for describing the
table structures at each end of the transfer for use by the respective program. The lengths in the following list should be used for the
fixed-length serial number component in the record IDs
of the respective entities. Note that the serial number
need only be large enough to allow for the maximum
number of records for that entity generated by the one
organisation, not for the quantity of records stored at any
one site; this is because any duplicate serial numbers will
be distinguished by the originating country+organisation
code.
The list is a draft initial list only. Further entities can be
added as needed. The original two-letter codes were chosen
to reflect the entity in more than one language where
possible; however in keeping with our policy of
language-independence, these are expected to be changed to
numeric codes, allowing users to substitute letter codes
related to their own language if desired.
[ Top ]
[ Contents ]
[ Updates ]
[ Fields List ]
[ Database tables ]
[ Subjects ]
[ UISIC home ]
[ UIS ]
The first two requirements above (identifier and
definitions) should be used from the beginning of a new database if
possible. It does not matter which database software you
use, nor the structure of your database, nor which subset
of the available fields you have chosen, provided that
you have adhered to the field and field value definitions.
For example, multi-valued fields have to stay as multi-valued fields.
The fact that many of us already have cave databases in
existence, and are already using various independent
field definitions, should not be a reason to prevent us
from establishing a standard which can be used by new
systems, or by later evolution of our existing systems
if/when we feel that the time is right. Further, as we go
through the field definitions, it is expected that we can
come up with definitions which will allow many of our
existing fields to comply with them anyway. In fact, one
of the fields in the proposed list allows classifying the level
of compliance of each existing field. Any existing fields
which are found to already comply with the standard
definitions could then be flagged as such, and validly transferred
to other databases.
The method described (a country code + an org code)
allows each organisation which produces data records to
issue internationally unique keys without needing to
refer to any central authority. The 3-letter org codes
would be set at the national level by the speleo
community in that country.
The serial number part is fixed-length, left-zero-filled, so
that the alphanumeric record IDs will sort correctly
when required. The serial number component for the ID
of a particular entity needs only to be large enough to
cover the maximum quantity of records for the entity
which could be generated by the one organisation, as
opposed to the maximum quantity of entity records
stored at any site. The proposed entity IDs and key
lengths are shown in the table above.
Public "cave numbers", while needed for normal public
usage, are not ideal for this identifier because they do
sometimes get changed, they vary in their structure, and
they can be unnecessarily long.
The record identifier does not need a component to
identify the entity type, because this can readily be
handled externally.
The scheme described above is currently already in use
in the Australian national cave/karst database.
Max Records
Length of created
ID Entity (English) Serial No. by one Org.
---- ------------------ ---------- -----------
AR article, paper 6 1M
AT attribute, field n/a
AV attribute value n/a
CA cave/karst feature 5 100K
EN entity n/a
OR organisation 4 10K
PA land parcel 5 100K
PB publication 4 10K
PE person 5 100K
PH photograph 5 100K
PL plan, map 5 100K
PM permanent mark 5 100K
PS map series 3 1K
RE region, area 4 10K
RL role 3 1K
RP report 5 100K
SM specimen 5 100K
SP species 5 100K
ST site, place 3 1K
SV survey 5 100K
SU subject n/a
SY system n/a
XK key-in batch 5 100K
XL upload batch 5 100K
XU update batch 5 100K
DISCUSSION
General
1. Record identifiers
Unique IDs
The use of an internal identifier (key) is normally
routine for identifying and linking database records.
Ideally these should normally be a simple integer containing
no actual data, however in this case they need to be globally unique so that there is
no risk of duplicating an existing key when loaded
into someone else's database. We do not want to have to
change the incoming key in such a situation, because
then any linkages ('foreign keys') between entities in the original
incoming tables would be lost.
Merging records
Regardless of record ID design, if data will be consolidated between databases for the same cave or other entity, then arrangements will need to be made for merging the data under a single record ID, usually that of the destination database.
Where local clubs or individuals are contributing to a national or other higher level database, alternatives to complex merging software could be (1) the higher level database is online, so that all contributors can log on in order to add or upload their data, or (2) distributed local databases, but where only one club is responsible for adding and uploading data, both its own and contributed, for any particular cave area. In both cases, this permits the use of a single record ID for each cave or other entity.
In all cases, a mechanism is needed to allow data contributors to get proper attribution for the data they have provided. The UIS field repository already includes many fields to allow this attribution, and sample table structures for both attribution and data quality are available.
Note however that if the key was changed in this way,
any instances of its use as a "foreign key" in
tables linked to other entities must still also be changed. (A
"foreign key" is a non-key field (usually) in a table whose
value is the same as the key field(s) of a different table.).
For example, a map entity record which describes a map
might have a field containing the ID of a cave entity
which was shown on that map; when the map and cave
records are exported with their new IDs, the cave ID value in the foreign key
field of the map record must be altered in the same way
as the cave records were. Obviously it's much simpler if
a once-only change can be made to the whole database to
align its keys and foreign keys to the international
standard; from then on, no more key conversions need to
be made.
2. Field definitions
Procedure
Field definitions will be systematically discussed in
English via the Internet before being circulated to UISIC
delegates for further comment and eventual voting. The
initial batch of fields will be first-pass general caving fields;
after some of these are out of the way and the repository mechanisms
are bedded down, we can invite contributions of
fields which are more scientific or specialised.
The suggested procedure is (improvements invited!):
An early version of a transfer format was successfully devised and used by the Australian cave and karst database in 1985 when ASF used it to produce their national cave, map and reference list, the 500-page book, Australian Karst Index 1985. UISIC subsequently developed this format further and issued a draft standard to delegates for comment at the UISIC meeting during the 1989 UIS Congress in Budapest. In 1991 ASF produced a standard formalising their Karst Data Interchange (KDI) format as used in 1985. Since then, a programme has been produced by Glenn Baddeley which translates from this early KDI transfer format into a series of plaintext csv files, each being a table for importing into a multi-table relational database. This was demonstrated at the 1993 UIS Congress in Beijing, and was used in 1999 by ASF to convert all its old 1985 data into its new PC-based relational database. From there it was exported to today's online web-based MySQL database.
Based on the foregoing successful experience, UISIC had planned to issue an updated version of the 1989 UISIC draft for further discussion and comment, however in the meantime, XML and its associated standards had become available, so the whole exchange format issue was revisited by a special UISIC Working Group. This WG aimed to use cave survey fields initially, because of significant interest in being able to transfer cave survey data. However, dealing with cave survey data proved to be too complicated and difficult at that time, and the WG eventually went into hibernation.
The plan is to start a separate WG which will pick up the threads of the old group, but this time use general cave data while developing the format (a markup language), which we have called CaveXML. Hopefully the old group can in due course recommence work on cave survey data transfer, working within the CaveXML format.
This use of XML also implies expressing the previously discussed field definitions by using XML and its associated standards.
Regardless of the actual exchange format, the basic technique for exchanging data would be as follows:
[ Top ] [ Contents ] [ Updates ] [ Fields List ] [ Database tables ] [ Subjects ] [ UISIC home ] [ UIS ]
Overview of what needs to be done:
[ Top ] [ Contents ] [ Updates ] [ Fields List ] [ Database tables ] [ Subjects ] [ UISIC home ] [ UIS ]
2010-07-19
2009-01-24
2002-12-09
2000-11-29
1997-10-26
1997-07-20 Original version
[ Top ] [ Contents ] [ Updates ] [ Fields List ] [ Database tables ] [ Subjects ] [ UISIC home ] [ UIS ]