Data History & Attribution Subsystem - Table Design

Updated: 15 Jul 2002

This is an example of a physical table structure for recording Data History, or Lineage, one of the components of Data Quality. It also allows for Data Attribution, i.e. acknowledgement of who supplied the data (person and/or organisation). See also the further explanatory notes below. Comments and suggested improvements are invited.

Tips for using this page:

To view the full width of the tables it may be necessary to change the size of your browser's font.
To find a particular Field ID, use your browser's text search facility.

[ Notes ] [ Legend ] [ Updates ] [ Other Tables ] [ Proposal ] [ Fields ] [ Field definitions ] [ UISIC ] [ UIS ] [ ASF ]


                                                +------
                                                ¦
CA0000                                          ¦
Cave (for example) Master Table                 ¦
+---+-------------------------+----+ 1   »      ¦
¦227¦Cave ID                  ¦A10*¦¦»----------+
¦...¦etc etc other fields     ¦... ¦            ¦  
+---+-------------------------+----+            ¦  
Similar master table for other entities         ¦ 
which are recording data history.               ¦ 
                                                ¦
                                                ¦
CA0421                                          ¦  
Cave Updates (m-m link)                         ¦  
+---+-------------------------+----+ 0:n «      ¦  
¦227¦Cave ID                  ¦A10*¦¦«----------+  
¦421¦Update ID                ¦A10*¦¦»-----+    ¦  
+---+-------------------------+----+ 1 »   ¦    ¦  
This table enables links to all            ¦    ¦  
Update Batches for a Cave (shown), and     ¦    ¦ 
all Caves for an Update Batch (not shown). ¦    ¦  
It would be automatically added to by the  ¦    ¦  
program whenever a cave is added or        ¦    ¦  
updated during an Update Batch.            ¦    ¦  
                                           ¦    ¦
                                           ¦    ¦
XU0000                                     ¦    ¦  
Update batches                             ¦    ¦  
+---+-------------------------+----+ 1:1 « ¦    ¦  
¦421¦Update ID                ¦A10*¦¦«-----+    ¦
¦422¦Update - person ID       ¦A10 ¦L 2         ¦  
¦423¦Update - organisation ID ¦A9  ¦3           ¦  
¦424¦Update - date            ¦D   ¦2           ¦  
¦425¦Update - person          ¦A25 ¦1 3         ¦  
¦426¦Update - org code        ¦A3  ¦L 2         ¦  
+---+-------------------------+----+            ¦  
This table shows who (person, org, date) has    ¦
supplied this batch of primary data. It may not ¦
be the same org as operating the database site. ¦
The Update ID is auto-generated. One Update ID  ¦
(batch) could cover several entities.           ¦
(1) If not available via ID link 422.           ¦  
(2) Mandatory entry by user.                    ¦
(3) Auto-filled from Person or Org table.       ¦
                                                ¦  
                                                ¦
                                                ¦
CA0427                                          ¦  
Cave Key-in Sessions (m-m link)                 ¦  
+---+-------------------------+----+ 0:n      « ¦  
¦227¦Cave ID                  ¦A10*¦¦«----------+  
¦427¦Key-in ID                ¦A10*¦¦»-----+    ¦
+---+-------------------------+----+ 1  »  ¦    ¦
This table enables links to all            ¦    ¦  
Key-in Batches for a Cave (shown), and     ¦    ¦ 
all Caves for a Key-in Batch (not shown).  ¦    ¦  
It would be automatically added to by the  ¦    ¦  
program whenever a cave is added or        ¦    ¦  
updated during a Key-in Session.           ¦    ¦  
                                           ¦    ¦
                                           ¦    ¦
XK0000                                     ¦    ¦  
Data key-in sessions                       ¦    ¦
+---+-------------------------+----+ 1:1 « ¦    ¦
¦427¦Key-in ID                ¦A10*¦¦«-----+    ¦
¦428¦Key-in - person ID       ¦A10 ¦L 2         ¦
¦545¦Key-in - organisation ID ¦A9  ¦4           ¦
¦429¦Key-in - start date      ¦D   ¦4           ¦
¦449¦Key-in - end date        ¦D   ¦4           ¦
¦430¦Key-in - person          ¦A25 ¦1 3         ¦
¦546¦Key-in - org code        ¦A3  ¦5 6         ¦
+---+-------------------------+----+            ¦
This table shows who originally keyed in the    ¦
primary data. It may not be the same org as is  ¦
operating the current database site. Where      ¦
keyed data is imported from another             ¦
site, the original Key-in ID should be          ¦
retained and a new one not generated.           ¦
The Key-in ID is auto-generated.                ¦
One Key-in ID (batch) could cover several       ¦
entities and several Update batches.            ¦
(1) If not available for display via            ¦
    ID link 428.                                ¦
(2) Mandatory entry by user.                    ¦
(3) Auto-filled from Person table.              ¦
(4) Auto-filled by system.                      ¦
(5) If not available for display via            ¦
    ID link 545.                                ¦
(6) Auto-filled from Org table.                 ¦
                                                ¦
                                                ¦
CA0450                                          ¦
Cave Uploads(m-m link)                          ¦
+---+-------------------------+----+ 0:n      « ¦
¦227¦Cave ID                  ¦A10*¦¦«----------+
¦450¦Upload ID                ¦A10*¦¦»-----+  
+---+-------------------------+----+ 1  »  ¦  
This table enables links to all            ¦  
Upload Batches for a Cave (shown), and     ¦ 
all Caves for an Upload Batch (not shown). ¦  
It would be automatically added to by the  ¦ 
program whenever a cave is included in     ¦ 
an Upload Batch.                           ¦ 
                                           ¦  
                                           ¦  
XL0000                                     ¦  
Uploads                                    ¦  
+---+-------------------------+----+ 1:1 « ¦  
¦450¦Upload ID                ¦A10*¦¦«-----+  
¦451¦Upload - person ID       ¦A10 ¦L 2
¦452¦Upload - source org ID   ¦A9  ¦3
¦543¦Upload - target org ID   ¦A9  ¦3
¦453¦Upload - date            ¦D   ¦3
¦454¦Upload - person          ¦A25 ¦1
¦455¦Upload - source org code ¦A3  ¦4 3
¦544¦Upload - target org code ¦A3  ¦L 2
+---+-------------------------+----+
This table shows the who, when and where
of a data transfer between two systems. It
may be an upload to a more central system
or it may be between peer databases.
The Upload ID is auto-generated by the
source system. One Upload ID (batch) can 
include several Update IDs and entities.
(1) If not available via ID link 451.
(2) Mandatory entry by user.
(3) Auto-filled by system.
(4) If not available via ID link 452.


»-----------------------------------------------+
                                                ¦
CA0431                                          ¦
Cave (for example) latest data history          ¦
+---+-------------------------+----+ 1:1 «      ¦  
¦227¦Cave ID                  ¦A10*¦¦«----------+ 
¦447¦Data awaits uploading    ¦A1  ¦4           ¦  
¦ 76¦Latest update by         ¦A40 ¦3 4         ¦  
¦431¦Latest update ID         ¦A10 ¦» 2         ¦  
¦432¦Latest update - person   ¦A25 ¦1           ¦  
¦433¦Latest update - org code ¦A3  ¦1           ¦  
¦434¦Latest update - date     ¦D   ¦1           ¦  
¦435¦Latest key-in ID         ¦A10 ¦» 2         ¦  
¦436¦Latest key-in - person   ¦A25 ¦1           ¦  
¦547¦Latest key-in - org code ¦A3  ¦1           ¦  
¦437¦Latest key-in start date ¦D   ¦1           ¦  
¦448¦Latest key-in end date   ¦D   ¦1           ¦
¦456¦Latest upload ID         ¦A10 ¦» 2         ¦  
¦457¦Latest upload - person   ¦A25 ¦1           ¦  
¦458¦Latest upload source code¦A3  ¦1           ¦  
¦548¦Latest upload target code¦A3  ¦1           ¦  
¦459¦Latest upload - date     ¦D   ¦1           ¦  
+---+-------------------------+----+            ¦  
This optional cave table shows only the latest  ¦
updates for a cave. It is read-only, all values ¦
having been auto-filled by the system.          ¦
If data history is being kept via the other     ¦
tables then this table is not necessary, and    ¦
field 447 would be held instead in the Cave     ¦
master table. If data on only the latest updates¦
is being kept, then omit the other tables and   ¦
keep it here. If minimal latest-only data is    ¦
being kept, then store only fields 447 and 76.  ¦
If desired, all candidate fields for this table ¦
can be stored instead in the Cave master table  ¦
and this table omitted altogether. Other        ¦
entities would be similar to all the above.     ¦
(1) If not available via associated ID link.    ¦  
(2) If the many-to-many link tables are not     ¦
    being used by the Cave's master table,      ¦
    then this field could 1:1 link to its resp  ¦  
    history entity master table to avoid the    ¦
    need to store the associated following      ¦
    fields here.                                ¦
(3) Old ASF field - to be replaced by 431 etc.  ¦  
(4) Put in master table if this table is not    ¦
    present. This flag is set as soon as any    ¦
    addition or update is done to this Cave     ¦
    since the last Upload, and reset after the  ¦
    Upload.                                     ¦  
                                                ¦
                                                ¦
CA0460                                          ¦
Field value qtys updated per upload per org.    ¦
+---+-------------------------+----+ 0:n      « ¦  
¦227¦Cave ID                  ¦A10*¦¦«----------+
¦450¦Upload ID                ¦A10*¦
¦460¦Field ID updated         ¦S*  ¦
¦519¦Updating organisation ID ¦A9* ¦
¦461¦Values qty before upload ¦S   ¦
¦462¦Values qty after upload  ¦S   ¦
¦463¦Qty of new/changed values¦S   ¦
+---+-------------------------+----+
This table keeps the history of the 
quantity of updates for each field. 
It is also used in calculating the 
total update count for an organisation 
at any given time. It could be the 
result of several Update Batches by 
that organisation in the Upload Batch. 
This View shows the fields updated and 
the organisation for a given cave and
Upload. It may be necessary to limit 
an Upload to Updates from only one 
organisation (see main Notes). This 
read-only table exists only in the 
central db. All its fields are 
auto-filled by the system.



The tables below enable linking between
Updates, Key-ins and Uploads. For example
they would enable you to find all the
Update Batches included in a particular
Upload. These tables would be 
automatically added to by the data-entry
program during normal operations.

XL0421
Updates per Upload.
+---+-------------------------+----+
¦450¦Upload ID                ¦A10*¦
¦421¦Update ID                ¦A10*¦
+---+-------------------------+----+


XL0427
Key-ins per Upload.
+---+-------------------------+----+
¦450¦Upload ID                ¦A10*¦
¦427¦Key-in ID                ¦A10*¦
+---+-------------------------+----+


XK0421
Updates per Key-in
+---+-------------------------+----+
¦427¦Key-in ID                ¦A10*¦
¦421¦Update ID                ¦A10*¦
+---+-------------------------+----+

The table below is an alternative 
location for 447[Data awaits uploading] 
when updated status is being monitored
on a per field basis rather than on the
per cave basis shown above.

CA0447                                
Cave (for example) field update status
+---+-------------------------+----+ 
¦227¦Cave ID                  ¦A10*¦
¦ 82¦Field ID                 ¦S*  ¦
¦447¦Data awaits uploading    ¦A1  ¦
+---+-------------------------+----+

Notes:

General:

Each of the main 3-column rectangles above represents a data table in a database, and lists the fields, i.e. columns, which are in that data table. Note that the rectangles are not a picture of the tables, they are just the lists of fields forming the columns of each table. The rows of each data table would be a separate instance of what that table was about, e.g. each row in the cave master table (Ca0000) would represent another cave.
The links show some of the ways in which the tables would be linked in order to show particular "views" of the data. For example, the links between the first three tables in the left column would enable a screen or report to be created where you could see the who and when of all updates to a given cave.
The reason that the diagram has been built up from standard text characters is so that the page's source code can be copy/pasted to allow easy plain-text emailing of table designs during discussion. Although the tables appear in two columns above, they are in a single column in the source code. The characters have been chosen from ISO-8859-1 and will render in the same way also in Central European: Windows-1250; Cyrillic: KO18R and Windows-1251; Baltic: Windows-1257; Greek: ISO-8859-7 and Windows-1253; and Turkish: ISO-8859-9.

Data History

Although the diagram shows how the Data History tables link to the Cave entity tables, they could apply to any other cave/karst entity, e.g. cave maps, in the same way.

Attribution quantities may be required for use in the aportionment of income from any data licencing revenue.

An Upload Batch may contain several Key-in Batches, and a Key-in Batch may contain several Update Batches.

Here is a typical basic sequence of events: (For "cave" below, also read any other entity. See also the notes below each table diagram above.):

A club or person provides a batch of new and/or updating cave data to the local database administrator (DBA), who may in fact be in a different club. This could be for one or more caves. One batch of such data by one person or org is referred to as an "Update Batch". It would typically be paper-based data, but could also be in digital or any other format. For example, in the Australian context it would typically be data from multiple club trips accumulated on to the paper Australian "Cave Summary" form.
Sometime later the DBA or assistant enters a batch of this data into the local database. This could consist of one or more of the above Update Batches. The collection of one or more Update Batches which are keyed in during one effective session are referred to as a "Key-in Session". A Key-in Session could occur as several actual sessions close together - whatever the DBA feels is reasonable to refer to as one "session".
Upon starting the above Key-in Session the operator causes the system to assign a new Key-in ID to identify the session, and when each Update Batch is started the operator causes the system to assign it a new Update ID, noting the details of the original supplier and the data date. Thus the people supplying the data, the person entering it, and the dates can be identified in the database. The Update and Key-in IDs are assigned from their respective series on this database site.
If at some time later the local DBA uploads some or all of the data to another database, e.g. from local to state, or from state to national, or to a peer database, then the Upload Batch is allocated an identifying Upload ID. The data uploaded should include the Upload ID and all the relevant Update IDs and Key-in IDs, and possibly earlier for the data uploaded.
The above arrangement allows the recording of who has supplied data (person and/or org), when, which fields, and how much data. It also lays the foundation for recording who supplied any particular piece of data, which is explained in more detail on the Data Quality page.
In practical terms, much of the identifying data can be automatically supplied by the data-entry software program at the time, causing minimal extra data entry by the operator in order to achieve the recording of the data history.

The various ID types are structured according to the recommendations in UISIC's data exchange proposal and are taken from a database site's series as follows:
Update Batches: Allocated by the Key-in site.
Key-in Batches: Allocated by the Key-in site.
Upload Batches: Allocated by the database site initiating the upload. However if the data is being uploaded interactively or from a site which does not allocate conforming Upload IDs, then it would be allocated by the target site.
Note that if digital data is being imported rather than being keyed in, and it is missing Update IDs and/or Key-in IDs, then these can be allocated by the importing site, attributing the data and/or key-in to the correct organisation and person.
A distinction needs to be made between the data date and the date that the data was submitted for data entry (the Update date). The Update Batches table above stores the submission date, whereas the Data Quality tables store the data date. For example, if in 2002 a cave records keeper found an early report dated 1987 which contained some new information, and he submitted that information for data entry in 2002, then the submission date field 424[Update - date] would be the 2002 date, and the data quality field 472[Data date - year] would be 1987.
Of course there is a range of levels of data history which a site may decide to store. Some examples:
- Store no data history at all, where only the lastest data is stored, with no record of where it came from or how old it is.
- Just keep details of who supplied the latest updates, and that only on a per-cave basis.
- Keep the lot, down to field level, as shown on this page. See also the Data Quality page.

Data Attribution Quantification

While it is quite easy to record who supplied data for a cave and when, it is more difficult to numerically quantify just how much updating can be attributed to various organisations or individuals. This quantification may be needed if some kind of updating "score" is being kept to encourage competition in updating among clubs, or if there is a need to distribute any revenue from data licensing or publication in an equitable way. Below are some points concerning this subject. Where "organisation" is mentioned, an individual who supplies data independently is also assumed:

The count should probably be calculated and stored by the most central database site of the system, typically a national or state site which is contributed to by local databases and organisations. One of the reasons for this is because the count has a certain kudos or monetary value associated with it, and so precautions need to be taken to minimise the opportunity for falsification.
Obvious opportunities for falsification would be by adjusting the update counts directly, or by supplying many pseudo-updates (a series of changes with a net result of no change). Having the central data readily available to the caving population can help reduce the latter.
An "update" would be calculated at field level and could be: A new value added, an existing value changed, or an existing value deleted.
If update counts are being kept, then the data flow through the system needs to be organised so that at the counting site, the incoming data attributable to each organisation can be distinguished.
The above implies that either:
- the supplying organisation connects directly to the central site to enter or upload their updates under their own name; or
- updates are coming via an intermediate site which collects data from more than one organisation, in which case this site's uploads to the central site must segregate the data according to which organisation originally supplied it. The latter implies that any upload successfully groups incremental updates according to the supplying organisation, and in the correct time sequence, so that updates to the same field in the same cave by different organisations should result in the correct final value for that field. If multiple organisations have been updating the same fields in the same cave on that database site, then simply sending the final current values for those fields is not adequate, and the incremental-updates-stream method must be used.
A simpler approach would be:
- Orgs without a local database: update the central database directly and under their own name, e.g. directly via a web interface or uploading via a web interface.
- Orgs with a local database: Key in only their own updates into their database and upload only these updates to the central database, not those from other organisations. To keep their own database up-to-date with data from other organisations, download the latest data from the central database, but take care not to flag it as new data for uploading again.

The incoming data at the counting site can be compared to the existing data and the counts thereby calculated - incoming data cannot be assumed to be new data. Counting can be done as follows for both single- and multi-valued fields:

For each incoming field:
   Reset the counters.
   Record the field's existing value(s).
   Record and store the quantity of existing values.
   Apply the incoming update(s) for the field.
   Record and store the quantity of final values for the field.
   For each final field value:
      If its value is not among the initial values:
        Increment the new/changed count.
      Endif
   Endfor
   Store the new/changed count.
   If final qty < initial qty of field values:
      deletions count = initial qty - final qty
   Endif
   totalupdates count = new/changed count + deletions count
Endfor

Legend:

Columns shown:
1. Standard UIS Field ID (FID).
2. Field name.
3. Data type.

Codes:
* = A primary key. 
L = A lookup table is used to set this field's values.
R = Right-justified.
» = This field provides a link to another entity, i.e. is a "foreign key".
x = This field is not yet included in the Aust Karst Index.
- The names above the table descriptions are example filenames for the 
  tables using max 8 chars.
- Any numbers to the right of a field refer to the notes below that table.

Table linkages (respective instances in detail table per instance in master table):
0:1 = from 0 to 1 instances
0:n = from 0 to many instances
1:1 = always 1 instance
1:n = at least 1 instance
The vertical bar(s) at each end of a link show the actual fields linked.

Data types:
S    = Short integer, up to 32,768.
N    = Numeric, decimals allowed.
Annn = Alphanumeric up to 255 chars long (A1-A255).
D    = Date.
M    = Memo (variable length free text).
B    = BLOB (Binary Large Object, e.g. a photo image).

[ Top ] [ Tables ]

Updates:

15-Jul-2002

Added para regarding data date and data submission date.
Table CA0447 Latest Data History renamed to CA0431, and added new table CA0447 Field Update Status.
Added comment about various levels of data history which could be kept.
Added Attribution to title.

7-Jun-2002. Initial release.

Copyright © 2002 Peter Matthews. May be reproduced for non-commercial use provided this copyright paragraph is retained. This ongoing development of field definitions etc is being conducted by the Informatics Commission (UISIC) of the International Union of Speleology (UIS). The Australian Speleological Federation's (ASF) national Karst Index Database has been used as a pilot.
Page address: http://www.uisic.uis-speleo.org/exchange/histables.html
P. Matthews. Email: matthews@melbpc.org.au