A centroid can be thought of as a simple inverted index mechanism that can be shared amongst servers in a network environment in order to provide hints as to the location of data in a large, loosely coupled distributed database. A centroid is used by a server or user client to provide it with hints as to which other servers might contain information that is relevant to a user's search. These hints are known as "forward knowledge".
The centroids are extracted from the data held in a server and effectively remove any duplication that is present in the database. As many items of data are replicated in many records in most databases, centroids are usually much smaller than the databases from which they have been extracted.
It may be useful to distinguish between the centroid concept and the centroids themselves. The former is the idea of extracting a minimal representation of the unique data in a database, whilst the latter is a particular implementation of that idea. It is a bit like the concept of a car being a motorised vehicle, whilst an implementation of a car is a particular realisation of this concept in a particular colour, with a certain engine, and so on.
Centroids were originally developed as part of the WHOIS++ system. The centroid concept could be used with other record formats, although the resulting centroids might look a little different to those presented here. As ROADS is based on WHOIS++ and the IAFA style templates, we will concentrate on how the centroid concept can be implemented with those to support distributed search across multiple networked servers. If centroids prove useful to SBIGs in the ROADS architecture, no doubt other groups will attempt to provide them for other database search and retrieval mechanisms.
Firstly, let us consider a simple template type such as a cut-down SERVICE template that contains only Title, Description and URI attributes. We will imagine that there exists a small SBIG that has three of these templates, the contents of which are:
Template: SERVICE
Title: Social history server
Description: This server provides researchers in social history with
-resources that may be of interest.
URI: http://biguni.edu/soch-hist.html
URI: ftp://biguni.edu/pub/socialsci/history/

Template: SERVICE
Title: Military history database
Description: A database of pointers to military history resources.
URI: http://vicious.army.mil/

Template: SERVICE
Title: Medical history server
Description: A server that provides pointers to resources dealing
-with medical history.
URI: http://medhist.hospital.gov/
Obviously this is much smaller than the contents of any real SBIG's database, but it will serve to illustrate how centroids are formed from the templates in an SBIG's database. Even with this small example, it can be seen that a number of words appear in the same attribute in multiple templates. For example, all three templates have the word history in the value associated with the Title attribute.
To form a centroid for these templates we merely look through all of them for the list of unique words associated with each attribute. By unique words we mean that we only record the first instance of a word, even though it might occur many times in that attribute in all the templates. Therefore the centroid generated from the Title attribute would be:
Title: Social
-history
-server
-Military
-database
-Medical
Note that the original Title fields in the templates contained a total of nine words in the values whereas the centroid only has six words. The removal of redundant multiple occurrences of words typically makes the centroid associated with an attribute much smaller than the original data held in that attribute in all templates in the database. As real SBIG databases are obviously much larger than this simple example, the chance that the same words will appear over and over again in multiple templates is increased, and so the relative size of the centroids is likely to be smaller. We can expect that some centroids from production SBIGs may be quite large when viewed by themselves but relatively small when compared to the original data that they were extracted from.
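The extraction of unique words for a single attribute can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name attribute_centroid is hypothetical, words are split on whitespace, and comparison is case-sensitive, none of which is specified by the text above.

```python
def attribute_centroid(values):
    """Return the unique words in `values`, in order of first appearance.

    Assumption: words are separated by whitespace and compared
    case-sensitively; a real server may tokenise differently.
    """
    seen = set()
    centroid = []
    for value in values:
        for word in value.split():
            if word not in seen:
                seen.add(word)
                centroid.append(word)
    return centroid


# Title values taken from the three example SERVICE templates.
titles = [
    "Social history server",
    "Military history database",
    "Medical history server",
]

title_centroid = attribute_centroid(titles)
total_words = sum(len(title.split()) for title in titles)

print(title_centroid)
print(total_words, "words reduced to", len(title_centroid))
```

Run against the three example titles, the sketch records six unique words out of the nine present in the templates, matching the centroid shown above.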
If a database makes use of more than one template type then a centroid will contain a list of template types, the attributes contained in each type and the unique words within each attribute of each type. This means that, for example, the Title attribute of the SERVICE template would be treated as a different attribute to the Title attribute of the DOCUMENT template. This increases centroid size slightly as different template types in an SBIG's database are likely to share some common attributes and the unique words in these attributes will have to be repeated. However, the number of template types in use in a production SBIG database relative to the number of templates is likely to be sufficiently small that this will not pose much of a problem.
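One way to picture a centroid that covers several template types is as a nested mapping from template type to attribute to unique words. The sketch below assumes that representation; the function name build_typed_centroid, the dictionary layout of the templates, and the DOCUMENT record are all hypothetical and serve only to show that a word shared between types is stored once per type.

```python
def build_typed_centroid(templates):
    """Map template type -> attribute -> unique words (first-seen order).

    Assumption: each template is a dict with a "Template" key naming its
    type, and every other key maps an attribute name to a list of values.
    """
    centroid = {}
    for template in templates:
        attributes = centroid.setdefault(template["Template"], {})
        for name, values in template.items():
            if name == "Template":
                continue
            words = attributes.setdefault(name, [])
            for value in values:
                for word in value.split():
                    if word not in words:
                        words.append(word)
    return centroid


example_templates = [
    {"Template": "SERVICE", "Title": ["Military history database"]},
    {"Template": "SERVICE", "Title": ["Medical history server"]},
    # Hypothetical DOCUMENT record, invented for this illustration.
    {"Template": "DOCUMENT", "Title": ["A history of medicine"]},
]

typed_centroid = build_typed_centroid(example_templates)
for template_type, attributes in typed_centroid.items():
    print(template_type, attributes)
```

Here the word history is recorded under both the SERVICE Title and the DOCUMENT Title, which is the small size overhead described above.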
It is also worth noting that a server is free to omit any attributes it wishes from the centroids that it generates. There are a number of reasons for doing this. Firstly, there may be some attributes in the templates in the database that can only be searched by specific users, and so these attributes and their contents should not be advertised to other servers. There may also be some attributes that are in the templates and can be returned as the results of a user search but do not often appear in users' queries. For example, the URI fields in the above templates are unlikely to be searched on, because a user who already knows a URL will gain little extra knowledge by querying the SBIG service.
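Omitting attributes can be sketched as a simple filter applied while the centroid is built. The function name build_filtered_centroid, its omit parameter, and the dictionary layout of the templates are assumptions made for this illustration, not part of any ROADS or WHOIS++ interface.

```python
def build_filtered_centroid(templates, omit=frozenset()):
    """Map attribute -> unique words, skipping attributes named in `omit`.

    Assumption: each template is a dict mapping an attribute name to a
    list of values; `omit` is the set of attributes the server chooses
    not to advertise.
    """
    centroid = {}
    for template in templates:
        for name, values in template.items():
            if name in omit:
                continue
            words = centroid.setdefault(name, [])
            for value in values:
                for word in value.split():
                    if word not in words:
                        words.append(word)
    return centroid


service_templates = [
    {"Title": ["Military history database"],
     "URI": ["http://vicious.army.mil/"]},
    {"Title": ["Medical history server"],
     "URI": ["http://medhist.hospital.gov/"]},
]

# Leave the URI attribute out of the advertised centroid.
public_centroid = build_filtered_centroid(service_templates, omit={"URI"})
print(public_centroid)
```

The templates themselves still hold the URI values and can return them in search results; only the advertised centroid leaves them out.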
A full centroid for the above templates is included in Appendix A.