Customized Schema Subsets


The Problem

The Justice XML Data Dictionary (JXDD) is the result of an effort by the Justice and Public Safety community to produce a set of common, well-defined data elements to be used for data transmissions. To accommodate the needs of a large and varying user base, the dictionary has grown relatively large. Though a large dictionary in itself is not a problem, users can experience difficulties when using the large XML schema generated from the full dictionary.

Some of the commonly heard comments regarding the size of the schema are:

To address some of these concerns, we will start off by saying that yes, the full JXDD schema is too big for some of the popular XML tools. Validation in tools like XMLSpy is time-consuming at best and impossible at worst, depending on the capabilities of the system running it. The full schema is over-inclusive. Many users will not want every element to be able to occur repeatedly. Using some of the components as-is will result in having available elements that are prohibited from certain transmissions. Few people will probably ever need to use the entire contents of the full schema. The size of the full Justice schema can prevent many users with the products available today from being able to use it effectively.

The Solution

The release of the JXDD Schema is just one step in a multi-step process. Our first step was to compile a list of the commonly used data elements from actual sources and to narrow that list down and refine it to result in a set of precise and well-defined data components. We have accomplished the bulk of that step, although refinements continue to be made when needed as people review the work. Our next step is to provide a mechanism that enables users to retrieve ONLY those components from the dictionary that are needed. This is the basic idea behind customized schemas or what we call schema subsets - smaller schemas that define only those components from the dictionary that the user wants to include. Can the full Justice schema continue to be used? Yes. Does it have to be? No, certainly not.

The smaller schema subsets can be much more manageable than the full Justice schema. Notice that when we were listing the problems that users were experiencing, we identified them to be problems due to the size of the full schema, not problems due to the size of the dictionary. The schema subset approach has the benefit of providing (1) a large, precise dictionary of components that is needed to accommodate a large base of users across different domains, and (2) smaller schemas for validation that define only the components needed for a given application.

The Details

The general idea of a schema subset actually represents multiple documents to be extracted and formed by the JXDD Registry. The full set of these documents includes a schema subset, external schema subsets (representing code tables), an extension schema, and a document schema. Each is discussed below.

The Schema Subset

A schema subset is an extraction of the full Justice dictionary. Instead of using a full schema that defines everything from the dictionary, users can use customized schema subsets that define only those components from the JXDD that they want. This schema subset defines nothing new; everything within the subset is already defined in the dictionary. To maintain that reference to the dictionary, the schema subset is defined in the same namespace as the full JXDM schema. However, schema subsets will not be publicly available. The user will own the subset and will be responsible for maintaining it.

Three types of operations will be allowed in creating a schema subset: filters, occurrence restrictions, and restrictive facets. The filter allows a user to choose exactly which components to include in the subset. This is a simple include/don't include decision for the existing elements, types, and enumerations (codes). Structures cannot be modified or flattened, names and definitions cannot be changed, elements and types cannot be added, type inheritances cannot be changed, and so on. This is not to say that some of these operations cannot be done later, but they are not allowed in constructing a schema subset. This filter serves two purposes. It allows users to get rid of general elements that are not related to what they need. It also allows users to remove elements from components that are not allowed to appear for certain document transmissions. For example, a certain document regarding juvenile information may be prohibited from carrying such things as the juvenile's name. If PersonName were not removed from the schema subset, it would be possible to transmit this information.
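The filter operation can be sketched in a few lines of Python. This is only an illustration, not the way the subset tool is implemented: the schema fragment is a heavily simplified, hypothetical stand-in for a JXDD type, and the helper names `filter_type` and `element_refs` are invented here.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Hypothetical, simplified stand-in for one type in the full schema.
FULL_SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="PersonType">
    <xs:sequence>
      <xs:element ref="PersonName" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="PersonBirthDate" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="PersonSex" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""


def filter_type(schema_xml, type_name, keep):
    """Parse the schema and drop every element ref of type_name not in keep.

    Nothing is renamed, added, or restructured: this is a pure
    include/don't-include decision, as the filter rule requires.
    """
    root = ET.fromstring(schema_xml)
    for ctype in list(root.iter(f"{{{XS}}}complexType")):
        if ctype.get("name") != type_name:
            continue
        for seq in list(ctype.iter(f"{{{XS}}}sequence")):
            for el in list(seq):
                if el.get("ref") not in keep:
                    seq.remove(el)
    return root


def element_refs(root, type_name):
    """List the element refs that remain inside type_name."""
    return [
        el.get("ref")
        for ctype in root.iter(f"{{{XS}}}complexType")
        if ctype.get("name") == type_name
        for el in ctype.iter(f"{{{XS}}}element")
    ]


# A juvenile document that must not carry a name: filter out PersonName.
subset = filter_type(FULL_SCHEMA, "PersonType", {"PersonBirthDate", "PersonSex"})
print(element_refs(subset, "PersonType"))
```

Because PersonName is no longer defined inside the subset's PersonType, an instance that includes it would fail validation against the subset.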

The second type of operation allowed is restricting the number of occurrences on elements. Users can decide how many times each element may appear in their documents. This solves the problem users have with the over-inclusive nature of the dictionary. It would be impossible to get everyone to agree on a single set of occurrence restrictions so we allow each user to determine this for him/herself. These restrictions can be thought of as creating an occurrence subset. The dictionary allows elements to occur an unbounded number of times. The schema subset allows users to choose any non-negative integer number of occurrences.
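The idea of an occurrence subset amounts to a range check: the occurrence range chosen in the subset must lie inside the range the full schema permits. The sketch below assumes the full schema declares `minOccurs="0"` and `maxOccurs="unbounded"` by default; the lower bound of zero is an assumption, and the function names are hypothetical.

```python
import math


def parse_max(value):
    """Convert an XML Schema maxOccurs value to a number ('unbounded' -> infinity)."""
    return math.inf if value == "unbounded" else int(value)


def is_occurrence_subset(sub_min, sub_max, full_min=0, full_max="unbounded"):
    """True if [sub_min, sub_max] is a valid range contained in [full_min, full_max].

    The defaults model a dictionary element that may occur any number of
    times (the minimum of 0 is an assumption made for this sketch).
    """
    lo, hi = int(sub_min), parse_max(sub_max)
    return int(full_min) <= lo <= hi <= parse_max(full_max)


print(is_occurrence_subset(1, 1))        # exactly one occurrence
print(is_occurrence_subset(0, 5, 0, 3))  # widens a full-schema limit of 3
```

The first call describes a legal restriction. The second would allow more occurrences than the full schema does, so it is not an occurrence subset.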

The last type of operation allowed is creating restrictive facets for existing elements. Facets constrain the set of possible data values permitted for a particular field. For example, a user may wish to restrict the length of a name to 30 characters. Another user may wish to restrict the possible values of a person's age to a maximum value of 150 rather than permit any positive integer. Any facets that are added must be restrictive; they cannot allow values to occur in the subset that could not occur in the full schema. In effect, this is a subset of permissible data values.
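The requirement that facets be restrictive can be illustrated with a small check. The facet names are standard XML Schema facets, but the function itself is a hypothetical sketch covering only four of them; it is not part of any JXDD tool.

```python
def is_restrictive(facet, base_value, new_value):
    """True if new_value narrows (or equals) the base value of the facet.

    base_value is None when the full schema sets no such facet, in which
    case any value only narrows the permitted data.
    """
    if base_value is None:
        return True
    if facet in ("maxLength", "maxInclusive"):
        return new_value <= base_value   # an upper bound may only move down
    if facet in ("minLength", "minInclusive"):
        return new_value >= base_value   # a lower bound may only move up
    raise ValueError(f"facet not covered by this sketch: {facet}")


# The two examples from the text: a 30-character name and a maximum age of 150.
print(is_restrictive("maxLength", None, 30))
print(is_restrictive("maxInclusive", None, 150))
# Raising an existing limit of 30 to 50 would admit values the base forbids.
print(is_restrictive("maxLength", 30, 50))
```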

External Schema Subsets

A smaller JXDD schema subset will result in some performance improvements, but will not be sufficient unless we can also reduce the size of the external schema imports. The external schemas are primarily the files that define the external codes outside of the JXDD, such as NCIC codes. These external schemas play a major part in the current performance problems of using the full JXDD schema. Removing unneeded codes can result in much smaller imports and much improved performance time for loading and validation. Like the subset schema, the external schema subsets will be defined in the same namespace in which their full counterparts are defined, but storage and sharing of these schemas will be the responsibility of the user.

The same set of operations available for schema subsets is available here, although filtering is the one most likely to be applicable. A user will be able to filter on elements (if they exist), types, and codes. For example, a user could remove the entire set of NCIC codes relating to vehicles by filtering out some types. A user could also filter code values by removing all of the county codes from the set of United States county codes except for the ones for his own state. Users must decide for themselves which filters make sense for their own purposes.
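Filtering code values works the same way as filtering elements, only on enumerations. The sketch below is illustrative: the type name `CountyCodeType` and the code values are invented, and real code tables use their own formats.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Hypothetical code table; the type name and code values are invented.
COUNTY_CODES = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="CountyCodeType">
    <xs:restriction base="xs:token">
      <xs:enumeration value="GA-Fulton"/>
      <xs:enumeration value="GA-DeKalb"/>
      <xs:enumeration value="FL-Duval"/>
      <xs:enumeration value="TX-Travis"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""


def keep_codes(schema_xml, type_name, predicate):
    """Remove every enumeration value of type_name for which predicate is false."""
    root = ET.fromstring(schema_xml)
    for stype in root.findall(f"{{{XS}}}simpleType"):
        if stype.get("name") != type_name:
            continue
        restriction = stype.find(f"{{{XS}}}restriction")
        if restriction is None:
            continue
        for enum in list(restriction.findall(f"{{{XS}}}enumeration")):
            if not predicate(enum.get("value")):
                restriction.remove(enum)
    return root


def code_values(root, type_name):
    """List the enumeration values still defined for type_name."""
    return [
        enum.get("value")
        for stype in root.findall(f"{{{XS}}}simpleType")
        if stype.get("name") == type_name
        for enum in stype.iter(f"{{{XS}}}enumeration")
    ]


# Keep only the codes for the user's own state.
subset = keep_codes(COUNTY_CODES, "CountyCodeType", lambda v: v.startswith("GA-"))
print(code_values(subset, "CountyCodeType"))
```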

The schema subset and external schema subsets allow users to tailor the JXDD to their specific needs as long as the result is still a subset of the original. It must be noted again that nothing new can be created here. For the subsets to be valid, it must be possible to replace the subsets with the full JXDD schemas such that any XML instance that can validate against the subset will be able to validate against the full schema. The performance gains from the schema subsets are only as good as the filters the user has placed on the data. Including almost everything in the dictionary gets you little. Filtering out some of the large types and large external code sets that are not needed gets you a much smaller and faster set of schemas to work with (and validate with).

The Extension Schema

The extension schema is a schema that uses components from the subset JXDD schema. This is the place where new local components can be defined as long as they are done according to a set of specific guidelines that will promote interoperability. Permissible changes include defining new elements, types and codes, defining new types that are extensions of existing JXDD types, and creating and documenting new elements that reuse existing JXDD types.

Creating new elements, types, and codes is expected - no reference dictionary will ever be able to fully meet the extent of everyone's needs. Anonymous types, however, are not supported since they do not allow later reuse or possible inclusion in the JXDD registry.

New local types that extend existing JXDD types give users the ability to add things to those JXDD types that they feel are missing. The benefits to type extension are that the existing components of the JXDD type are automatically included and ready for use and that type substitution is possible. Type substitution is a mechanism in XML that allows an extended type to replace a base type. A user can create a new local type that has been extended from a JXDD type, and then add the new elements. This local type can then be used in any place in an XML instance where the original JXDD type can appear. Since the new elements are local, they will not be automatically recognized by outside users, but everything that appears in the original JXDD type will be recognizable. For example, suppose a user needs a person to include mother's maiden name. If that user creates his own local PersonType without extending the JXDD PersonType, then other users would have no awareness or recognition of the new type. However, if the user creates his own local PersonType, myPersonType, and extends it from the JXDD PersonType, then other users will be able to recognize the type structure. They will not be automatically aware of the new element, PersonMotherMaidenName, but they do not lose any existing functionality. Furthermore, the structure of myPersonType could be used anywhere that called for a JXDD PersonType.
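The type extension described above might look like the following extension schema, held in a Python string so that it can be checked for well-formedness. The namespace URIs, the `j` and `my` prefixes, the file name `jxdd-subset.xsd`, and the helper `extension_base` are placeholders chosen for this sketch, not the actual Justice namespace or tooling.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Placeholder namespaces: "http://example.org/jxdd" stands in for the Justice
# namespace and "http://example.org/local" for the user's own namespace.
EXTENSION_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:j="http://example.org/jxdd"
           xmlns:my="http://example.org/local"
           targetNamespace="http://example.org/local"
           elementFormDefault="qualified">
  <xs:import namespace="http://example.org/jxdd" schemaLocation="jxdd-subset.xsd"/>
  <xs:element name="PersonMotherMaidenName" type="xs:string"/>
  <xs:complexType name="myPersonType">
    <xs:complexContent>
      <xs:extension base="j:PersonType">
        <xs:sequence>
          <xs:element ref="my:PersonMotherMaidenName" minOccurs="0"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
"""


def extension_base(xsd_text, type_name):
    """Return the base type that a named complex type extends, or None."""
    root = ET.fromstring(xsd_text)
    for ctype in root.findall(f"{{{XS}}}complexType"):
        if ctype.get("name") == type_name:
            ext = ctype.find("xs:complexContent/xs:extension", {"xs": XS})
            return ext.get("base") if ext is not None else None
    return None


print(extension_base(EXTENSION_XSD, "myPersonType"))
```

In an instance document, an element declared with the JXDD PersonType could then carry `xsi:type="my:myPersonType"` to substitute the extended type. Receivers that know only the JXDD still recognize everything inherited from PersonType.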

Finally, type reuse is another kind of change allowed in an extension schema. If users find a JXDD type that matches the structure they need, but no existing element of that type has an appropriate name or definition, they can create and document a new element that reuses the JXDD type structure.

The user will define the namespace for the extension schema. It will not reside within the Justice namespace.

The Document Schema

The purpose of the document schema is to define the document root element and type of a document, and to define any extensions needed. Any schema that is created to represent a document should have its root type extend the Justice DocumentType or a subtype of it. This serves several purposes. It allows others to see that the local schema represents a document. If the schema is reused later or added to the registry, this is how others can search for documents and have this schema returned as a result. Finally, the Justice DocumentType also carries with it a set of common document metadata that may be useful.

The key difference between extension and document schemas is reusability. Extensions defined in an extension schema can be reused in multiple document schemas - rather than defining PersonLocalCode in ten different document schemas, it would make sense to define PersonLocalCode once in an extension schema and reuse (import) it in the ten document schemas. This reduces effort and makes maintenance easier (instead of applying updates in ten different places, you can apply them once). Extensions defined in a document schema do not have as much reusability. This is appropriate for extensions that are specific to that document and will not be reused. Whether to place extensions in both an extension schema and a document schema, or in a document schema alone, is a decision for users to make based on their own requirements.

Alternatives to Schema Subsets

One alternative to schema subsets is the use of restriction. The XML Schema restriction mechanism allows users to take a type and restrict away the elements they don't need and to modify the occurrence restrictions of other elements. While this seems like an acceptable approach, it presents some problems that may not be obvious. The first problem is that if users create a schema subset by restricting the full JXDD schema, the full JXDD schema would still be imported. No benefits would be gained in loading or validation time.

The second problem is that restrictions cannot be enforced. To create a restriction, a new local type would be created based on the original JXDD type. Elements could be dropped or their number of occurrences reduced. The local schema would still have to import the full JXDD schema to do this, but using fast validation tools would make this possible. The real problem is in usage. Elements defined to be of the original JXDD type would be able to use the local restricted type in the XML instance through type substitution; however, there is no way to enforce this type substitution to occur. It would still be entirely possible, and in fact easier, for the original unrestricted JXDD type to be used. Validation would not recognize that the local restricted type should be used instead of the original JXDD type. The only way to work around this would be to create a new local element of the new locally restricted type. Validation would then enforce that the local type be used, but the element would have no connection to the JXDD and would not be understood by others to whom the schema is sent. This loses much of the benefit gained by using the JXDD - understandability. The use of restriction is not prohibited, but it offers much less in terms of performance benefits and validation support than schema subsets provide. Furthermore, in most cases restriction is not a sufficient alternative to schema subsets.

Another alternative would be to modularize the full JXDD schema into different components, as has been suggested. The problem with this is that there is no set of lines along which modularization would work and provide the benefits desired. If the full JXDD schema were divided into smaller components, the smaller components would still need to import each other because they are all interrelated. A person module would need to reference a location module, a contact information module, an organization module, a miscellaneous module for common types, and some subset of an activity module for the person subtypes. Little performance is gained and complexity is increased.

A seemingly simpler alternative to building schema subsets would be for users to copy over only those element and type definitions that they need from the full JXDD schema into their own document schema. This approach has problems as well. If users copy over JXDD components into their document schema without putting them into the Justice namespace, then other users would not be able to recognize that those components come from the JXDD. The namespace is what identifies the common source of the components; without this, recognition is lost. If instead users copy over JXDD components into their own document schema and put them under the Justice namespace, then this tie to the JXDD is in name only. Usually, the structural definitions of JXDD components that are used in a local schema are imported from a definition schema in the Justice namespace. The full JXDD schema is one such definition schema - it is an official definition of the JXDD elements and types. The JXDD schema subset is another. A local copy of JXDD components in a document schema is just that - a local copy. There is no official structural definition schema against which to validate and ensure that the components appear as they should. The local document only validates against itself. There is no guarantee that components are actually from the JXDD; at best all you have is a claim. In either case, whether or not the local copies carry Justice namespace references, it would not be possible to reference and identify the components as valid Justice elements and types.

The Schema Subset Generation Tool

Requirements

Let's take a look at exactly what is needed in a tool to produce a set of schema subsets. (1) The tool must allow a user to search and navigate through the full Justice dictionary. This is necessary because users will need to see what is available before they can choose which parts they want to use. (2) The tool must give users the ability to create schema subsets by adding constraints. (3) The tool should allow users to create extension and document schemas by making customizations. Notice that this is not required functionality - a base tool could be built without it but would not be capable of providing the complete set of schemas. (4) The tool must be able to generate the customized schema subsets from the user input. This requires knowledge of the dictionary, data model, and the rules for creating valid schema subsets.

Standard Commercial Registries

We do not want to recreate an existing product that could meet our needs, so let's take a look at what a commercial, off-the-shelf, ebXML-compliant registry could offer us. A commercial registry could catalogue the Justice dictionary and store metadata about it, either at a component level or a document level. A commercial registry could also give users some manner of searching and retrieving data through a user interface. These are important and necessary functionalities, but are they adequate to support the construction of customized schemas?

To start with, only one class of registry could be used. This would be a registry with component level granularity. Any other type of registry would be useless for our purposes. A document level granularity would mean that the registry could only store and retrieve the dictionary as a full JXDD schema. This gives users no support in accessing and customizing individual components and defeats our purpose.

Suppose we then choose a registry that has a component level granularity. It would be able to store the dictionary (a list of elements and types with definitions) piece by piece rather than lumped together in a single document. However, there is no way for any off-the-shelf registry to have knowledge of the Justice data model that the dictionary is based upon. This data model is very important - it has some relationships built into it that give the JXDD its power and flexibility. Off the shelf, no registry would be able to utilize the JXDD to its full potential. Additionally, the registry would have no mechanism to build the schema subsets or any knowledge of how to do so.

What needs to be done now

It is apparent from the volume of comments we have received that the need for a customized schema subset generation tool is immediate. Because there is no product right now that is capable of this, it must be built. This tool should have the capabilities outlined in the requirements section above. The tool should provide a graphical user interface to allow users to search through the dictionary components, add constraints and customizations, and define customized schemas. The schema subset generation tool should take in user input and, from that input, generate a valid set of customized schema subsets, carefully formed to maintain their integrity and interoperability. The tool should then return the set of schemas to the user, who then becomes the owner of those files.

Note that since the time this page was originally written, a customized Schema Subset Generation Tool (SSGT) for the JXDD was built and has been operational since June 2004. This tool remains in spiral development and capabilities continue to be added.

Future work

Although no off-the-shelf registry product is ready to meet our current needs, it might be possible for an existing registry to be modified so that it supports the full Justice data model and all of the requirements for building customized schema subsets. To start with, this would involve some research and comparison of different registry products and analysis of potential candidates to determine whether making such modifications is feasible. If so, awareness of the Justice data model and the capacity to build schema subsets could then be added. If it is not possible to make the necessary enhancements to a commercial registry, it will be necessary to build a custom registry to fit the Justice data model.

After a registry is either modified or built, the back end of the schema subset generation tool will need to be changed to communicate with the registry. This allows code maintenance to be performed on the registry side and new versions of the JXDD to be handled automatically rather than forcing tool upgrades.

Will this tool be the only way to create schema subsets? No. There are other ways this could be done. One step for the schema generation tool will be to translate the user input specifying how to build the customized schemas into an XML request file or wantlist. This will happen in the background, transparent to the user. The wantlist would be sent to the registry. The registry would process the file and then generate and return the customized schema subsets. The format of the request file should be publicly available, so that others can create their own front-ends and still use the registry to produce the actual schemas. Another way to generate schema subsets would be to create and distribute a library that could perform the same functionality as the registry tool. A third way would be for users to go through the set of full schemas making restrictions and creating extension and document schemas by hand. Another might be through the use of Extensible Stylesheet Language (XSL). There are probably many different ways that this work could be done. The benefit of using a JXDD schema subset generation tool is that if a user specifies valid input, an appropriately and consistently formed set of customized schema subsets will be returned. Without a thorough understanding of the Justice data model, it could be very easy to unintentionally break conformance.
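To make the wantlist idea concrete, the sketch below builds a small XML request from a user's selections. The request format is not defined in this document, so the `WantList` and `Element` names and their attributes are purely hypothetical, as is the function `build_wantlist`.

```python
import xml.etree.ElementTree as ET


def build_wantlist(selections):
    """Serialize selections into a hypothetical XML request document.

    selections maps a component name to a (minOccurs, maxOccurs) pair.
    The element and attribute names are invented for this sketch.
    """
    root = ET.Element("WantList")
    for name, (min_occurs, max_occurs) in selections.items():
        ET.SubElement(root, "Element", {
            "name": name,
            "minOccurs": str(min_occurs),
            "maxOccurs": str(max_occurs),
        })
    return ET.tostring(root, encoding="unicode")


request = build_wantlist({
    "PersonBirthDate": (1, 1),
    "PersonSex": (0, 1),
})
print(request)
```

A front-end written by a third party would only need to produce a file in the published format; generating the schemas themselves would remain the registry's job.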

The Analogy

The customized schema subset approach is somewhat similar to a trip to the grocery store. The XML request file is like a shopping list, containing a list of what you want to get and how much of each. The data dictionary is like the grocery store. It contains a lot of different kinds of things and has a lot in stock for each item. Everything is available to you. You look through and pick out only those things that you want. Some items in the grocery store cost more than others, just as choosing some elements will lead to longer loading and validation times than others. It is up to the user to decide which items are worth the cost. Grocery stores are designed to allow each shopper to walk out of the store with bags of only the grocery items that each wants to buy. The customized schema subset approach is designed to allow each user to come away with a set of schema subsets that tailors the dictionary to their own specific needs.