Techniques for Generic Modeling Using
Oracle8 and UML
Dr. Paul Dorsey, Dulcian, Inc.
Overview
The whole process of genericization is a way of thinking about data modeling. If you think along these lines, you will be likely to find many opportunities to genericize as you are building your models. Employed intelligently, the concept of genericization can greatly decrease the cost of a project. In this paper, we will discuss the kind of thinking required to achieve the optimal genericization from trivial modifications of traditional structures to very large and complex generic structures that can be dropped into many completely different systems.
Generic modeling involves trying to make models flexible and sufficiently robust so that when new requirements inevitably arise, neither the model nor the applications have to change substantially. As we capture business rules, we build the data model to support, not only those business rules, but also any similar rules. The added robustness of generic structures allows the models to support more business rules than were gathered initially. At some level, thinking along these terms makes analysis work simpler since, rather than exhaustively gathering all business rules, we need only gather the structure of the business rules. If the model supports this flexibility, then any additional business rules will fall into one of the already existing structures. Also, generic structures result in smaller models, which are easier to understand than ones with large numbers of entities.
In this paper, we will describe a variety of techniques, some of which will be familiar, and others of which are complex both from a modeling perspective and because the forms to support them cannot be generated by Designer/2000 or any similar product. Even coding these structures by hand requires very skilled developers. To some extent, this is a criticism of the generic modeling approach, in that it makes some models much more complex and difficult to implement. Another disadvantage of generic structures occurs when performing a legacy system redesign: the new generic structure will often be very different from the old structure, making the data migration mapping much more difficult. However, the advantages of generic modeling far outweigh the disadvantages, as long as you have very skilled modelers and developers on the development team. The techniques presented in this paper are organized from the simplest to the most complex.
Technique 1: Genericizing a check constraint to a value list class (reference table)
Frequently, attributes can be restricted to a list of valid values. This list can be implemented directly in the table through the use of a check constraint. The check constraint is used to restrict the values of a particular attribute to those in a given list. For example, we can restrict a gender attribute in a person table to Male/Female, or a Boolean attribute to Yes/No, etc. However, many people use check constraints for status attributes such as the status of a project, which might be Initial/Approved/In Process/Complete/Cancelled. This is normally implemented in screen applications through a radio group, and we can generate elegant-looking applications that work well. Unfortunately, the world can change, and we may want to refine our valid project statuses for a particular project. If this occurs, then we are forced to modify not only the check constraints (which is a relatively easy thing to do) but also our applications to support the additional value(s). Often, this is not so simple. Usually, modifications to neither the database nor the applications can be performed by users; both require DBA and/or developer talent.
The first genericization technique involves making a design decision to replace a check constraint with a value list class. An additional advantage to doing this is that there is no way to represent a check constraint in either an ERD diagram or a UML class diagram. In either ERDs or UML, you can, however, show reference objects or value list classes. This is not to say that value list classes should be used all of the time. There are still places where check constraints should be used. For example, the truth values of a Boolean field have been Y/N or T/F since logic systems were developed. Likewise, the list of valid genders (notwithstanding the talk show circuit) has remained similarly constant. The rule of thumb is that unless you are dealing with a list of valid values for an attribute that can never change, using a value list class rather than a check constraint is a safer and better choice. The world is an inherently dynamic place. Users will often claim that “we always use these 5 ID codes;” but it has been our experience that the model and system eventually needed to be modified to support a 6th, 7th, 8th… ID code at a later time.
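As a sketch of this technique (table and column names here are illustrative, not taken from the paper), the move from a check constraint to a value list class might look like this:

```sql
-- Before: the valid statuses are frozen into the table definition.
-- Adding a status later means an ALTER TABLE performed by a DBA.
CREATE TABLE project (
    project_id NUMBER(10)   NOT NULL PRIMARY KEY,
    status_cd  VARCHAR2(20) NOT NULL
        CHECK (status_cd IN
            ('INITIAL','APPROVED','IN PROCESS','COMPLETE','CANCELLED'))
);

-- After: the valid statuses live in a value list class (reference table).
-- Adding a status is a simple INSERT that an application can expose to users.
-- (Only one PROJECT definition would exist in a real schema; both are shown
-- here for comparison.)
CREATE TABLE status (
    status_id   NUMBER(10)   NOT NULL PRIMARY KEY,
    status_name VARCHAR2(20) NOT NULL
);

CREATE TABLE project (
    project_id NUMBER(10) NOT NULL PRIMARY KEY,
    status_id  NUMBER(10) NOT NULL REFERENCES status
);
```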
Technique 2: Overloading Reference Tables
In many cases, reference tables contain very similar information. You may end up having the same reference table appear in several different places in a model. The valid status example mentioned in the previous section is a good example here as well. “Initial/Approved/In Process/Complete and Cancelled” are common statuses for most projects or processes. Sometimes, one object may have extra possible statuses, but this list usually remains constant. We can create a single value list class for Status that can be used to support multiple structures. This can be modeled in two ways as shown in Figures 1 and 2.
Figure 1: Value list Class Example A
The second structure is more accurate since it enforces the rule that you cannot assign a project status to a system work request and vice versa. However, our perception is that the first diagramming approach makes for a less complex diagram and is adequate in this instance. In the first diagram, note that we have placed indicator attributes in the class to help identify whether or not a particular value in the value list class is appropriate for one, the other, or both Project/System Modification Requests. If you want to make the same Status value list class work for other classes, you would need to add additional indicator columns.
Figure 2: Value list Class Example B
We could also add some additional structures so that the value list class is supported by the data.
Figure 3: Data Supported Value list Class
Figure 3 shows which status values are appropriate for which types of objects. We do not believe that this structure is necessary or appropriate. Adding a new class that carries a status will require a change to the data model anyway, and at that point the only additional change needed is a new indicator attribute.
You should overload value list classes whenever they share a significant number of values. You do not want to overload them if the number of values in the reference table is large. For example, some industry code tables may have up to 100,000 values. Such tables should not be overloaded since there will be performance implications.
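A minimal sketch of the overloaded value list class of Example A, with indicator columns flagging where each value may be used (all names here are our illustrative choices):

```sql
-- A single overloaded STATUS value list class shared by Project and
-- System Modification Request. One indicator column per using class.
CREATE TABLE status (
    status_id      NUMBER(10)   NOT NULL PRIMARY KEY,
    status_name    VARCHAR2(20) NOT NULL,
    project_yn     VARCHAR2(1)  NOT NULL CHECK (project_yn IN ('Y','N')),
    sys_mod_req_yn VARCHAR2(1)  NOT NULL CHECK (sys_mod_req_yn IN ('Y','N'))
);

-- 'IN PROCESS' applies to both classes; 'CANCELLED' to projects only.
INSERT INTO status VALUES (1, 'IN PROCESS', 'Y', 'Y');
INSERT INTO status VALUES (2, 'CANCELLED',  'Y', 'N');
```

To make the same table serve a third class, you would add a third indicator column rather than a new reference table.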
Technique 3: Tracking Historical Information
Frequently, in gathering requirements, users will say that they only need current information. For example, users will want to keep track of what department a person works in but not where the person used to work. Be very careful about making a decision not to track historical information. This is not a simple decision, since tracking historical information in addition to single point-in-time information is expensive. On a recent project, with grave misgivings and against our better judgment, we accepted the users’ assurances that we didn’t need to track historical information. This was a standalone system, and the organization’s main personnel system did track employment history elsewhere. There were no problems until we attempted to build a report showing information about a project. A simplification of the relevant data model is shown in Figure 4.
Figure 4: Tracking Point-in-time Information
We needed to track which positions were valid for a department and assign those positions to persons. Projects in this system were assigned to departments. When creating reports, we found a position for the department called “Project Manager” assigned to the person who managed all projects. The problem arose when we needed to show who was the manager of a particular project. All that could be shown was the current project manager for the department. There was no way to track accountability back to the original project manager if that person had moved on to another project.
The lesson learned here is that you should only limit your model to track current states of information when you are absolutely sure that you will never need to track history information. Rather than taking the user point-in-time requirement at face value, we should have recognized that historical information might also be relevant in this case. Including historical information in the model initially is not without cost, but the cost of not tracking it is unmet user requirements and the additional costs of changes to the data model and applications later on.
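As a sketch of the difference (names are illustrative), the point-in-time model holds a single foreign key to the current department, while the historical model moves the relationship into an assignment class with effective dates:

```sql
-- Historical version: one row per assignment, with effective dates.
-- A NULL end_date marks the current assignment.
CREATE TABLE dept_assignment (
    person_id  NUMBER(10) NOT NULL,
    dept_id    NUMBER(10) NOT NULL,
    start_date DATE       NOT NULL,
    end_date   DATE,
    PRIMARY KEY (person_id, dept_id, start_date)
);

-- The point-in-time view the users originally asked for is still easy
-- to recover from the historical structure:
SELECT person_id, dept_id
  FROM dept_assignment
 WHERE end_date IS NULL;
```

The reverse is not true: once history has been discarded, no query can recover who managed a project when it was approved.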
Technique 4: Making a relationship into a class
There is often more than one relationship between two classes. When this occurs, you should determine whether the relationship is a good candidate for genericization. A good example is the relationship between Project and Department discussed in Technique 3. After talking to a number of different users, it was decided that there should be a 1-many relationship between Department and Project. When we designed prototype applications based upon this decision and showed them to all of the users in a JAD session, we learned that the users were talking about two different types of relationships. One was the financial control of the project; the other entailed administrative control of the project. For this organization, these came from two different departments. We then needed to change the model to support two relationships, as shown in Figure 5.
Figure 5: Multiple relationships between Classes
All was progressing smoothly until we discovered another relationship whereby a project was audited by a third department. At this point, we decided that genericization of the relationship between Project and Department was appropriate. We modeled the relationship as shown in Figure 6.
Figure 6: Genericizing a relationship
With this model, any relationship between Project and Department can be added without having to change the model again. In this way, we have made a relationship into a class. This technique should always be used when more than two relationships exist between classes. Even when two relationships are discovered, you should carefully assess whether there may be more.
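A minimal sketch of the genericized relationship of Figure 6 (names are our illustrative choices): each kind of control becomes a row in an intersection class, not a new foreign key column.

```sql
-- The kinds of relationship between Project and Department become data.
CREATE TABLE proj_dept_role (
    role_id   NUMBER(10)   NOT NULL PRIMARY KEY,
    role_name VARCHAR2(30) NOT NULL  -- e.g. FINANCIAL, ADMINISTRATIVE, AUDIT
);

-- The relationship itself, now a class: one row per project/department/role.
CREATE TABLE project_department (
    project_id NUMBER(10) NOT NULL,
    dept_id    NUMBER(10) NOT NULL,
    role_id    NUMBER(10) NOT NULL REFERENCES proj_dept_role,
    PRIMARY KEY (project_id, dept_id, role_id)
);
```

Discovering a fourth relationship (say, quality review) now requires only an INSERT into PROJ_DEPT_ROLE, not a model change.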
Technique 5: Creating Recursive Value List Classes
In many systems, it is common to need to go beyond simple typing of objects in order to implement some type of rollup structure. Often these structures are given different names and you end up with a structure similar to that shown in Figure 7.
Figure 7: Rollup Structure
What happens if users want to add “sub-sub-classes” at a future time? To avoid this situation, we use the structure shown in Figure 8 where we store, not only the types, but also the names of the structure types, in their own class.
Figure 8: Recursive Value List Class
For example, if we wanted Figure 8 to represent the same system as Figure 7, we would have three objects in the Project Structure Name class: Category, Class and Sub-Class. The objects from all three of these value list classes in Figure 7 would be stored in the Project Type class shown in Figure 8. Note that there are two business rules that we have not captured in the second diagram:
1. Sub-classes roll up to Classes which roll up to Categories.
2. Projects are only attached to Sub-Classes.
To support these two rules, we need to make changes to the diagram as shown in Figure 9.
Figure 9: Genericizing Rollup structure
The first change made is that we have implemented a linked list in the Project Structure Name table, as shown in the 0..1 to 0..1 relationship. Using that relationship, we can store the rule that “Subclasses roll up to Classes, which roll up to Categories.” The “Attach to Project” indicator allows us to set the flag to ‘Y’ on the Subclass row to indicate that Subclass types are the only ones that can be attached to projects.
This technique should be used any time you are trying to implement a rollup structure of objects. We tend to use this structure even if there are only two levels of rollups. It has been our experience that whenever there is a need to classify objects in a hierarchical structure, that hierarchy tends to grow. Just as with multiple relationships between objects, using a more generic structure provides the flexibility necessary to add more rollups to the system as they become needed in the future.
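The structure of Figure 9 might be sketched as follows (column names are our illustrative choices; the paper specifies only the classes and relationships):

```sql
-- The names of the levels (Category, Class, Sub-Class) are data, with a
-- linked list recording how levels roll up and an indicator marking which
-- level may be attached to projects.
CREATE TABLE project_structure_name (
    structure_id      NUMBER(10)   NOT NULL PRIMARY KEY,
    structure_name    VARCHAR2(30) NOT NULL,
    rolls_up_to       NUMBER(10)   REFERENCES project_structure_name,
    attach_to_proj_yn VARCHAR2(1)  NOT NULL
        CHECK (attach_to_proj_yn IN ('Y','N'))
);

-- The values at every level live in one recursive class; parent_id rolls
-- a Sub-Class value up to its Class value, and so on.
CREATE TABLE project_type (
    type_id      NUMBER(10)   NOT NULL PRIMARY KEY,
    type_name    VARCHAR2(50) NOT NULL,
    structure_id NUMBER(10)   NOT NULL REFERENCES project_structure_name,
    parent_id    NUMBER(10)   REFERENCES project_type
);
```

Adding a fourth rollup level is now an INSERT into PROJECT_STRUCTURE_NAME rather than a new class and table.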
Technique 6: Storing Table Structures as Data
Traditional notions of generic modeling usually involve the storing of the whole structure of a table as data. This allows modifications to these “virtual tables” to occur through applications. Thus, users can make significant changes to these virtual tables without the support of a DBA or developer. Such virtual table structures are complex, difficult to understand and code, and lead to terrible performance if the database is large. They also require expert developers and DBAs to maintain them properly. The advantage of these structures is that they support users directly making changes to the data model. This is potentially a great benefit. To clarify this idea, we will use a very simple example that would never be implemented generically. Figure 10 shows a basic genericization for employees. Figure 11 shows the details of the full generic model.
Figure 10: Simple Employee Genericization
Figure 11: Full Generic Model
In the full generic model shown in Figure 11, Datatypes refer to the Oracle datatypes (VARCHAR2, DATE, NUMBER, etc.) Data Domains list the valid domains allowed to be used in this context. This is an idea borrowed from CASE tools to help us restrict allowable datatypes in the database. Using the data domains is essential to help simplify the development of the flexible applications necessary to support this structure. The data domains can include not only data length but also formatting, default values and check constraints. A table can also include a reference to an external PL/SQL function to support complex validation. Employee Type consists of the list of all possible types of employees – in this case, two would be included: Salaried and Hourly. Employee Attributes consist of all possible attributes used in this structure. The Valid Attributes class stores the information about which attributes are appropriate for which type of employee. The values of these attributes are stored in the Attribute Values class. The following points assist in comparing the two examples:
· The names of the specialization classes are stored in Employee Type.
· The names of the attributes in the specialization classes are stored in Employee Attributes.
· The declarations of which attributes are used in the specialization classes are stored in Attribute Values.
Therefore, a specific object from the Salaried class in Figure 10 would have each of its attributes stored as an individual object in the Attributes Values class in Figure 11.
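The full generic model of Figure 11 might be sketched as follows (a minimal version; names beyond the class names in the paper are our illustrative choices, and the Datatype/Data Domain classes are reduced to a single code column):

```sql
-- The names of the specialization classes become rows here.
CREATE TABLE employee_type (
    emp_type_id NUMBER(10)   NOT NULL PRIMARY KEY,
    type_name   VARCHAR2(30) NOT NULL  -- SALARIED, HOURLY
);

-- The names of the attributes of those classes become rows here.
CREATE TABLE employee_attribute (
    attr_id   NUMBER(10)   NOT NULL PRIMARY KEY,
    attr_name VARCHAR2(30) NOT NULL,
    domain_cd VARCHAR2(30) NOT NULL   -- restricts datatype, length, format
);

-- Which attributes are appropriate for which type of employee.
CREATE TABLE valid_attribute (
    emp_type_id NUMBER(10) NOT NULL REFERENCES employee_type,
    attr_id     NUMBER(10) NOT NULL REFERENCES employee_attribute,
    PRIMARY KEY (emp_type_id, attr_id)
);

-- One row per attribute per employee: the data of the "virtual tables."
CREATE TABLE attribute_value (
    emp_id   NUMBER(10) NOT NULL,
    attr_id  NUMBER(10) NOT NULL REFERENCES employee_attribute,
    value_tx VARCHAR2(2000),
    PRIMARY KEY (emp_id, attr_id)
);
```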
If the number of attributes associated with different types of employees is large, there are many employees, or there is a high transaction volume, this structure can become unusable, given current technology. For example, if there were 1000 salaried and hourly employees with 50 specific attributes to be tracked, this would result in some 50,000 objects in the Attribute Value class.
This is still not an unworkable number. However, some thought must be given to the type of system for which this might be used and the potential performance impacts caused by the genericization. This type of structure is more likely to be used in the case of a pharmaceutical testing environment where scientists are given the ability to define their own tests and decide which attributes are appropriate for each test. If we assume that there are an average of 10 attributes tracked for each test type and 1000 different tests are run each day, over the course of a year, this would generate some 2,000,000 objects in the Attribute Value class. This number of records makes performance tuning an important issue. Prior to creating a generic structure such as this, an impact analysis should be performed. Creating a generic structure for a high-volume financial system with hundreds, if not thousands, of transactions per second would be impossible given the current technology.
There are very few modelers with the skills to create effective generic structures. Even fewer of these have sufficiently skilled development teams to create the applications necessary to support such an effort. Often explaining a generic model to a novice modeler or developer can be a challenging task. However, if you are in an environment with a highly skilled team and a performance burden that is not too heavy, judicious use of generic structures can greatly reduce the size and complexity of your data model.
Technique 7: Overloading Primary Data Classes
Rather than having separate classes for each Division/Department in an organization, we created a single class called Organization Unit. This is an excellent example of overloading a primary data class. Any time a system includes the notion of objects contained in other objects such as Department/Division/Group, it is easy to recognize that these should be combined, since the definitions of these classes are all the same. In our Organization Unit example, all different types of Org. Units can be defined as collections of people. It is also possible to combine classes that have quite dissimilar definitions if they have similar structures and/or overlapping information. In a retail system we worked on, we had Sales, Purchases, and Receipt of Goods modeled as shown in Figure 12.
Figure 12: Traditional Retail System Model
Note that all of these structures can be defined as a “movement or request for movement of some number of inventory items from one location to another.” Thus, we combined all of these into a single, generic structure called “Merchandise Movement” as shown in Figure 13. Not only did this greatly reduce the complexity of the model, but, later in the process, when we uncovered previously unidentified merchandise movements such as a Sales Return, the generic model was flexible enough to accommodate the new transaction type without any changes.
Figure 13: Genericization of Retail Merchandise Movement
The next level of genericization involves reuse of the same or similar structures many times throughout the model. For example, we may have multiple kinds of comments associated with different structures: Persons may have comments concerning their job records, whereas Projects have comments attached regarding their feasibility. In either case, it might be appropriate to have a generic remark. Traditionally, each different type of comment field is placed into the class itself. If comments of the same type can be applied by more than one person, they must be broken out into another detail class. These structures can be completely genericized so that they can be used seamlessly in any system. For example, using the techniques discussed earlier in this paper, we transform the traditional structure shown in Figure 14 into the generic structure shown in Figure 15.
Figure 14: Traditional Comment Field Structure
Figure 15: Genericized Comment Field Structure
Notice how in the second diagram, we have not even attached Person and Project to Comment. We simply create a method called “Comment” which will create a comment for that class in the Comment structure. The rules for determining what comment types are appropriate for which structures are stored in the Valid Comment Type class. We call this type of a structure a COB (Complex Object). In its implementation, we include not only the classes and their implemented tables with appropriate triggers, but also their associated forms and reports. This allows us to reuse the structures not only within applications as shown here but also across applications.
Implementing Generic Structures
Complex Object Development requires a highly skilled team, with extensive knowledge of PL/SQL. Stored Program Unit development (i.e. Database Triggers, packages, procedures and functions) and DBMS_SQL are just two of the critical knowledge sets that are required to develop Complex Objects.
Building Complex Objects (COBS)
Complex objects (COBS) are not difficult to build once you have completed the first one. However, to build them, you need to understand and be able to use DBMS_SQL.
These structures can be quite complex. For example, the Workflow complex object we constructed not only stores workflow acts such as approval/denial, but also stores rules about who performs such actions. It requires 10-15 tables and three multi-tab Forms modules to support it. COBS can also be very simple. For example, the Comments COB mentioned above involves little more than a single table.
The remainder of this paper will explain how to build the Comments COB. Once you see how one of these objects is built, you should be able to take the idea and build more structures on your own.
Before trying to build a COB, you should build the structure as a non-generic object. For example, in this exercise, you should first build the Comments structure in the traditional way and then modify it to be a COB. If you have done a lot of template building in the past, you will probably feel comfortable going directly to the COB environment.
The basic table structure of the simple Comment COB is the comment table and a reference table that stores information about the objects that can be commented on. If we only want to allow commenting by people with a specific role, we need a few more tables. We will discuss the structure of each of the tables in turn.
Step 1: Building the COB reference table
We want to store the tables that will allow commenting. We need to do this so we can write database triggers to enforce the necessary pseudo-referential integrity. To do this we build a reference table and store the commentable table names in it.
The reference table (COB_REF) is used for all COBS. For this simple structure, we need only one column to store the table name. However, we want to anticipate using more than one COB so we will add a second column to indicate which tables will allow commenting.
In order to allow commenting on projects and purchase orders (at both the master and detail level), the appropriate reference table would be as shown in Table 1:
TABLE NAME | DISPLAY NAME                                                               | COMMENT YN
PROJECT    | 'PROJECT: '||project_name                                                  | Y
PO         | 'PURCH ORDER: '||TO_CHAR(po_num)||'-'||TO_CHAR(approve_date,'MM/DD/YYYY')  | Y
PO_DTL     | 'PURCH ORDER DTL: '||TO_CHAR(po_num)||'-'||item_name(item_id)              | Y

Table 1: COB Reference Table
If we were to have more COBs in this application, we would simply add more indicator columns to the reference table. Of course, we could use a table that stores the names of the COBs and create an intersection table between table names and COBs. This is unnecessary because we have to make code changes to add a new COB anyway. Adding a column is painless and it makes it easier to see quickly which table can use which COB.
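A minimal sketch of the COB_REF table as described (the column names are our illustrative choices; the paper specifies only the columns' roles):

```sql
-- One row per table that participates in a COB. The display_name column
-- holds an expression yielding a readable name for a row of that table;
-- indicator columns (one per COB) flag which COBs the table may use.
CREATE TABLE cob_ref (
    table_name   VARCHAR2(30)  NOT NULL PRIMARY KEY,
    display_name VARCHAR2(200) NOT NULL,
    comment_yn   VARCHAR2(1)   NOT NULL CHECK (comment_yn IN ('Y','N'))
);
```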
Another useful feature of the Comments COB is a real name for the commented-on object, to facilitate reporting from the table by showing more than the UID of the row. One way to support this is to put an extra VARCHAR2(100) column in the comment table to store a character description of the object being commented on. For example, for Employees, use the employee name; for Inventory, use the name of the inventory item. However, if the name of the object changes, the linked comment field would not be automatically updated. To solve this problem, a display name column is added to the reference table instead.
The display name expression is usually just a column name. For the Employee table, it would be a concatenation of several columns with spaces or commas. It could also be a function of the table UID that returns a character string, possibly requiring a query of its own.
For example, if someone were commenting on a purchase order detail line, when referencing that comment, it might be worthwhile to see the parent purchase order number, date and vendor name. This might be a query involving a multi-table join.
In the above example we have shown the following display names:
· Project – the name of the project
· Purchase Order – the PO number followed by the date the purchase order was approved (e.g. 21432 – 02/12/1998)
· Purchase Order Detail – the PO number followed by the name of the item ordered. In this case the name of the item is returned through a function.
Step 2: Building the comment tables
In addition to a Comment table, we need a Comment Type table and an intersection table to validate which types of comments are appropriate for each table.
The COMMENT_TYPE table is a simple code description table.
VALID_COMMENT_TYPE is an intersection table between COB_REF and COMMENT_TYPE. Each row in this table represents an allowable type for a particular commentable table.
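Sketches of these two supporting tables (names beyond those given in the paper are our illustrative choices; COB_REF is assumed to be keyed by table name):

```sql
-- A simple code/description value list class.
CREATE TABLE comment_type (
    comment_type_id NUMBER(10)   NOT NULL PRIMARY KEY,
    description     VARCHAR2(80) NOT NULL
);

-- Intersection: which comment types are allowed on which commentable table.
CREATE TABLE valid_comment_type (
    table_name      VARCHAR2(30) NOT NULL REFERENCES cob_ref,
    comment_type_id NUMBER(10)   NOT NULL REFERENCES comment_type,
    PRIMARY KEY (table_name, comment_type_id)
);
```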
If you don’t want to go to the trouble of genericizing the code (this is a bit annoying), instead, you can have the user specify the comment type in the calling application which is then passed as an additional user parameter. However, if you build a significant quantity of applications, genericizing the code will pay off in the long run and will be well worth the time and effort.
A simple Comments table is one where a user can log comments about any topic such as budgets, purchase orders, manufacturing, etc. We can genericize all of these comments within a single table. The table structure is as follows:
CREATE TABLE COMMENTS (
COMMENT_ID NUMBER(10) NOT NULL PRIMARY KEY,
CREATE_DATE DATE NOT NULL,           -- Date and time the comment was logged.
PERSON VARCHAR2(40) NOT NULL,        -- Name of the person logging the comment.
COMMENT_TYPE NUMBER(10) NOT NULL,    -- Foreign key to the comment type table.
COMMENT_TX VARCHAR2(2000) NOT NULL,  -- The comment itself.
FK_TABLE VARCHAR2(30) NOT NULL,      -- Name of the table on which the comment is logged (a foreign key to the COB_REF table).
FK_LINK NUMBER(10) NOT NULL)         -- Foreign key link: the key of the row on which the comment was made. (This is an overloaded foreign key that points into the table cited in the FK_TABLE column.)
/
The first five columns are all of the information we need about the comment itself. This table must now be linked to the object that the comment refers to; we need to make the table generic. The traditional way to do this would be by using an arc foreign key, which would require one foreign key column for every object that can be commented on. The disadvantage of this approach is that if we want to place commenting capability in a new system, or extend it to new objects, data model and programmatic changes to whatever modules support comments would be required.
What is needed is a generic approach to link comments to any object. To support this, we added 2 columns to the comment table:
1. The first additional column consists of the name of the table we are pointing to (FK_TABLE).
2. The second is the UID for that table (FK_LINK).
This imposes a minor limitation; namely, it forces us to use single-column primary keys in anything we need to link a generic object to. You can make this work with multi-column primary keys; however, the coding is more difficult. In practice, we use system-generated numeric IDs in any table that may participate in generic structures. In Oracle8, it may be possible to use the OID, but we have not been able to test this yet.
Step 3: Enforcing referential integrity
The FK_TABLE column is enforced through simple referential integrity. The FK_LINK column, however, is enforced through a BEFORE_INSERT trigger. The trigger must enforce that the UID exists in the underlying table that we are trying to comment on.
This is where we need to use DBMS_SQL, because the table name is passed in the inserting row. The trigger does a lookup against the underlying table to enforce the referential integrity.
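A sketch of such a trigger follows. The dynamic query is required because the target table name arrives as data in FK_TABLE. The table and column names follow the COMMENTS structure described above, but the convention that every commentable table has a single numeric key column named <table>_ID is our assumption, not the paper's; a real implementation might store the key column name in COB_REF instead.

```sql
-- Pseudo-referential integrity for the overloaded FK_LINK column.
CREATE OR REPLACE TRIGGER comments_bi
BEFORE INSERT ON comments
FOR EACH ROW
DECLARE
    v_cursor INTEGER;
    v_count  NUMBER;
    v_rows   INTEGER;
BEGIN
    -- Build and run "SELECT COUNT(*) FROM <fk_table> WHERE <fk_table>_id = :id"
    v_cursor := DBMS_SQL.OPEN_CURSOR;
    DBMS_SQL.PARSE(v_cursor,
        'SELECT COUNT(*) FROM ' || :NEW.fk_table ||
        ' WHERE ' || :NEW.fk_table || '_ID = :id',
        DBMS_SQL.NATIVE);
    DBMS_SQL.BIND_VARIABLE(v_cursor, ':id', :NEW.fk_link);
    DBMS_SQL.DEFINE_COLUMN(v_cursor, 1, v_count);
    v_rows := DBMS_SQL.EXECUTE_AND_FETCH(v_cursor, TRUE);
    DBMS_SQL.COLUMN_VALUE(v_cursor, 1, v_count);
    DBMS_SQL.CLOSE_CURSOR(v_cursor);

    -- Reject the comment if the row it points to does not exist.
    IF v_count = 0 THEN
        RAISE_APPLICATION_ERROR(-20001,
            'No row ' || :NEW.fk_link || ' in table ' || :NEW.fk_table);
    END IF;
END;
/
```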
You also need to add a BEFORE_DELETE trigger to each of the commentable tables to make sure that comments are not orphaned in the comment tables when commented on objects are deleted.
To implement this you have a few alternatives:
1. Do nothing. Creation of the comments is only through the comments application anyway. The trigger is difficult to write and is not necessary. Orphaning comments is not a big problem since it won’t happen much. Disks are cheap anyway.
2. Write the triggers by hand. As long as you are not using many COBs and don’t have too many tables accessing them, this approach may be fine. However, if you are planning on writing a lot of systems, this becomes an arduous task.
3. Write utilities to create the triggers for you. If you are doing a lot of COB work, this will save you time in the long run.
Depending upon the system, each of these may be used. Surprisingly, the first alternative is not always as silly as it sounds. We did a system where few rows were ever deleted and all access to the COBs was through the applications. After much soul searching, we took the do-nothing approach.
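For the second alternative, a hand-written trigger on a commentable table might look like the following sketch (we assume a PROJECT table keyed by PROJECT_ID; a utility under the third alternative would generate one such trigger per row in COB_REF):

```sql
-- Remove a row's comments when the row itself is deleted, so that no
-- orphans are left in the COMMENTS table.
CREATE OR REPLACE TRIGGER project_bd
BEFORE DELETE ON project
FOR EACH ROW
BEGIN
    DELETE FROM comments
     WHERE fk_table = 'PROJECT'
       AND fk_link  = :OLD.project_id;
END;
/
```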
Costs of Generic Implementation
While generic implementation leads to flexible systems, it also requires a couple of sacrifices. The two most prevalent are the way this strategy changes the scope of data migration, and the performance implications that arise from combining similar but not homogeneous object sets into one common data structure.
Data Migration
The costs of data migration tend to go unforeseen until far too late in the project schedule. The most common flaw in data migration planning is that too few resources are invested in it. Many project schedules will list data migration as a single task in a development effort, although by the end of the project, it will be evident that data migration was indeed an entirely separate project that was highly dependent upon several of the deliverables of the development project.
Therefore, it is very important to plan ahead for data migration in any project. This point is magnified when genericization is the major thrust of the development effort. The key to this issue is that data mapping, or the process by which we gather the rules to transform one data set into a heterogeneous set of data structures, is heavily dependent upon the database design of the target system. As the target database design evolves, so must the data mappings.
Implementing generic data structures causes significantly more complex migration issues than traditional systems design. For example, a legacy system may contain two separate structures that store employee and customer information. We may choose to genericize these two structures into one structure called PERSON, where we indicate whether a given person is an employee, customer, or perhaps both.
Collisions
Collisions, or duplicate unique identifiers, typically do not occur in isolated data structures. However, when you extract the data from two independent data structures and merge them into one common data structure, all sorts of delightful things start to happen. The legacy system may have used an internal numbering scheme to uniquely identify customers and employees. There was no measure preventing the creation of both a customer and an employee, each being identified by ID=’1’. The reason this was not previously an issue was because the two data sets were stored separately.
Merging these two data sets, however, requires the selection of one of two alternatives. The first is to generate a new unique identifier for every record inserted into the generic structure (i.e. employee and customer records will receive new UIDs). This approach is only acceptable when the users of the system do not refer to these objects by their IDs.
The other alternative is to keep the legacy IDs and add an indicator field called PERSON_YN to the PERSON table, indicating whether each object in the PERSON class is a customer or an employee. The indicator then becomes part of the unique identifier, so a customer and an employee sharing the same legacy ID no longer collide.
Performance
The second major implementation issue arising from generic design is the performance impact of consolidating multiple data structures into fewer structures with substantially more data. Fewer tables mean more applications accessing the same tables for multiple purposes. This inevitably leads to significant increases in I/O against certain genericized tables.
Partitioning can help resolve this problem. Simply partition the table in the same manner as you would have created separate tables (i.e. two partitions – CUSTOMER and EMPLOYEE). In this way, your database requires one less table and you will not be sacrificing performance. Unfortunately, this option is not available if you plan to implement using Oracle7.
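A sketch of the idea, using Oracle8 range partitioning on a type code chosen so the partitions fall along the old table boundaries (all names and codes are our illustrative choices):

```sql
-- One generic PERSON table, physically split so customer and employee
-- rows live in separate partitions, as the separate legacy tables did.
CREATE TABLE person (
    person_id      NUMBER(10)   NOT NULL,
    person_type_cd NUMBER(1)    NOT NULL,  -- 1 = CUSTOMER, 2 = EMPLOYEE
    name           VARCHAR2(80) NOT NULL
)
PARTITION BY RANGE (person_type_cd) (
    PARTITION customer VALUES LESS THAN (2),
    PARTITION employee VALUES LESS THAN (MAXVALUE)
);
```

Queries that filter on PERSON_TYPE_CD touch only the relevant partition, so the I/O profile resembles that of the two separate legacy tables.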
Conclusion
Genericizing as much of a system as possible has both advantages and drawbacks. Determining what can and cannot be genericized and how best to model these structures requires a great deal of skill. Also, once modeled, implementing the generic structures takes an experienced development team. Performance considerations must always be taken into account as well. However, even with these negative aspects, the benefits of genericizing where appropriate are numerous. Genericized models are easier to read and provide the flexibility to cope with changes as they arise without making major changes to the data model or the applications based upon the generic structures. Systems designed generically will have a longer lifespan and ultimately reduce the time and resources needed to keep up with ever-changing data and business rules.