The Complete Data Migration Methodology
Joseph R. Hudicka, Dulcian, Inc.
Overview
Most software implementation efforts are conducted to satisfy one of the following initiatives:
Each type of system may be replacing and/or enhancing functionality currently delivered by one or more legacy systems. This sort of systems evolution means that organizations are working to grow at or near the pace that the ever-changing world of technology dictates. Choosing a new technological direction is probably the easiest task in accomplishing the entire effort.
Complications arise when we attempt to bring together the information currently maintained by the legacy system(s) and transform it to fit into the new system. We refer to the building of this bridge between the old and new systems as data migration. Data migration is a common component of most systems development efforts.
One would think that any two systems that maintain the same sort of data must have been performing very similar tasks. Therefore, information from one system should map to the other with ease. This is hardly ever the reality of the situation. Legacy systems have historically proven to be far too lenient with respect to enforcing integrity at the atomic level of data. Fields that should be populated from a list of valid values, such as STATES, tend to require that a value be entered, but seldom validate the value entered by the user.
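A field like STATES can be profiled before migration begins by counting each distinct value and flagging those absent from the authoritative list. The sketch below illustrates the idea in Python against an in-memory SQLite table; the table name, column names, and sample values are hypothetical, invented for illustration only.

```python
import sqlite3

# Build a tiny in-memory stand-in for a legacy customer file.
# Table and column names here are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE legacy_customer (cust_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO legacy_customer VALUES (?, ?)",
    [(1, "NJ"), (2, "NY"), (3, "N.Y."), (4, "XX"), (5, "NJ")],
)

VALID_STATES = {"NJ", "NY", "PA"}  # the authoritative list of valid codes

# Profile the field: count each distinct value, flag those not in the list.
rows = conn.execute(
    "SELECT state, COUNT(*) FROM legacy_customer GROUP BY state ORDER BY state"
).fetchall()
invalid = [(value, n) for value, n in rows if value not in VALID_STATES]
print(invalid)  # [('N.Y.', 1), ('XX', 1)]
```

A report like this, run per unvalidated field, quantifies exactly how lenient the legacy system has been.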
Another common problem has to do with the theoretical design differences between hierarchical legacy systems and relational systems. Two of the cornerstones of hierarchical systems, namely de-normalization and redundant storage are strategies that make the relational purist cringe.
The most significant problem with data migration projects is that people really do not understand the complexity of data transformation until they have undergone a number of arduous migration projects.
Having made these points, it is obvious to this author that there is a desperate need for a sound methodological approach with which organizations can tackle migration projects. Although there is no way to avoid unpleasant surprises in data migrations, we can certainly be prepared to confront and resolve them.
The Phases of Data Migration
The ideal data migration project plan should be broken down into phases, which mirror the overall project development phases. These phases consist of:
Each of these phases will be defined in further detail in the following sections. Some may choose to argue about the necessity of dividing the migration process into the finite levels of the overall project phases. As you proceed through this paper, it will become apparent that each of these sub-divisions requires critical milestones to be achieved. These milestones mark strategic points along the project’s timeline. Any successes or failures in a given phase will significantly impact the outcome of the entire project.
Pre-Strategy
The Pre-Strategy phase is usually the clearest part of the migration project planning process. In this phase, the focus of the overall project is determined. It is important to note that data migration projects do not happen independently. Rather, they are spawned from other development efforts such as the implementation of new OLTP or OLAP systems. This is where the first fundamental mistake generally occurs. The project manager is clearly focused on determining the requirements that the new system must satisfy, and pays little or no attention to the data migration(s) that must occur. It is quite common to review a project plan for a new system and discover that data migration is merely listed as a single task item, if at all. Those working with such project plans discover, in hindsight, that a separate project plan devoted entirely to the data migration effort should have been included.
The focus of the Pre-Strategy phase is to determine the scope of the migration. In other words, answering the question, 'What are we trying to migrate?' This is the time to identify the number of legacy systems requiring migration and a count of their data structures. Interfaces are another critical factor that should be identified at this juncture. Interfaces are no different from other data sources, except that they may receive data from the new system as well as supply it.
At this point, we are not only identifying the number of data files, but also the number of different systems from which data will be migrated. In the case of OLAP systems, multiple data sources are a common occurrence. Data sources are not limited to actual data processing systems. Inevitably, one will find employees that maintain files on their own workstations that they use to accomplish tasks that cannot be supported by their existing systems. Word processing documents, spreadsheets, desktop RDBMS packages, and raw text files are just a few examples of data sources you can expect to uncover in the early phases of data migration.
Strategy
The Strategy phase of data migration should be scheduled to occur concurrently with the Strategy phase of the core project (i.e. OLTP, OLAP). Unfortunately, in most cases, the same people are expected to perform both analyses. This can be done, but these people need a clearly defined set of tasks for which they are responsible.
The focus of the Strategy Phase is to determine whether or not the objectives documented in the Pre-Strategy phase are achievable. This involves examining the actual data you plan to migrate. Remember, at this point in the project, you have no idea if the data is even of high enough quality to consider migrating. In order to get a better sense, it is helpful to obtain reports that can provide row counts, column counts and other statistics pertaining to your source data. This kind of information gives you a rough idea of just how much data there is to migrate. You may find that the overall cost of migration is prohibitive relative to the quantity of data that needs to be moved.
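The row-count and column-count statistics described above can be produced mechanically once the legacy data has been staged. Below is a minimal sketch in Python over SQLite; the staged table names and layouts are hypothetical, and a production version would run against the actual staging schema.

```python
import sqlite3

# Hypothetical staging database holding raw legacy extracts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE legacy_orders (order_no INTEGER, amount REAL);
    CREATE TABLE legacy_items  (item_no INTEGER, descr TEXT);
    INSERT INTO legacy_orders VALUES (1, 9.5), (2, 12.0);
    INSERT INTO legacy_items  VALUES (10, 'widget');
""")

def volume_report(conn):
    """Return (table, row_count, column_count) for every user table."""
    report = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        cols = len(conn.execute(f"SELECT * FROM {table} LIMIT 0").description)
        report.append((table, rows, cols))
    return report

print(volume_report(conn))
# [('legacy_items', 1, 2), ('legacy_orders', 2, 2)]
```

Totals like these are exactly the rough sizing inputs needed to judge whether the cost of migration is prohibitive relative to the quantity of data.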
If this occurs, the most common solution is to migrate the source data to the new platform into data structures that are constructed identically to that of the source system. Doing so allows you to shut down the old system and bring up the new one with confidence and without losing historical data. The cost of developing a set of reports to access the historical data on the new platform tends to be far cheaper than the cost of migration in nearly every scenario.
The milestone of the Strategy phase is the Data Migration Strategy Document, which outlines the intentions of the overall migration effort. In this document, we outline the reasons for our conclusions about whether data migration is worthwhile. The data quality research performed in this phase is still at a very high level, and in no way suggests that the team has gained a thorough understanding of the specific data cleansing issues they will face later in the project.
Pre-Analysis
The Pre-Strategy and Strategy phases were geared toward developing a precise definition of the migration project's goals. The Pre-Analysis phase expands on the information gathered during Strategy, allowing us to determine the procedures that will result in the migration plan.
By the Pre-Analysis phase, we already understand what needs to be done. Now, we must determine who will perform these tasks. We must also perform more comprehensive data cleansing research, extending our findings from the Strategy phase. The set of canned data quality reports built earlier remains useful at the Strategy level, but at this point in the Pre-Analysis phase, real developers must begin rolling up their sleeves and proceed to write and execute SQL statements against legacy data.
On the first day of the Pre-Analysis phase, someone should be appointed to port the legacy data from its current environment into an Oracle staging area. A staging area is merely a database account that contains tables that essentially replicate the data structures of the legacy system. One benefit of porting the uncleansed legacy data to the Oracle environment is the ability to have your legacy expert staff begin working with Oracle very early on in the project development stage, thus spreading the learning curve out as much as possible.
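Porting into a staging area can be as simple as creating tables that mirror the legacy record layout, with every column typed as text so the raw, uncleansed values load without rejection. A minimal sketch follows; the staging table name, layout, and sample extract are hypothetical, and the paper's actual target is an Oracle staging schema rather than SQLite.

```python
import csv, io, sqlite3

# A staging table mirrors the legacy record layout; every column is TEXT so
# the raw data loads without cleansing. Names and layout are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customer (cust_no TEXT, name TEXT, state TEXT)")

# Stand-in for a flat-file extract pulled off the legacy system.
legacy_extract = io.StringIO("0001,ACME CO,NJ\n0002,BETA LLC,N.Y.\n")
conn.executemany(
    "INSERT INTO stg_customer VALUES (?, ?, ?)", csv.reader(legacy_extract)
)

# The uncleansed values survive intact for the legacy experts to inspect.
print(conn.execute("SELECT * FROM stg_customer").fetchall())
# [('0001', 'ACME CO', 'NJ'), ('0002', 'BETA LLC', 'N.Y.')]
```

Keeping the staged copy byte-for-byte faithful to the source is what lets the legacy experts begin working in the new environment without losing any evidence of the data quality problems they must research.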
After you have decided upon the legacy data sources and have conducted thorough data research, you must begin the candidate data element selection process. This involves going through the list of data elements from each and every source data structure, and deciding whether or not each one must be migrated. You may begin this process while in the Analysis phase of the project.
Analysis
The Analysis phase of data migration also happens in parallel with the Analysis phase of the core project. This is because each data element identified as a candidate for migration inevitably results in a change to the data model.
The Analysis phase is not intended to thoroughly identify the transformation rules by which historical data will be massaged into the new system; rather, it involves making a checklist of the legacy data elements that we know must be migrated. This list of data elements will have three sources:
Pre-Design
In the later stages of the Pre-Design phase, attribution of the data model will have been completed, and we will be ready to generate the physical database design. The attribution process is fed by legacy data analysis, legacy report audits, and user feedback sessions. This is essentially the start of data mapping. Mapping must be performed against physical data structures because entities do not contain the foreign key attributes that their corresponding tables will contain. The mapping of foreign key columns is extremely important, and is easily overlooked when attempting to map against logical rather than physical structures.
Design
The Design phase is where the bulk of the actual mapping of legacy data elements to columns takes place. The physical data structures have been frozen, offering an ideal starting point for migration testing. Note that data migration is iterative – it does not happen in a single sitting.
The mapping portion of a data migration project can be expected to span the Design phase through Implementation. The reason for this is quite simple. The most important resources for validating the migration are the users of the new system. Unfortunately, they will be unable to grasp the comprehensiveness of the migration until they view the data through the new applications. We have concluded from experience that developing the new reports prior to new forms allows for more thorough validation of the migration earlier on in the project lifespan. For instance, if some sort of calculation was performed incorrectly by a migration script, reports will reflect this. A form typically displays a single master record at a time, whereas reports display several records per page, making them a better means of displaying the results of migration testing.
A popular misconception about data mapping is that it can be performed against logical data models. Unfortunately, logical data models represent attributes that arise through relationships without defining those attributes in the child entity. This essentially means that you cannot map any of the connections between data structures while working with the logical design. Therefore, it is necessary to perform data mapping against the physical data model.
With the physical data structures in place, you can begin the mapping process. Mapping is generally conducted by a team of at least three people per core business area (i.e. Purchasing, Inventory, Accounts Receivable, etc.). Of these three people, the first should be a business analyst, generally an end user possessing intimate knowledge of the historical data to be migrated. The second team member is usually a systems analyst with knowledge of both the source and target systems. The third person is a programmer/analyst that performs data research and develops migration routines based upon the mappings defined by the business analyst and the systems analyst, cooperatively.
Pre-Test/Test/Implementation
Since testing and implementation are practically inseparable, they are often combined into one phase. The Pre-Test phase deals with two core subject areas: logical errors and physical errors. Physical errors are typically syntactical in nature and can be easily identified and resolved. Physical errors have nothing to do with the quality of the mapping effort; rather, this level of testing deals with the semantics of the scripting language used in the transformation effort.
The Test phase is where we identify and resolve logical errors. The first step is to execute the mapping. Even if it completes successfully, we must still ask questions such as:
The truest test of data mapping is providing the populated target data structures to the users that assisted in the analysis and design of the core system. Invariably, the users will begin to identify scores of other historical data elements to be migrated that were not apparent to them during the Analysis/Design phases.
The fact is that data mapping often does not make sense to most people until they can physically interact with the new, populated data structures. Frequently, this is where the majority of transformation and mapping requirements will be discovered. Most people simply do not realize they have missed something until it is not there anymore. For this reason, it is critical to unleash users on the populated target data structures as soon as possible.
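Before users are unleashed on the populated structures, simple reconciliation checks can catch gross logical errors: do row counts and control totals on the target match the staged source? The sketch below illustrates one such check; the source and target table names and columns are hypothetical.

```python
import sqlite3

# Hypothetical staged source and migrated target for one mapped entity.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_invoice (inv_no TEXT, amount TEXT);
    CREATE TABLE invoices    (invoice_id INTEGER, amount REAL);
    INSERT INTO stg_invoice VALUES ('A1', '100.00'), ('A2', '250.50');
    INSERT INTO invoices    VALUES (1, 100.00), (2, 250.50);
""")

def reconcile(conn, source, target, src_amt, tgt_amt):
    """Compare row counts and control totals between source and target."""
    src_n, src_sum = conn.execute(
        f"SELECT COUNT(*), SUM(CAST({src_amt} AS REAL)) FROM {source}"
    ).fetchone()
    tgt_n, tgt_sum = conn.execute(
        f"SELECT COUNT(*), SUM({tgt_amt}) FROM {target}"
    ).fetchone()
    return {"rows_match": src_n == tgt_n, "totals_match": src_sum == tgt_sum}

print(reconcile(conn, "stg_invoice", "invoices", "amount", "amount"))
# {'rows_match': True, 'totals_match': True}
```

Checks like this do not replace user validation, but they ensure users spend their review time on genuine mapping questions rather than on mechanical load failures.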
The data migration Testing phases must be reached as soon as possible, to ensure that they occur prior to the Design and Build phases of the core project. Otherwise, months of development effort can be lost as each additional migration requirement slowly but surely wreaks havoc on the data model, which, in turn, requires substantive modifications to the applications.
The measures taken in the Test phases are executed as early as the Design phase. Testing is just as iterative as the migration project itself, in that every enhancement must pass the test plan.
Revise
The Revise phase is really a superset of the last four phases (Pre-Test, Test, Implement and Maintenance), which are iterative and do not take place independently. This is the point in the process where cleanup is managed. All of the data model modifications, transformation rule adjustments, and script modifications are essentially combined to form the Revise phase.
At this point, the question must be asked: Are both the logical and physical data model being maintained? If so, you now have doubled the administrative workload for the keeper of the data models. In many projects, the intention is to maintain continuity between the logical and physical designs. However, because the overwhelming volume of work tends to exceed its perceived value, the logical model is inevitably abandoned, resulting in inconsistent system documentation.
CASE tools can be used to maintain the link between the logical and physical models, though it is necessary for several reports to be developed in house. For example, you will want reports that indicate discrepancies between entities/tables and attributes/columns. These reports will indicate whether there is a mis-match between the number of entities versus tables and/or attributes versus columns, identify naming convention violations and seek out data definition discrepancies. Choose a CASE tool that provides an API to the meta-data, because it will definitely be needed.
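A logical-versus-physical discrepancy report of the kind described above can be sketched as a comparison of two inventories pulled from the CASE repository. The entity and table names below, and the "entity name plus S" naming convention, are hypothetical assumptions for illustration; a real version would read the repository through the tool's API.

```python
# Hypothetical extracts from a CASE repository: the logical model's
# entity -> attributes map and the physical model's table -> columns map.
logical = {
    "CUSTOMER": {"CUST_ID", "NAME", "STATE"},
    "ORDER":    {"ORDER_ID", "ORDER_DATE"},
}
physical = {
    "CUSTOMERS": {"CUST_ID", "NAME", "STATE"},
    "ORDERS":    {"ORDER_ID", "ORDER_DATE", "CUST_ID"},  # FK column
}

# Naming convention assumed here: table name = entity name + 'S'.
def discrepancies(logical, physical):
    report = []
    for entity, attrs in sorted(logical.items()):
        table = entity + "S"
        if table not in physical:
            report.append((entity, "missing table"))
            continue
        extra = sorted(physical[table] - attrs)   # e.g. FK columns
        missing = sorted(attrs - physical[table])
        if extra:
            report.append((entity, f"extra columns: {extra}"))
        if missing:
            report.append((entity, f"missing columns: {missing}"))
    return report

print(discrepancies(logical, physical))
# [('ORDER', "extra columns: ['CUST_ID']")]
```

Note how the one "discrepancy" found is the foreign key column, which legitimately exists only in the physical model; the report must distinguish expected differences like this from genuine drift.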
Maintenance
The Maintenance phase is where all of the mappings are validated and successfully implemented in a series of scripts that have been thoroughly tested. The Maintenance phase differs depending upon whether you are migrating to an OLTP or an OLAP system.
If the migration is to an OLTP system, you are working within the 'one and done' paradigm. Your goal is to successfully migrate the legacy data into the new system, rendering the migration scripts useless once the migration has been accomplished.
If the migration is to an OLAP system, you will most likely be reloading the new system in timely intervals. As new information is recorded in the OLTP system, you will need to transfer it to the OLAP system. Script performance is a critical issue in OLAP migrations, while OLTP migrations pay little or no attention to script performance since they will only be run once.
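The periodic OLAP reload can be kept fast by moving only rows past a high-water mark recorded at the previous load, rather than reloading everything. A minimal sketch follows; the source and target table names, and the use of a date column as the mark, are hypothetical assumptions.

```python
import sqlite3

# Hypothetical OLTP source and OLAP target; the OLAP side tracks the
# high-water mark of the last load so each run moves only new rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE oltp_sales (sale_id INTEGER, sale_date TEXT, amount REAL);
    CREATE TABLE olap_sales (sale_id INTEGER, sale_date TEXT, amount REAL);
    INSERT INTO oltp_sales VALUES
        (1, '1999-01-10', 50.0), (2, '1999-02-03', 75.0),
        (3, '1999-02-20', 20.0);
    INSERT INTO olap_sales VALUES (1, '1999-01-10', 50.0);
""")

def incremental_load(conn, since):
    """Copy only rows recorded after the previous load's high-water mark."""
    conn.execute(
        "INSERT INTO olap_sales SELECT * FROM oltp_sales WHERE sale_date > ?",
        (since,),
    )
    return conn.execute("SELECT COUNT(*) FROM olap_sales").fetchone()[0]

print(incremental_load(conn, "1999-01-10"))  # 3 rows now in the OLAP table
```

This is why script performance dominates OLAP migrations: the same transformation logic runs at every interval, so any inefficiency is paid again and again rather than once.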
Data Transformation Tool vs. Programmer
How does an organization decide whether to use a data transformation tool to manage its data migration effort, or simply employ a group of willing and able technical experts to accept the bulk of the workload?
Typically, the decision is based upon a comparison of the cost of a transformation tool with the cost of employing a few extra pairs of hands. Can you guess which option the manager who listed data migration as a single task on the project plan opts for more often?
This is not to say that the purchase of a transformation tool is the only solution. However, I do believe that many people do not truly understand all of the benefits that a data transformation tool can deliver.
Data Transformation Tool Features
A data transformation tool offers a central repository to support the storage and ongoing maintenance of data mappings, lending itself to teams of all sizes. The most commonly used alternative tends to be the spreadsheet, which invariably fails when you need to map one source data element to one or more target elements. Spreadsheets typically prevent more than one person from making modifications at a time, resulting in a great deal of unnecessary administrative overhead. Spreadsheets are two-dimensional tools, and mapping is without question multi-dimensional to the nth degree.
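The repository's advantage over a spreadsheet grid can be seen by modeling each mapping as a row rather than a cell: one source element may then map to any number of targets, and either side can be queried. The element names and transformation rules below are hypothetical, invented for illustration.

```python
# A minimal mapping repository as rows rather than a spreadsheet grid:
# each row is one (source element, target element, rule) triple, so a
# single source field may map to several targets. Names are hypothetical.
mappings = [
    ("LEGACY.CUST.NAME", "CUSTOMERS.FIRST_NAME", "substring before space"),
    ("LEGACY.CUST.NAME", "CUSTOMERS.LAST_NAME",  "substring after space"),
    ("LEGACY.CUST.ST",   "CUSTOMERS.STATE",      "validate against STATES"),
]

# Queries that a spreadsheet makes painful fall out naturally.
def targets_of(source):
    return [tgt for src, tgt, _rule in mappings if src == source]

print(targets_of("LEGACY.CUST.NAME"))
# ['CUSTOMERS.FIRST_NAME', 'CUSTOMERS.LAST_NAME']
```

A real transformation tool stores these triples in shared database tables, which is also what removes the spreadsheet's single-editor bottleneck.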
A good data transformation tool should include the following features:
Manual data migration
The manual approach to data migration is still a valid method. In fact, there are many small data migration projects that have very little data to migrate, and as a result do not require a major investment in a data migration tool or large amounts of effort by one or more programmers.
Decision factors in selecting a data transformation tool
Inevitably, a decision about whether to perform data migration manually or purchase a data transformation tool must be made. Though the cost of a transformation tool is indeed recognized up front, most project leaders simply cannot foresee the complexity of the data migration aspects of a systems development project. This is an observation that I have found to be true at several organizations around the country.
A typical data migration project, assuming one legacy data source and a target system of 400 tables requires a minimum three person team, as described earlier in the paper. With a team of this size, the data migration will likely take about 6 months. However, with the use of a transformation tool, the overall project timeline can be cut in half.
Conclusion
Data migration is a necessary evil, but not an impossible one to conquer. The key is to prepare for it very early on and monitor it carefully throughout the process. Project timelines tend to become more rigid as time passes, so it really makes sense to meet migration head on. A dedicated team, armed with a clearly defined project plan from project inception and automated tools where applicable, is indeed the formula for success.
About the Author
Joseph R. Hudicka is the CEO of Dulcian, Inc. (www.dulcian.com). He is a member of the Oracle8 Customer Advisory Council, providing insight regarding future enhancements to the Oracle8 Server technologies. He is a systems design specialist for Decision Support (OLAP) and Online Transaction Processing (OLTP) systems with specific business knowledge in the finance, health care, pharmaceutical, and manufacturing areas. Mr. Hudicka is the chief architect of DataMIG™ (Patent Pending), a complete data migration management tool suite. He delivers presentations regularly at major Oracle conferences and local Oracle user groups. Joe can be contacted at jhudicka@dulcian.com.
© 1999 Dulcian, Inc.