An Overview of Data Migration Methodology
Joseph R. Hudicka
Dulcian, Inc.
Most software implementation efforts are conducted to satisfy one of the following initiatives:
Each type of system may be replacing and/or enhancing functionality currently delivered by one or more legacy systems. Regardless of the type of project, some data conversion must take place. Complications arise when we attempt to take the information currently maintained by the legacy system and transform it to fit into the new system. We refer to the building of this bridge as data migration. Data migration is a common element among most system implementations. It can be performed once as with a legacy system redesign or may be an ongoing process as in a data warehouse. Even some legacy system migrations require ongoing data conversion if the incoming data requires continual cleansing.
One would think that any two systems that maintain the same sort of data must be doing very similar things and therefore should map from one to another with ease. This is hardly ever the reality of the situation. Legacy systems have historically proven to be far too lenient with respect to enforcing integrity at the atomic level of data. Fields that should be populated from a list of valid values, such as STATES, tend to require that a value be entered, but seldom validate the value entered by the user.
Another common problem has to do with the theoretical design differences between hierarchical and relational systems. Two of the cornerstones of hierarchical systems, namely de-normalization and redundant storage are strategies that make the relational purist cringe.
The most significant problem with data migration projects is that people really do not understand the complexity of data transformation until they have undergone a number of arduous migration projects.
For all of the reasons mentioned above, it is obvious to this author there is a desperate need for a sound, methodological approach by which organizations can tackle migration projects. Although there is no way to avoid unpleasant surprises, one can certainly be prepared to confront and resolve them.
The 7 Phases of Data Migration
The ideal data migration project plan should be broken down into 7 phases, which mirror the overall project phases. These phases are known as Strategy, Analysis, Design, Build, Test/Implement, Revise and Maintain. Each of these phases will be defined in further detail in the following sections.
Strategy
The strategy phase is usually the easiest part of the project planning process. In this phase, the focus of the overall project is determined. It is important to note that data migration projects do not happen independently. Rather, they are spawned from other development efforts such as the implementation of new OLTP’s and/or OLAP’s. This is probably where fundamental mistake # 1 generally occurs. The project manager(s) is clearly focused on determining the requirements that the new system must satisfy, and pays little or no attention to the data migration(s) that must occur. It is quite common to review a project plan for a new system and discover that data migration is merely listed as a single task item, if at all. In hindsight, those working under such project plans will discover that a separate project plan entirely devoted to the data migration effort should have been included.
Analysis
The Analysis phase of data migration should be scheduled to occur concurrently with the Analysis phase of the core project (i.e. OLTP, OLAP). Unfortunately, in most cases the same people are expected to perform both analyses. This can be done; but these people need a clearly defined set of tasks for which they are responsible.
For example, the aim of the Analysis phase in data migration projects is to identify the data sources that must be transported into the new system. Notice I stated data sources. Not the number of data files, but the number of systems from which data will be migrated. In the case of OLAP systems, multiple data sources are a common occurrence.
Data sources are not limited to actual data processing systems, either. Inevitably, one will find employees that maintain files on their own workstations that they use to accomplish tasks that cannot be performed by their existing systems. Word processing documents, spreadsheets, desktop RDBMS packages and raw text files are just a few examples of data sources you can expect to uncover in the Analysis phase.
The next important part of the Analysis phase involves getting acquainted with the actual data you plan to migrate. Remember, at this point of the project, you have no idea if the data is even of high enough quality to consider migrating. In order to get a better sense, it is helpful to obtain reports that can provide row counts as well as column counts and other statistics pertaining to your source data. This kind of information gives you a rough idea of just how much data there is to migrate. You may find that the overall cost of migration is prohibitive relative to the quantity of data that needs to be moved.
If this occurs, the most common solution is to migrate the source data to the new platform into data structures that are constructed identically to that of the source system. Doing so allows you to shut down the old system and bring up the new one with confidence and without losing historical data. The cost of developing a set of reports to access the historical data on the new platform tends to be far cheaper than the cost of migration in nearly every scenario.
Design
After you have decided upon the legacy data sources and have conducted thorough data analysis, you must begin the roster selection. This involves going through the list of data elements from each and every source data structure, and deciding whether or not each one must be migrated. You may begin this process while you are performing Analysis, but the Design phase is where the actual selections take place.
The design phase of data migration also happens in parallel with the Analysis phase of the core project. This is because each data element identified as a candidate for migration inevitably results in a change to the data model.
The Design phase is not intended to thoroughly identify the transformation rules by which historical data will be massaged into the new system; rather, it is essentially the act of making a checklist of the legacy data elements that we know must be migrated.
Build
The build phase of migration coincides with the Design phase of the core project. The first task of the Build phase is the generation of the new data structures and their creation within the database.
A popular misconception about data mapping is that it can be performed against logical data models. Unfortunately logical data models represent all attributes that arise through relationships without defining the attributes in the child entity. This essentially means that you cannot map any of the connections between data structures while you are working with the logical design. Therefore, is it necessary to perform data mapping to the physical data model.
With the physical data structures in place, you can begin the mapping process. Mapping is generally conducted by a team of at least three people per core business area (i.e. Purchasing, Inventory, Accounts Receivable, etc.). Of these three people, the first should be a business analyst, generally an end user possessing intimate knowledge of the historical data to be migrated. The second team member is usually a systems analyst with knowledge of both the source and target systems. The third person is a programmer/analyst that performs data research and develops migration routines based upon the mappings defined by the business analyst and the systems analyst, cooperatively.
Testing/Implementation
Since these two phases are practically inseparable, testing and implementation are often combined into one phase. Testing breaks down into two core subject areas: logical errors and physical errors. Physical errors are typically syntactical in nature and can be easily identified and resolved. Physical errors have nothing to do with the quality of the mapping effort. Rather, this level of testing is dealing with semantics of the scripting language used in the transformation effort.
Implementation is where we identify and resolve logical errors. The first step is to execute the mapping. Even if the mapping is completed successfully, we must still ask questions such as:
The truest test of data mapping is providing the populated target data structures to the users that assisted in the analysis and design of the core system. Invariably, the users will begin to identify scores of other historical data elements that must be migrated that were not apparent to them during the Analysis/Design sessions.
The fact is that data mapping often does not make sense to most people until
they can physically interact with the new, populated data structures.
Frequently, this is where the majority of transformation and mapping
requirements will be discovered. Most people simply do not realize they have
missed something until it is not there anymore. For this reason, it is critical
to unleash them upon the populated target data structures as soon as
possible!
The data migration Testing/Implementation phase must be reached as
soon as possible to ensure that it occurs prior to the design and building
phases of the core project. Otherwise, months of development effort can be lost
as each additional migration requirement slowly but surely wreaks havoc on the
data model which, in turn, requires substantive modifications to the
applications built upon the data model.
Revise
The Revise phase was hinted at in the Testing/Implementation phase, above. The revision phase is where cleanup is managed. Each and every data model modification, transformation rule adjustment, and script modification are essentially combined to form the Revise phase.
At this point, the question must be asked: Are both the logical and physical data model being maintained? If so, you now have doubled the administrative workload for the keeper of the data models. In many projects, the intention is to maintain continuity between the logical and physical designs; but the overwhelming volume of work tends to exceed its perceived value, the logical model is inevitably abandoned, resulting in inconsistent systems documentation.
CASE tools can be used to maintain the link between the logical and physical models, though this tends to require several reports to be developed in house. For example, you will want reports that indicate discrepancies between entities/tables and attributes/columns. These reports will indicate whether there is a mismatch between the number of entities versus tables and/or attributes versus columns, identify naming convention violations and seek out data definition discrepancies. It is sensible to select a CASE tool that provides an API to the meta-data, because you will most certainly need it!
Maintain
The maintain phase is where all of the mappings are validated and successfully implemented in a series of scripts that have been thoroughly tested. The maintain phase differs depending upon whether you are migrating to an OLTP or an OLAP system.
If the migration is to an OLTP system, you are working within the ‘one and done’ paradigm. You goal is to successfully migrate the legacy data into the new system, rendering the migration scripts useless when the migration has been accomplished.
If the migration is to an OLAP system, you will most likely be reloading the new system in timely intervals. As new information is recorded in the OLTP system, you will need to transfer it to the OLAP system. Script performance is a critical issue in OLAP migrations, while OLTP migrations pay little or no attention to script performance since they will only be run once.
Data Transformation Tool vs. Programmer
How does an organization decide whether to use a data transformation tool to manage their data migration effort, or simply employ a group of willing and able technical experts to accept the bulk of the workload?
Typically, the decision is made based upon the comparison of the monetary cost of a transformation tool and the cost of employing a few extra pairs of hands. Can you guess which option a manager that had data migration listed on the project plan as one task opts for more often?
This is not to say that the purchase of a transformation tool is the only solution. I do believe however, that many people do not truly understand all of the benefits that a data transformation tool can deliver.
Data Transformation Tool Features
A data transformation tool offers a central repository to support the storage and ongoing maintenance of data mappings, lending itself to teams of all sizes. The most commonly used alternative tends to be spreadsheets, which invariably fail when you need to map one source data element to one or more target elements. Spreadsheets typically prevent more than one person from making modifications at one time, resulting in a great deal of unnecessary administrative overhead. Spreadsheets are two-dimensional tools, and mapping is without question multi-dimensional to the nth degree.
A good data transformation tool offers the following features:
Flexible Reporting
Reporting is easily accomplished, and serves as a comprehensive communication device between the systems and business analysts.
Code Generation from Mapping Rules
Systems analysts can generate the migration scripts directly from the mappings they have defined, reducing the overall programming staff requirements.
Script Scheduling Utilities
Some tools provide far more sophisticated options such as script compilation and scheduling utilities. Scheduling utilities can cut the overall script execution time of a migration project by more than one third in almost every case, by processing scripts in parallel as opposed to sequentially.
Automatic Integrity Violation Detection
Another important and useful feature of a data transformation tool is the automatic detection of integrity violations. On one migration project, this author uncovered 4.5 million data integrity violations. It certainly was not humanly possible to manually identify all of these violations in months, let alone days. However, I was able to develop a utility to automatically detect these violations, and had the results within four days!
Programmer
The manual approach to data migration is still a valid method. In fact, there are many small data migration projects that have very little data to migrate, and as a result do not require a major investment in a data migration tool or large amounts of effort by one or more programmers.
Decision
Inevitably, a decision about whether to perform data migration manually or purchase a data transformation tool must be made. Though the cost of a transformation tool is indeed recognized up front, most project leads simply cannot foresee the complexity of the data migration aspects of a systems development project. Unfortunately, this is an observation that I have found to be true at several organizations around the country.
A typical data migration project, assuming one legacy data source and a target system of 400 tables requires a minimum three person team, as described earlier in the paper. With a team of this size, the data migration will likely take about 6 months. However, with the use of a transformation tool, the overall project timeline can be cut in half!
Conclusion
Data Migration is a necessary evil, but not an impossible one to conquer. The key is to prepare for it very early on, and continue to monitor data migration throughout the life of the project. Project timelines tend to become more rigid as time passes, so it really makes sense to meet migration head on. A devoted team with a clearly defined project plan from the inception of the project, armed with automated tools where applicable, is indeed the formula to success!