PL/SQL Collections: Processing Datasets Your Way
Introduction
Over the last few years, there have been many conference presentations about how different “new” features can solve all of your development problems at once. One of the most popular “silver bullets” was to use collections instead of basic datatypes. Despite a lot of widely available sources about HOW to use different collection types, there is not a lot of information about WHY this is a good idea. Therefore, this paper will not repeat pages of material from the Oracle documentation or any other available books (including my own).
About Collections
The following is a definition/description of what a PL/SQL collection is:
· Ordered group of elements all of the same type, addressed by a unique subscript
· Defined as datatypes, since all collections represent data
· A special mechanism to articulate and manipulate sets of data
The last bullet contains the key word that prevents collections from being used more widely than they are: SETS, as opposed to rows. Unfortunately, the majority of IT professionals are much more comfortable with “one-row-at-a-time” operations. To be fair, there is a list of good reasons to think this way:
· Legacy languages – for years, processing one row at a time was the only option
· Influence of the Internet – the majority of web-based applications are built so that one row of data usually represents one screen (especially edit screens)
· Human psychology – The more readable/understandable modules are often more popular than better-performing modules that require additional effort to understand, let alone debug or modify
Unfortunately, the volume of data to be processed by contemporary database systems grows faster than hardware costs can drop to compensate with better performance, especially if developers (in particular, server-side developers) are unwilling to bring their techniques into the 21st century. The logic is clear:
1. All data-centric applications are built around different SQL operations.
2. SQL is not a procedural language; it is a SET language.
3. The most efficient way of communicating with a SET language is by using another SET language; otherwise, the system incurs the extra cost of transforming SETs into rows and then rows back into SETs.
If server-side developers start thinking in SETs rather than in ROWS, the whole concept of collections becomes natural from the conceptual point of view.
Technical challenges
Before describing a real example of using collections, there are a few hidden traps to be aware of. If the system starts to operate on sets, then at any point in time much more memory must be allocated than is needed with the one-row-at-a-time approach. This leads to the following side effects:
· If you are unfortunate enough to still be using 32-bit architecture, there is a hard limit to the amount of memory that an Oracle instance can use (in most cases 2 GB, which can be tweaked up to 3 GB). As a result:
· Bringing tens of millions of rows into a collection will not just bring down the session, but may also bring down the whole server. Thus, the cost of a developer’s mistake increases significantly.
· For less drastic cases (since it is fairly difficult to kill the whole database), the problem can manifest itself as worsening performance, because the cost of operating on large data sets in memory is not zero. At some point, that extra cost can outweigh what is gained from SET operations.
· Because of such inefficient implementations, you may find the following information in many books and white papers: “Processing data in sets is usually faster than doing so one element at a time.” The word “usually” should be translated into a more accurate phrase: “unless you know what to do, results are not guaranteed” (or something between “caveat emptor” and “praemonitus praemunitus”).
· Even though 64-bit architecture makes memory restrictions much less of a concern, there is always a cost to adding memory to all servers in the development cycle. Considering recent cuts in many IT budgets, it might be challenging to justify any extra spending to management without an immediate ROI.
Processing SETs
The Oracle RDBMS currently supports the following kinds of collections:
1. VARRAYs…
A. Consist of an array of elements of the same type, addressed by sequential numbers as a subscript, with the maximum size of the array specified at creation
B. Can be defined both as PL/SQL types and as SQL user-defined types
C. Are most efficient when the size of the collection is well known; otherwise, the extra overhead of managing the array size makes this datatype less usable.
2. Nested tables…
A. Consist of an arbitrary group of elements of the same type, addressed by sequential numbers as a subscript, with no predefined number of elements
B. Can be defined both as PL/SQL types and as SQL user-defined types
C. Are the most efficient mechanism for communicating between SQL and PL/SQL. However, they can be heavily abused in conjunction with object-oriented features, where solutions look fine logically but have major performance problems.
3. Associative arrays (formerly known as PL/SQL tables)…
A. Consist of a collection of elements that use arbitrary numbers and strings for subscript values
B. Exist in PL/SQL only (they cannot be used as SQL user-defined types)
C. Are mostly used for performance optimization, and are usually the only kind of collection known to developers. (A minimal declaration sketch of all three kinds follows this list.)
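To make these distinctions concrete, here is a minimal declaration sketch of all three kinds (type names and values are illustrative only):
declare
    -- VARRAY: maximum size is fixed at declaration time
    type phone_va is varray(3) of VARCHAR2(20);
    -- nested table: no predefined number of elements
    type ename_nt is table of VARCHAR2(30);
    -- associative array: arbitrary subscripts, PL/SQL only
    type salary_aa is table of NUMBER index by VARCHAR2(50);
    v_phone_va phone_va := phone_va('555-0100');
    v_ename_nt ename_nt := ename_nt('KING','SCOTT');
    v_salary_aa salary_aa;
begin
    v_salary_aa('KING') := 5000;
    DBMS_OUTPUT.put_line(v_phone_va.count||' '||v_ename_nt.count||' '||v_salary_aa.count);
end;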
NOTE: This paper’s target audience is database professionals who maintain and/or extend existing systems. That is why no 11g features will be mentioned since there are currently very few Oracle 11g systems in production.
Bulk operations
The first question that comes to mind when researching SET operations is how to efficiently communicate between SQL data sets and collections: before you can manipulate sets, you need to have them.
Bulk Collect
There is a single command that reads a SQL set into a collection, but it has three flavors, as shown here:
select …
bulk collect into Collection
from Table;
update …
returning …
bulk collect into Collection;
fetch Cursor
bulk collect into Collection
[limit N];
The first two flavors are the most widely used and most often shown as examples. However, from my experience, the most critical one is the last, which includes an optional LIMIT clause. The biggest danger of using collections is running out of memory. Using “blind” BULK COLLECT commands leaves the code unprotected against drastic changes in data volume; the worst part is that the failure can happen years later and be very difficult to debug. Therefore, unless there is zero possibility of data growth, the most efficient way of processing large data sets is as follows:
declare
    type emp_nt is table of emp%rowtype;
    v_emp_nt emp_nt;
    cursor c_emp is select * from emp;
begin
    open c_emp;
    loop
        -- fetch at most 1000 rows per cycle to cap memory usage
        fetch c_emp bulk collect into v_emp_nt limit 1000;
        p_process_rows (v_emp_nt);
        exit when c_emp%NOTFOUND;
    end loop;
    close c_emp;
end;
Instead of hoping that the system will never run out of memory, the code should be structured so that, at any point in time, it holds only a limited number of objects. Performance-wise, this is more efficient than processing the whole set (even assuming sufficient memory), but the size of a “chunk” should be determined individually for every system. From the author’s experience, the range should be between 100 and 10,000 objects per cycle; however, there are too many variables (memory, CPU, workload, etc.) to create any formal algorithm.
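For completeness, here is a minimal sketch of the second flavor (DML with a RETURNING clause) against the same EMP table; the 10% raise for department 10 is an arbitrary illustration:
declare
    type number_nt is table of NUMBER;
    v_empNo_nt number_nt;
begin
    update emp
    set sal = sal*1.1
    where deptNo = 10
    returning empNo bulk collect into v_empNo_nt;
    -- the collection now holds the keys of all updated rows
    DBMS_OUTPUT.put_line(v_empNo_nt.count||' employees updated');
end;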
FORALL
The FORALL clause allows a somewhat reversed operation, namely firing SQL statements based on a collection. This command is misunderstood by many people. The following is a basic example:
declare
    type number_nt is table of NUMBER;
    v_deptNo_nt number_nt := number_nt(10,20);
begin
    forall i in v_deptNo_nt.first()..v_deptNo_nt.last()
        update emp set sal = sal+10
        where deptNo = v_deptNo_nt(i);
end;
On the surface, everything looks straightforward: fire an update for each department ID stored in the collection. But in many development environments there may not be any noticeable performance improvement in comparison to a regular FOR-loop with updates inside, so many developers stop using this feature because they miss the problem that FORALL is trying to solve. That problem is not the speed of SQL, but the context switch between SQL and PL/SQL that happens every time the code goes from a SQL operation to PL/SQL and vice versa; each switch means bringing the appropriate environments, classes, memory mechanisms, etc. into memory. A single context switch is nearly instantaneous, but executing a FOR-loop over 1,000 rows entails 2,000 switches, with a corresponding CPU cost. FORALL bundles all of the updates and crosses that boundary only once.
The more CPU-bound the system is, the more performance improvement can be gained by implementing FORALL. However, the majority of development environments do not have a real CPU load; only production ones do. The lesson to learn is that some effects of SET operations can only be observed at increased processing volumes.
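For comparison, here is a minimal sketch of the row-by-row version that FORALL replaces: it is functionally identical, but it performs a pair of context switches on every iteration instead of a single switch for the whole collection:
declare
    type number_nt is table of NUMBER;
    v_deptNo_nt number_nt := number_nt(10,20);
begin
    -- each iteration crosses the PL/SQL-to-SQL boundary separately
    for i in v_deptNo_nt.first()..v_deptNo_nt.last() loop
        update emp set sal = sal+10
        where deptNo = v_deptNo_nt(i);
    end loop;
end;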
TABLE clause
There is also a real “reverse” operation that allows direct transformation of several kinds of collections to SQL rowsets. The TABLE clause is shown here:
select [tab.]id_nr, [tab.]display_tx
from table(
[cast(]
f_getLov_nt
('emp',
'empno',
'ename||''-''||job',
'ename')
[as lov_nt)]
) [tab]
This code does the following:
· Takes as input a collection of type “nested table” or “VARRAY”. The collection must be a SQL user-defined datatype. In the example, it is a nested table defined as follows (a hypothetical implementation of f_getLov_nt itself is sketched after this list):
Create type lov_oty is object (id_nr NUMBER, display_tx VARCHAR2(256));
Create type lov_nt as table of lov_oty;
· Detects the attributes of the underlying datatype of the collection and turns them into SQL columns
· Even though in 10g Oracle can, most of the time, automatically detect the type of the passed collection, it is strongly recommended that you enforce the datatype using a CAST clause. Although CAST is usually used to change a datatype, in this case it explicitly defines what kind of structure should be expected.
· Creating an alias for the whole set (tab) and referencing attributes via the alias is also optional, but for a long time there was a recurring bug that caused an end-of-communication-channel error if attributes were referenced directly. Therefore, many people used the explicit notation “just in case.”
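The body of f_getLov_nt is not shown above; the following is a hypothetical sketch of how such a function could be implemented, assuming its four parameters are a table name, an ID column, a display expression, and an ORDER BY expression (a production version would have to validate these inputs against SQL injection):
create or replace function f_getLov_nt
    (i_table_tx VARCHAR2,
     i_id_tx VARCHAR2,
     i_display_tx VARCHAR2,
     i_order_tx VARCHAR2)
return lov_nt is
    v_out_nt lov_nt;
begin
    -- build the query dynamically and fetch the whole set at once
    execute immediate
        'select lov_oty('||i_id_tx||', '||i_display_tx||')'||
        ' from '||i_table_tx||
        ' order by '||i_order_tx
        bulk collect into v_out_nt;
    return v_out_nt;
end;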
The reason to use this transformation is very clear: it allows a complex data transformation to be 100% procedural instead of trying to put everything into pure SQL. Since the complexity of business logic very often leads to completely unreadable queries, the ability to take the whole process, split it into chunks, and put the results back together becomes a very handy way to resolve not only performance optimization problems, but overall data transformation issues.
SET operations
The next logical question is whether there are some SET operations in PL/SQL that could be similar to SET operations in SQL. And indeed, there are such operations as shown here:
· MULTISET UNION ~ equivalent to the SQL clause UNION ALL (duplicates are kept)
· MULTISET UNION DISTINCT ~ equivalent to the SQL clause UNION (duplicates are removed)
· MULTISET INTERSECT [DISTINCT] ~ equivalent to the SQL clause INTERSECT, with or without checking the uniqueness of objects in the final result set
· MULTISET EXCEPT [DISTINCT] ~ equivalent to the SQL clause MINUS, with or without checking the uniqueness of objects in the final result set
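A minimal sketch of these operators on locally declared nested tables (the values are chosen only to show the semantics):
declare
    type number_nt is table of NUMBER;
    v_a_nt number_nt := number_nt(1,2,2,3);
    v_b_nt number_nt := number_nt(2,3,4);
    v_out_nt number_nt;
begin
    v_out_nt := v_a_nt multiset union v_b_nt;          -- 1,2,2,3,2,3,4
    v_out_nt := v_a_nt multiset union distinct v_b_nt; -- 1,2,3,4
    v_out_nt := v_a_nt multiset intersect v_b_nt;      -- 2,3
    v_out_nt := v_a_nt multiset except v_b_nt;         -- 1,2
end;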
All of these commands work much like their SQL counterparts, but the problem they can really solve is very important, as illustrated by the following example. Once upon a time, there was a high-level module that would process a set of customers. These customers could belong to one or more groups:
· Attached to a personal desk
· Attached to a desk that reports to you
· Have special forms of the first type
· Have special forms of the second type, etc.
Depending upon the exact user’s privileges (there are a lot of people at headquarters), he/she should see some combination of these groups. Over the years, the lookup query became so complex that nobody was able to understand what was going on (not to mention making the query perform reasonably).
The solution was to break the query into chunks and create a collection structured as the desired output. Now the processing logic has two phases:
1. Query individual groups into local nested tables:
select … bulk collect into v_SR_nt from …;
select … bulk collect into v_Waiver_nt from …;
select … bulk collect into v_SF180_nt from …;
2. Depending upon privileges, add appropriate pieces:
if v_canSeeWaiver_yn = 'Y' then
    select v_SR_nt multiset union distinct v_Waiver_nt
    into v_out_nt
    from dual;
end if;
Now everything is 100% manageable and traceable. At any point in time, it is possible to log how many records were in each bucket and exactly how these buckets came together.
Memory Optimization
As already mentioned, the associative array type of collection is the best one for performance optimization. Unfortunately, a lot of developers only know about the oldest variation, INDEX BY BINARY_INTEGER, and completely ignore INDEX BY VARCHAR2(…). But this second variation (available since Oracle 9i Release 2) is much more interesting.
The problem with BINARY_INTEGER is its range restriction of -2^31..2^31-1 (-2,147,483,648..2,147,483,647). Very often, having just 10 digits is not enough if the subscript of the collection is not a real ID but some kind of composite key (and this does not even take into account the resolution of key collisions!). For years, many really good developers proposed very smart workarounds, but with INDEX BY VARCHAR2 the restriction becomes negligible (can anyone imagine a sane solution that needs a 32K composite key?).
Overall, associative arrays provide a much more efficient mechanism for storing/accessing intermediate data in large processing modules than any other comparable implementation. The following basic example illustrates a three-step logic (define the output / populate the output via direct access / present the final results):
declare
    type list_aa is table of VARCHAR2(2000) index by VARCHAR2(50);
    v_list_aa list_aa;
    cursor c_emp is
        select ename, deptno, to_char(hiredate,'q') q_nr from emp;
    v_key_tx VARCHAR2(256);
begin
    -- create output structure with all possible permutations of composite keys
    for r_d in (select deptno from dept order by 1) loop
        v_list_aa(r_d.deptno||'|1') := 'Q1 Dept#'||r_d.deptno||':';
        v_list_aa(r_d.deptno||'|2') := 'Q2 Dept#'||r_d.deptno||':';
        v_list_aa(r_d.deptno||'|3') := 'Q3 Dept#'||r_d.deptno||':';
        v_list_aa(r_d.deptno||'|4') := 'Q4 Dept#'||r_d.deptno||':';
    end loop;
    -- populate output structure via composite key
    for r_emp in c_emp loop
        v_list_aa(r_emp.deptno||'|'||r_emp.q_nr) :=
            v_list_aa(r_emp.deptno||'|'||r_emp.q_nr)||' '||r_emp.ename;
    end loop;
    -- present final results
    v_key_tx := v_list_aa.first;
    loop
        DBMS_OUTPUT.put_line(v_list_aa(v_key_tx));
        v_key_tx := v_list_aa.next(v_key_tx);
        exit when v_key_tx is null;
    end loop;
end;
From experience, ‘|’ is the most convenient separator of elements in the key; the critical point is to have such a separator at all, because it excludes any possibility of key collision.
One additional point to remember: if you use numbers as the first part of a key, it is strongly recommended that you enforce the same length for the numeric strings (padded with zeros); otherwise, the traversal order may not be what you expect.
For example, using the keys ‘2’, ‘10’, ‘2A’, ‘2a’:
· When spinning through the collection, key ‘10’ comes first, ‘2’ second, ‘2A’ third, and ‘2a’ fourth
· This happens because with INDEX BY VARCHAR2, Oracle uses textual comparison (one character at a time from the left), not numeric comparison, so:
· ‘10’<‘2’ (the comparison stops at ‘1’<‘2’)
· ‘2’<‘2A’ and ‘2’<‘2a’
· ‘2A’<‘2a’ – in ASCII-based character sets, uppercase letters sort before lowercase ones
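A minimal sketch of the zero-padding fix using LPAD (the key width of 5 is an arbitrary choice):
declare
    type list_aa is table of VARCHAR2(100) index by VARCHAR2(50);
    v_aa list_aa;
    v_key_tx VARCHAR2(50);
begin
    -- unpadded, '10|A' would sort before '2|A'; padding restores numeric order
    v_aa(lpad('2',5,'0')||'|A')  := 'second';
    v_aa(lpad('10',5,'0')||'|A') := 'tenth';
    v_key_tx := v_aa.first;      -- '00002|A' now correctly precedes '00010|A'
    loop
        DBMS_OUTPUT.put_line(v_key_tx||' => '||v_aa(v_key_tx));
        v_key_tx := v_aa.next(v_key_tx);
        exit when v_key_tx is null;
    end loop;
end;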
Conclusions
Although there is a lot more that could be covered about this topic, even a short review of the possibilities of using collections to improve the overall lifecycle of contemporary systems is useful. It should be clear that without effective utilization of sets, it is extremely hard to build efficient solutions (and even harder to expect those solutions to stay efficient over any meaningful period of time). As with any other advanced tool, you need to know how to use it properly. Indeed, praemonitus praemunitus (forewarned is forearmed)!
About the Author
Michael Rosenblum is a Development DBA at Dulcian, Inc. He is responsible for system tuning and application architecture. He supports Dulcian developers by writing complex PL/SQL routines and researching new features. Mr. Rosenblum is the co-author of PL/SQL for Dummies (Wiley Press, 2006). Michael is a frequent presenter at various regional and national Oracle user group conferences. In his native