Sunday, September 14, 2008

Making SAS programs run faster

An alternative title for this post could have been ways to avoid sorting your data. Sort procedures in SAS are notorious memory hogs, and can radically slow down your program. Therefore, whenever you're able to avoid sorting a data set, do so. It makes your SAS programs run much faster. Also try to eliminate as many data steps as you can. The following methods can be used to improve the efficiency of your programs:

Indexing

An index is an optional data file created for a SAS data set to provide direct access to specific observations within the data set. The index file consists of entries that are organized in a tree structure, and connected by pointers. Your program knows where to look, and avoids a wasting a lot of time skimming through unnecessary data. Indexes can be created in a data step or an SQL statement.

Used the following code to create an index in a data step:

data flights
(index = (origin /* Creating a simple index */
dteflt = (fltDate flightID/unique)); /* Creating a composite index */
infile 'UPS_FLIGHTS.dat';
input flightID $7. routeID $7. fltDate date9. origin $3. dest $3. distance 8.
;
run;


Later, after you're finished with that piece of code and you want to delete the index:

proc datasets library = work nolist;
modify flights;
index delete
origin;
index delete
dteflt;
quit;


Use the following code to create an index in a SQL statement:

proc sql;
create index origin
on flights (origin);
create unique index dteFlt
on flights (fltdate, flightID);
quit;


If you want to delete the index in a SQL statement:

proc sql;
drop index origin
from flights;
drop index dteFlt
from
flights;
quit;


Conditional Merges

Conditional merges are more complicated than conventional merges. In certain instances, however, conditional merges can work much faster than conventional merges.

A conditional merge matches observations in a secondary dataset with the observations in a primary dataset. The merge steps through the primary dataset one observation at a time, seeking “matching records” in the secondary dataset.

This matching is done within the context of an ‘Event Segment’. An event segment can be defined as the set of observations in the primary dataset which shares the same values with the observations belonging to the secondary dataset for a given number of variables (i.e., trade date, issue, firm, etc.). A set of rules within the event segment allows the records from the secondary dataset to merge with the records from the primary dataset.

In our example, we created five new variables to facilitate the conditional merge logic:

FirstZ – The FirstZ variable indicates the start of an event segment.

NextZ – The NextZ variable indicates where the pointer within a particular event segment.

Zloc The Zloc variable indicates whether the record for the secondary dataset is before, in or after a particular event segment.

Continue – The Continue variable tells the whether to process the data (i.e., execute event segment rules) within an event segment or move to the next event segment. Continue set to ‘1’ implies that the secondary record is still within the event segment, and the processing should continue. Continue set to ‘0’ implies the pointer for the secondary record is after the event segment, and processing should start for the next event segment.

Noobs - Number of records in the secondary dataset for a particular event segment.

In the conditional merge, the records in the secondary dataset are merged to that of the primary dataset by the set of common variables (which also determine the event segment). If the values belonging to any of these variables in the secondary dataset are less than that of the primary dataset, Zloc is given a value of ‘-1’. If the value is greater, Zloc is given a value of ‘1’. If the values are the same, Zloc is given a value of ‘0’.

Zloc equals ‘-1’ implies that the pointer for the secondary record is before the event segment. The value for NextZ is incremented until it matches the value for FirstZ (Start of the event segment). When NextZ exceeds the number of records in the secondary dataset for the event segment, Continue is reset from ‘1’ to ‘0’ prompting the pattern to evaluate the next event segment. FirstZ (i.e., start of the next event segment) is reset to correspond with NextZ.

If Zloc equals ‘1’, the pointer for the secondary record is below the end to the event segment. Continue is reset to ‘0’ prompting the pattern to evaluate the next event segment. NextZ is reset to correspond with FirstZ (i.e., start of the next event segment).

If Zloc equals ‘0’, the pointer for the secondary record is within the event segment. During this stage, the merge will execute any rules specified within the event segment. For example, if there is a rule to put a new value for a field (i.e., Change ‘N’ to ‘Y’ for a Flag) when the primary and secondary datasets variables (i.e., Sell Date = Buy Date) are the same, that rule will be executed within the event segment. The pointer for the secondary record is also incremented by 1, causing the pointer to move to the subsequent secondary record.

At the end of the conditional merge, we should have a dataset which has all the primary records aligned with the secondary records that satisfy the rules within a particular event segment.


The code of a condition merge is given below:

data merged_data (drop=firstZ continue Zloc [any columns you don’t need to output]);
retain nextZ 1 firstZ 1 Zloc break 0;

set primary_data;
by pdate pitem ptime pprice;

continue = 1;

do while (continue=1 and nextZ<=Zcount);
set secondary data
nobs=Zcount point=nextZ;

if sdate LT pdate then Zloc = -1;
else if sdate GT pdate then Zloc = 1;
else do;
if sitem LT pitem then Zloc = -1;
else if sitem GT pitem then Zloc = 1;
else do;
if stime LT pitem then Zloc = -1;
else if stime GT pitem then Zloc=1;
else Zloc = 0;
end;
end;

/* before window, step forward in secondary unless at end */
if Zloc = -1 then do;
if nextZ LT Zcount then nextZ + 1; else continue = 0;
firstZ = nextZ;
end;

/* after window, reset the cursor to start of window and do next primary */
if Zloc = 1 then do;
continue = 0;
nextZ = firstZ;
end;

/* in window, step thru, check logic, but leave the cursor as is */
if Zloc = 0 then do;

/*Put business logic here*/

If sprice GE pprice + (pprice*0.1)
if nextZ LT Zcount then nextZ + 1; else continue = 0;
end;
end;

if sprice GE (pprice + (pprice*0.1)) then do;
output;
end;

run;


Hash Objects

Hash objects are a new addition to SAS, and are supposed to speed up your program while making the most efficient use of memory. An unfortunate drawback of this technique is that has hash objects cannot be used in SQL statements, only in data steps. In the example below, we are trying to merge the variables for the participant data set with those in the weight table. The merged data set is called results. If we try to perform a conventional merge procedure, we would have create two additional data steps to sort the participants and weight data sets by name, thus causing the program to require more memory as well as increasing its run time.

data participants;
input name $ gender $1. treatment $;
datalines;
John M Placebo
Ronald M Drug_A
Barbara F Drug_B
Alice F Drug_C
;

data weight (drop = i)
input data DATE9. @;
do i to 4;
input name $ weight @;
output;
datalines;
05MAY2006 Barbara 125 Alice 130 Ronald 170 John 160
04JUN2006 Barbara 122 Alice 133 Ronald 168 John 155
;

data results;
length name treatment $ 8 gender $ 1;
if _n_ = 1 then do;
declare hash h (dataset: 'participants');
h.defineKey('name');
h.defineData('gender','treatment');
h.defineDone();
end;
Set weight;
if h.find() = 0 then
output;
run;

proc print data = results;
format date DATE9.;
var date name gender weight treatment;
run;


Format tables

Format tables are similar to hash objects, although slightly trickier to implement. This method requires an additional step, which is creating the format table. Subsequently, the values of the format table can be appended to a dataset based on a key variable that resides in both format table and dataset, similar to the hash object shown above. In the example below, we use the format procedure to add the names of the stock symbols to the indat dataset, and calculate the count of stock issues. Alternatively, we could have sorted both the indat dataset and a dataset containing the issue names, merged them together and performed a summary or means procedure to get the number of issues. Our method eliminated two sort procedures, one merge procedure and one summary/means procedure. When analyzing millions of records, time and memory space savings from eliminating four procedures can be significant.

First, read in your data set.

data indat;
input @1 name $3. @5 ask 5.2 @11 bid 5.2; cards;
IBM 16.25 16.12
AOL 17.00 17.06
AOL 16.25 13.02
IBM 16.25 16.05
IBM 18.25 17.02
FNM 18.00 18.06
FNM 18.25 17.02
FRE 18.25 17.02
;
run;


Second, create your format table.

proc format;
value $Symbol
'IBM'='IBM'
'AOL'='America Online'
'FNM'='Fannie Mae'
'FRE'='Freddie Mac';
run;


When performing your analyses, recall the values from your format table.

proc freq data=indat;
tables name /list;
format name $Symbol.;
title "Issue Count";
run;