This series of blogs covers the design and implementation of solutions using SQL Server Integration Services (SSIS). Part 1 addressed design tasks, including whether or not to stage the data in an intermediate database or operational data store. Part 2 stepped through the setup of an SSIS project in SQL Server Data Tools (SSDT), defining project parameters to hold the names of the servers involved in the project. Those parameters were then paired with project-level connection managers, allowing developers and administrators to easily change the server names as the project progresses through development and testing before finally being deployed to production.
This third blog discusses options for implementing a template package, including defining the common tasks all packages in the project will require. If the assigned project requires numerous tables to be staged in an intermediate database, a template will allow the development team to focus on the ETL process for each table, rather than needing to also repeat standard tasks in each package.
As a developer and as a DBA, I have crossed paths with essentially two styles of SSIS solutions, though there are many variations on each. The first style is a single SSIS package with either multiple, semi-related data flow tasks or a single data flow task with several disconnected data streams. An example of this type of package is shown below, with data flow tasks to move data for Orders, Customers, Shipments and Products. As a developer, it’s quick and easy to build this type of package before moving on to the next tasks that use these tables.
From a DBA’s perspective, this style of package can be hard to deal with when there are problems. If the package fails, the DBA will need to scan through the SSIS execution log to find the error(s), and depending on where the error occurred, it may not be easy to track down which data flow failed. Additionally, in the example below, if either Orders or Shipments fails, Customers and/or Products will not run, leaving multiple tables out of sync.
The second approach is for a single package to work with a single data source and/or destination. For those who are familiar with object-oriented programming, think of a package as a class and apply the single responsibility principle to it: a package should do one thing and do that thing well. By that I mean, if an Orders table needs to be populated, have an Orders SSIS package. Its job is to pump data from one or more sources and gracefully add it to the destination Orders table. If there are multiple data sources, incorporate those sources into the package, either through multiple data flows or by combining data in a single data flow task. Follow up the data extraction, transformations and insertions by intentionally logging data about Orders that will be helpful when (not if) the package fails.
Doing all of that sounds fine and dandy for a few packages, but what if the project requires 20, 50 or 100 SSIS packages? This is where a template becomes very useful. Building functionality once, rather than repeating it in every package, saves time and provides a consistent feel and behavior across packages. With SQL Server Data Tools open and your new project created, add a new SSIS package to serve as the template. As with any package, don’t leave it named “Package,” “Package1,” etc. Give it a recognizable name, such as Template or StagingTemplate.
The control flow on the right is the beginning of the SSIS template. To define the primary area of work the package will perform, I like to have a sequence container that holds all of the tasks each package needs. In this template, it’s named ‘SEQC ETL Work,’ which is pretty straightforward as to what goes there. Within the container are three tasks, including the data flow task, DFT Forklift Data, which performs the data load process. Above and below the sequence container are Execute SQL Tasks to manage logging for this project.
As far as the control flow is concerned, once the properties have been set correctly and the red circles disappear, a pretty robust SSIS template is ready to use. At this point, the package is ready to be replicated. You don’t necessarily need to copy the file to SQL Server Data Tools’ template folder, though you can if you like. Instead, I will leave this package in the ETL solution. When a new package is needed, right-click on the template and choose Copy…
Next, click on the SSIS Packages folder within the project tree (or on the project itself) and choose Paste. Rename the package from StagingTemplate1.dtsx to the name of the table being populated, such as Orders.dtsx. Open the new package, Orders.dtsx in my case, locate the properties of the control flow and generate a new GUID, as shown below. This ensures your packages have unique GUIDs, decreasing the likelihood of deployment issues later on.
Intentional Logging
In Part 1 of this series, technical requirements were discussed, including whether logging/auditing beyond the built-in SSIS functionality is required by your organization or its external auditors. If the auditors do want information regarding the execution of SSIS packages, the following solution may or may not be sufficient; follow up with them and ask for specific requirements.
Even if auditors aren’t a part of your work life (and consider yourself lucky if that’s the case), build in a simple logging solution to aid in unit testing, system testing and production support. It will go a long way towards staying on the good side of your DBA and your testing team. In the staging database or a separate database, create a couple of logging tables. Below are examples of tables I’ve used in the past.
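The exact definitions will vary by shop; as a minimal sketch, with column names inferred from the stored procedure signatures shown later in this post, the two tables might look something like this:

-- One row per package execution; this is the primary logging table.
CREATE TABLE dbo.ETLExecutionLog
(
    ETLLogId             INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    PackageExecutionGuid UNIQUEIDENTIFIER  NOT NULL,  -- System::ExecutionInstanceGUID
    PackageName          VARCHAR(100)      NOT NULL,  -- System::PackageName
    StartTime            DATETIME          NOT NULL,  -- stamped by usp_StartLogEntry
    EndTime              DATETIME          NULL,      -- stamped by usp_CompleteLogEntry
    ExtractedRowCount    INT               NULL,
    InsertedRowCount     INT               NULL,
    UpdatedRowCount      INT               NULL,
    DeletedRowCount      INT               NULL
);

-- SSIS error messages and user-defined messages, tied back to the
-- execution through ETLLogId (intentionally, no foreign key constraint).
CREATE TABLE dbo.ETLMessages
(
    ETLMessageId   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ETLLogId       INT          NOT NULL,
    MessageType    VARCHAR(50)  NOT NULL,  -- e.g. 'Error', 'Warning', 'Info'
    MessageContent VARCHAR(MAX) NOT NULL,
    CreatedOn      DATETIME     NOT NULL   -- stamped by usp_CreateETLMessage
);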
The first table is the primary logging table, and I would build up the template for this table first, as it will hopefully see the most use. The second table can store both SSIS error messages and user-defined messages. For example, if the data flow contains a lookup whose value is always expected to be found, a failed lookup can redirect the offending row to a data flow destination that inserts a meaningful message about that row into the ETLMessages table. While a foreign key constraint has not been implemented between the two tables, they are related through the ETLLogId field. Should messages be added to the ETLMessages table, they can be tied back to the overall execution of the package.
Besides the tables, I would recommend creating stored procedures to manage the flow of data into these tables. Here are examples of what the stored procedure signatures could be:
dbo.usp_StartLogEntry @PackageExecutionGuid = ?, @PackageName = ?, @ETLLogId = ? OUT

dbo.usp_CompleteLogEntry @ETLLogId = ?, @ExtractedRowCount = ?, @InsertedRowCount = ?, @UpdatedRowCount = ?, @DeletedRowCount = ?

dbo.usp_CreateETLMessage @ETLLogId = ?, @MessageType = ?, @MessageContent = ?
Again, to simplify the package, even in small ways, I tend to let the stored procedures use GetDate() or GetUTCDate() to set the date and time fields, rather than passing those values from the package.
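As a sketch of that approach (assuming the table definitions above), usp_StartLogEntry might look like this:

CREATE PROCEDURE dbo.usp_StartLogEntry
    @PackageExecutionGuid UNIQUEIDENTIFIER,
    @PackageName          VARCHAR(100),
    @ETLLogId             INT OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    -- The procedure, not the package, stamps the start time.
    INSERT INTO dbo.ETLExecutionLog (PackageExecutionGuid, PackageName, StartTime)
    VALUES (@PackageExecutionGuid, @PackageName, GETDATE());

    -- Hand the new log id back to the package's output parameter.
    SET @ETLLogId = SCOPE_IDENTITY();
END;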
If you haven’t worked much with the Execute SQL Task, especially setting parameters, we’ll walk through the first task, which initializes the log. Opening the editor of the Execute SQL Task gives you multiple panes. In the General pane, there are two properties to set: Connection and SQLStatement. To configure the Connection property, choose the data source from the drop-down list that houses the ETLExecutionLog table. For the SQLStatement value, click the ellipsis (…) and enter the stored procedure call, starting with ‘exec’ and including the parameters. This is the same as executing a stored procedure in a query editor, such as Management Studio, except that in SSIS, placeholders for actual values are defined using the question mark, ‘?’. (Notice I used question marks in the signatures of the stored procedures above.)
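For this first task, the SQLStatement is simply the StartLogEntry call from the signatures above:

exec dbo.usp_StartLogEntry @PackageExecutionGuid = ?, @PackageName = ?, @ETLLogId = ? OUT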
On the Parameter Mapping pane, add three variables using the Add button. Based on the signature of the StartLogEntry stored procedure, the first parameter is the execution GUID, which in the SSIS package is System::ExecutionInstanceGUID; change the Data Type to GUID and set the Parameter Name to 0. The second parameter is the package’s name, so set the variable to System::PackageName; this field is a varchar and its Parameter Name is 1. Finally, the stored procedure returns an ETLLogId. This is an output parameter and its Parameter Name is 2. It requires a new variable to be defined in the template: instead of choosing a system variable from the Variable Name drop-down list, choose <New Variable…> from the top of the list and define the new variable, similar to the one shown below. This variable will be available throughout the SSIS package.
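To summarize the three mappings (the User::ETLLogId variable name and the LONG data type reflect my choices; yours may differ):

Variable Name                   Direction   Data Type   Parameter Name
System::ExecutionInstanceGUID   Input       GUID        0
System::PackageName             Input       VARCHAR     1
User::ETLLogId                  Output      LONG        2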
At this time, the package has one user-defined variable. When it comes time to finalize the ETLExecutionLog record, the data flow task will need to track row counts at various points in the data flow. The row counts can be stored in four new variables, such as extractedRowCount, insertedRowCount, updatedRowCount and deletedRowCount. The screenshot below shows the five variables. If you don’t see the Variables window (or its unpinned icon) in the designer, usually at the bottom of the screen, click on the SSIS designer window, go to the View menu, choose Other Windows, and then choose Variables from the submenu.
The last task in the template, SQL Finalized ETLLog, can now be configured, defining the connection and SQLStatement for the stored procedure to finalize the log record. In the Parameter Mapping pane, add five parameters and map them to the appropriate variables defined earlier.
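The SQLStatement follows the same pattern as before, with parameters 0 through 4 mapped to User::ETLLogId (this time as an input) and the four row count variables:

exec dbo.usp_CompleteLogEntry @ETLLogId = ?, @ExtractedRowCount = ?, @InsertedRowCount = ?, @UpdatedRowCount = ?, @DeletedRowCount = ?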
Voila! Your template now manages an execution log. When implementing subsequent packages, these steps should not need to be repeated. All of the variables and their mappings are defined.
Inside the Sequence Container
At this point, you may or may not want to define tasks within the Sequence Container control. In my case, I want to set the properties of the task that will truncate a side table, one of which will be created alongside each “usable” table. To configure this Execute SQL Task, the Connection is again chosen from the drop-down list. This time, instead of setting the SQLStatement property on the General tab, I’m going to do so from the Expressions tab.
Clicking the ellipsis (…) button next to the Expressions property reveals several properties that can be configured programmatically, including SqlStatementSource. Here, the truncate statement can be built. As the actual ODS packages are built, if the developers name each SSIS package after the destination table it populates, the following expression can be used. In my case, all of the ODS tables are going to be stored in the ODS schema.
“TRUNCATE TABLE ods.” + @[System::PackageName]
Click Evaluate Expression to see the expression evaluated to:
TRUNCATE TABLE ods.Template
…or whatever you named the template.
Another task is now complete in the template and doesn’t need to be fussed with again, unless a design decision is made later on requiring changes to this control.
At this point, adding definition to the Data Flow Task or to the bulk update statement may not be appropriate. Because table definitions will differ, those changes need to be made within every package. If you want to add additional tasks, such as the Data Flow Task and an Execute SQL Task, but leave them undefined in the template, that’s totally fine. To avoid build errors, go ahead and disable those unconfigured controls. When you start working on the actual package, re-enable them and you are all set!
In the next part of this series, we’ll venture into the Data Flow Task to populate a specific table in the operational data store.
Click here to read Part 4 – Add Functionality to Staging Package