ETL, short for Extract, Transform, Load, is the process of consolidating data from several disparate sources and integrating it into one central location. ETL came into prominent use in the 1970s, when organisations started storing data in multiple repositories and the need to integrate that data grew. ETL was embraced as the standard method for extracting data from the different sources and transforming it prior to loading it into the target destination.
ETL adoption continued into the late 1980s and 1990s with the rise of data warehouses, a distinct type of database that allows users to access integrated data from multiple source systems. Using ETL to feed data warehouses brings a number of benefits: data professionals can provide deep historical context for their enterprises, business users can more easily analyse and report on data relevant to their activities, and because ETL automates the processes that integrate data, less hand-written code or scripting is needed.
As data volumes grow, database administrators need to be mindful of performance issues facing their ETL processes. Sub-optimal database configurations and poor indexing strategies, for example, often cause problems that are not apparent at smaller data volumes, and poorly written ETL code can also struggle as load increases. Database administrators can follow the tips below to ensure their ETL processes are scalable from the start.
1. Build a high-level map of physical data sources and ETL processes. Doing so allows data professionals to visually anticipate the components, sequences, and sub-systems that the data model will require.
2. Populate your data model with test data at the earliest stage of development. This allows data professionals to exercise and test the system early and anticipate challenges before production volumes arrive.
3. Pull only the minimum data needed from source systems during batch processes. Preserve memory and processing cycles by using timestamps or sequence numbers to identify changed rows rather than re-extracting everything (see the incremental-extraction sketch after this list).
4. Avoid performing cached lookups through the ETL tool against large tables. Cached lookups can work well with smaller datasets but will likely struggle as tables grow.
5. Use the bulk-loading features of the ETL tool instead of row-by-row processing. Bulk loading is far better suited to large data volumes, while row-by-row inserts carry a heavy performance penalty (a bulk-load sketch follows the list).
6. Offload table joins to the database instead of performing them in the ETL software; the database engine will usually be more efficient (an example follows the list).
7. ETL code should never use cursors. Regularly scheduled ETL processes should use set-based logic instead (see the comparison after this list).
8. Code parallel threads into ETL scheduling logic. Removing unnecessary dependencies, such as lookups that run sequentially when they could run side by side, yields a real performance benefit; developers and system architects can work together to identify opportunities for parallelism (a sketch follows the list).
9. Include a step that rebuilds reporting-table indexes once the load is done. This optimises the indexes for future reports and downstream ETL processes (see the maintenance sketch after this list).
10. Turn on Auto Statistics Update on target databases so the optimiser always works with current statistics (also shown in the maintenance sketch below).
11. Evaluate the benefit of staging data in an intermediate temporary table instead of chaining too many joins in a single query (see the staging sketch after this list).
12. Use local (internal) variables within stored procedures to avoid parameter sniffing (an example follows the list).
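Several of these tips are easier to see in code. The short sketches that follow are illustrative only: the pyodbc driver, the DSN-style connection strings, and every table, column, and procedure name are assumptions made up for the examples, not details from a real system, and your own ETL tool or database may expose different mechanisms. Starting with tip 3, a minimal sketch of timestamp-driven incremental extraction might look like this, with etl_watermark as an assumed bookkeeping table:

```python
import pyodbc

conn = pyodbc.connect("DSN=SourceWarehouse")  # assumed DSN
cur = conn.cursor()

# Read the high-water mark left behind by the previous batch run.
cur.execute("SELECT last_loaded_at FROM etl_watermark WHERE source = ?", "orders")
last_loaded_at = cur.fetchone()[0]

# Pull only rows changed since the last run instead of re-extracting the table.
cur.execute(
    "SELECT order_id, customer_id, amount, last_modified "
    "FROM orders WHERE last_modified > ?",
    last_loaded_at,
)
changed_rows = cur.fetchall()

# ... transform and load changed_rows here ...

# Advance the watermark so the next run starts where this one stopped.
if changed_rows:
    new_mark = max(row.last_modified for row in changed_rows)
    cur.execute(
        "UPDATE etl_watermark SET last_loaded_at = ? WHERE source = ?",
        new_mark, "orders",
    )
conn.commit()
```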
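For tip 5, the contrast between row-by-row inserts and a bulk path can be sketched as follows; fast_executemany is a pyodbc-specific switch, and other drivers and ETL tools have their own bulk-load features.

```python
import pyodbc

conn = pyodbc.connect("DSN=TargetWarehouse")  # assumed DSN
cur = conn.cursor()

# A small stand-in for the transformed batch; a real run would have millions of rows.
rows = [(1001, 19.99), (1002, 5.00), (1003, 42.50)]

# Slow: one statement and one round trip per row.
# for order_id, amount in rows:
#     cur.execute("INSERT INTO sales_fact (order_id, amount) VALUES (?, ?)",
#                 order_id, amount)

# Faster: hand the whole batch to the driver and let it array-bind the inserts.
cur.fast_executemany = True  # pyodbc-specific bulk switch
cur.executemany("INSERT INTO sales_fact (order_id, amount) VALUES (?, ?)", rows)
conn.commit()
```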
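For tip 6, here is a rough before-and-after: pulling two tables into the ETL layer and joining them there, versus sending a single join to the database and letting the engine do the work.

```python
import pyodbc

conn = pyodbc.connect("DSN=SourceWarehouse")  # assumed DSN
cur = conn.cursor()

# ETL-side join: both full tables cross the network and the join happens in Python.
# customers = {c.customer_id: c.customer_name for c in
#              cur.execute("SELECT customer_id, customer_name FROM customers")}
# orders = cur.execute("SELECT order_id, customer_id, amount FROM orders").fetchall()
# joined = [(o.order_id, o.amount, customers[o.customer_id]) for o in orders]

# Database-side join: the engine uses its indexes and statistics, and only the
# joined result travels over the wire.
joined = cur.execute(
    """
    SELECT o.order_id, o.amount, c.customer_name
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    """
).fetchall()
```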
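For tip 7, the same idea as a cursor-versus-set comparison; the currency adjustment is only a stand-in transformation.

```python
import pyodbc

conn = pyodbc.connect("DSN=TargetWarehouse")  # assumed DSN
cur = conn.cursor()

# Cursor-style thinking: fetch every row, compute, and write it back one at a time.
# cur.execute("SELECT order_id, amount FROM sales_fact")
# for row in cur.fetchall():
#     cur.execute("UPDATE sales_fact SET amount_usd = ? WHERE order_id = ?",
#                 row.amount * 1.1, row.order_id)

# Set-based equivalent: one statement, and the engine applies it to every row at once.
cur.execute("UPDATE sales_fact SET amount_usd = amount * 1.1")
conn.commit()
```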
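For tip 8, a minimal sketch of running independent lookup loads in parallel with Python's standard ThreadPoolExecutor; the three load functions are placeholders for steps that have no dependency on one another.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder lookup loads; in a real job each would extract or cache a reference table.
def load_customers():
    return "customers loaded"

def load_products():
    return "products loaded"

def load_currencies():
    return "currencies loaded"

# Run the independent steps side by side instead of one after the other.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(step) for step in (load_customers, load_products, load_currencies)]
    results = [f.result() for f in futures]  # .result() re-raises any worker error

print(results)
```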
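Tips 9 and 10 are post-load housekeeping. The statements below are T-SQL (SQL Server) syntax and are an assumption about the target platform; other engines expose equivalent reindex and statistics settings.

```python
import pyodbc

conn = pyodbc.connect("DSN=TargetWarehouse", autocommit=True)  # assumed DSN
cur = conn.cursor()

# Tip 9: rebuild the reporting table's indexes once the load has finished,
# so subsequent reports and downstream ETL read from healthy indexes.
cur.execute("ALTER INDEX ALL ON dbo.sales_fact REBUILD")

# Tip 10: let the engine keep optimiser statistics current automatically.
cur.execute("ALTER DATABASE CURRENT SET AUTO_UPDATE_STATISTICS ON")
```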
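For tip 11, a staging sketch: materialise the expensive intermediate result into a temporary table first, then join the much smaller staged set to the remaining tables. The #temp-table syntax is T-SQL and the query shapes are invented for illustration.

```python
import pyodbc

conn = pyodbc.connect("DSN=TargetWarehouse")  # assumed DSN
cur = conn.cursor()

# Stage the expensive aggregate into a temp table first...
cur.execute(
    """
    SELECT o.customer_id, SUM(o.amount) AS total_amount
    INTO #customer_totals
    FROM orders AS o
    GROUP BY o.customer_id
    """
)

# ...then join the much smaller staged result to the remaining tables,
# instead of folding everything into one many-join query.
report_rows = cur.execute(
    """
    SELECT c.customer_name, t.total_amount
    FROM #customer_totals AS t
    JOIN customers AS c ON c.customer_id = t.customer_id
    """
).fetchall()
```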
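For tip 12, the classic workaround on SQL Server is to copy the incoming parameter into a local variable inside the procedure, so the plan is not compiled around whatever value happened to arrive on the first call. The procedure below is a hypothetical example, created from Python purely to show the pattern.

```python
import pyodbc

conn = pyodbc.connect("DSN=TargetWarehouse", autocommit=True)  # assumed DSN
cur = conn.cursor()

cur.execute(
    """
    CREATE OR ALTER PROCEDURE dbo.load_customer_orders
        @customer_id INT
    AS
    BEGIN
        -- Copy the parameter into a local variable so the plan is built for a
        -- typical value rather than 'sniffed' from the first call's parameter.
        DECLARE @local_customer_id INT = @customer_id;

        SELECT order_id, amount
        FROM dbo.orders
        WHERE customer_id = @local_customer_id;
    END
    """
)
```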
Proper planning during the design phase will prevent many challenges once the system is in production. Beginning with the end in mind is a best practice that can save many headaches in the future!