In recent years data volumes have increased dramatically, creating major challenges for traditional analytics platforms in terms of storage, management and cost. This trend will continue and accelerate, requiring companies to find better and more cost-effective ways to manage their data workloads.
Data Lakes vs Data Hubs
The initial answer to the massive increase in data volumes was the Data Lake. The main aim was to use cheap storage to hold the raw data and to extract, transform and load (ETL) only the data required for immediate needs. This reduces cost because only the data that is actually required moves on to the centralised data warehouse or to higher aggregation levels within the lake.
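To make the idea concrete, a minimal sketch of this selective ETL pattern might look like the following, using pandas as an example toolkit; the file paths and column names are purely illustrative assumptions, not a prescribed layout:

```python
import pandas as pd

# The lake holds raw event files cheaply; only the slice needed for
# reporting is transformed and loaded onwards (paths are illustrative).
raw = pd.read_csv("lake/raw/events_2024.csv")

# Extract only what the immediate reporting need requires
# (assumes event_date is stored as an ISO-formatted string).
recent = raw[raw["event_date"] >= "2024-01-01"]

# Transform: aggregate to the level the warehouse actually consumes
daily_totals = (
    recent.groupby(["event_date", "product"], as_index=False)["revenue"].sum()
)

# Load: only this small aggregate moves to the warehouse / higher lake layer
daily_totals.to_parquet("warehouse/daily_revenue.parquet", index=False)
```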
The problem is that there is no clear consensus on what a Data Lake is or how it should be architected. Most of the focus is on storing raw data rather than on usage, servicing, security and privacy. With this imbalance, many are questioning the value of a data lake.
As a result, the Data Lake concept is likely to become less popular and to be replaced by well-architected Data Hubs. The Data Hub is the natural progression from the Data Lake: it not only focuses on storage but also provides layers of fast, structured data in a cloud environment and a servicing layer that includes self-service as well as API access.
The focus for large-scale data users will shift to how the hub can deliver the various data sources and insights effectively to the business and to customers in near real time. This will create the much needed value proposition that was originally envisaged for the Lake.
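As an illustration of what the servicing layer's API access could look like, here is a minimal sketch using FastAPI; the endpoint, the in-memory data and the field names are assumptions for the example, not a prescribed design:

```python
from fastapi import FastAPI

app = FastAPI(title="Data Hub servicing layer (sketch)")

# Stand-in for the hub's fast, structured layer; in practice this would be
# a query against a cloud warehouse or a curated table inside the hub.
DAILY_REVENUE = [
    {"event_date": "2024-01-01", "product": "widget", "revenue": 1250.0},
    {"event_date": "2024-01-01", "product": "gadget", "revenue": 430.0},
]

@app.get("/insights/daily-revenue")
def daily_revenue(product: str | None = None):
    """Self-service API access to a curated insight (illustrative endpoint)."""
    if product is None:
        return DAILY_REVENUE
    return [row for row in DAILY_REVENUE if row["product"] == product]
```

The same curated data could equally be exposed through self-service BI tools; the point is that the hub serves consumers directly rather than simply storing raw files.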
De-Centralisation of Data
Cloud storage has made it easy to store vast amounts of data at relatively low cost. This is creating many pockets of raw data across the organisation as individual teams accumulate their own stores.
The question that every organisation will have to ask itself is whether it wants to hold on to a centralised analytics platform in the form of a data warehouse or whether a decentralised approach is better.
The advantage of the decentralised approach is that the data is stored and maintained by its owners, who can best manage quality, retention, privacy and security. However, to allow synergies across the organisation, a centralised framework for governance, data discovery and data servicing must be in place.
The centrepiece of this framework will be an Information Catalogue that integrates the data on a semantic level and provides tools that allow Data Scientists and business users to access data across the organisation. Analytics sandboxes will be required that can provide masked data for analytics modelling and pattern development, as sketched below.
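One simple way to picture the masked-sandbox idea is a one-way hash over direct identifiers before data is handed to analysts. The sketch below uses pandas and hypothetical column names; it keeps identifier columns joinable and countable while hiding the raw values:

```python
import hashlib
import pandas as pd

def mask_column(series: pd.Series, salt: str = "sandbox-salt") -> pd.Series:
    """Replace direct identifiers with a truncated one-way hash so analysts
    can still join and count on the column without seeing the raw value."""
    return series.astype(str).apply(
        lambda v: hashlib.sha256((salt + v).encode()).hexdigest()[:16]
    )

# Hypothetical customer extract provided to an analytics sandbox
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "lifetime_value": [1200.0, 80.5, 430.0],
})

# Identifiers are masked; behavioural measures stay usable for modelling
sandbox_view = customers.assign(
    customer_id=mask_column(customers["customer_id"]),
    email=mask_column(customers["email"]),
)
```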
Data Governance
The requirement for well-designed Data Governance frameworks will continue to grow, with decentralised Data Hubs and huge data volumes on one side and increasingly demanding privacy and security regulations on the other. As a result, it will be critical for organisations to invest in new organisational structures with clearly defined accountabilities for data as an asset.
Spearheading the changes in Data Governance are roles such as the Chief Data Officer (CDO), who oversees a number of Data Stewards (also known as Data Curators) in their domain. The stewards ensure the quality, management and discoverability of decentralised data within the various business domains.
In a “best in class” scenario, data is stored and managed in a decentralised framework across the organisation while governance remains centralised. It will then be critical that the decentralised data sources can be embedded into a centralised catalogue. New tools are coming onto the market that help discover data across the organisation and identify synergies automatically.
Tools and processes must also be in place that allow staff to create productionised data pipelines that feed from different decentralised data sources to deliver business and customer insights, along the lines sketched below.
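A minimal sketch of such a pipeline might read from domain-owned sources and register its output with the central catalogue. Everything below (the paths, the column names, the in-memory catalogue) is hypothetical and only meant to illustrate the pattern:

```python
import pandas as pd

# Hypothetical decentralised sources, each owned and maintained by a domain team
SOURCES = {
    "sales":   "domains/sales/orders.parquet",
    "support": "domains/support/tickets.parquet",
}

# A very small stand-in for a centralised information catalogue
catalogue: list[dict] = []

def run_pipeline() -> pd.DataFrame:
    """Join domain-owned data into a single insight and register the output."""
    orders = pd.read_parquet(SOURCES["sales"])
    tickets = pd.read_parquet(SOURCES["support"])

    insight = orders.merge(tickets, on="customer_id", how="left")
    insight.to_parquet("hub/insights/orders_with_support.parquet", index=False)

    # Record lineage so the centralised governance layer can discover the output
    catalogue.append({
        "dataset": "orders_with_support",
        "inputs": list(SOURCES.values()),
        "owner": "analytics",
    })
    return insight
```

In practice the catalogue entry would go to a dedicated metadata service rather than a Python list, but the principle is the same: every productionised pipeline declares its inputs, outputs and owner.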
Cloud Migration
The trend of moving analytics platforms from on-premises to the cloud will continue. Cloud offerings provide more flexible and often more cost-effective storage and compute, with a number of advantages. In particular, it is more efficient to bring the processing to the data than to bring the data to the processing. Serverless compute within a cloud environment, and the ability to spin up massive clustered analytics platforms on demand for short periods of time, allow users to decentralise the analytics workload in a cost-effective way.
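As an illustration of bringing the processing to the data, the sketch below shows a PySpark aggregation that could run on a short-lived, on-demand cluster. The storage paths are illustrative assumptions, and cluster provisioning itself is platform-specific and not shown:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On an ephemeral, on-demand cluster this session attaches to the cluster's
# master; run locally it falls back to an in-process Spark for testing.
spark = SparkSession.builder.appName("on-demand-aggregation").getOrCreate()

# Read directly from cheap object storage (path is illustrative) so the
# processing moves to where the data already lives.
events = spark.read.parquet("s3a://data-lake/raw/events/")

daily = (
    events.groupBy("event_date", "product")
          .agg(F.sum("revenue").alias("revenue"))
)

daily.write.mode("overwrite").parquet("s3a://data-hub/curated/daily_revenue/")
spark.stop()  # the cluster itself can then be torn down to stop incurring cost
```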
Organisations have to be careful not to fall into the trap of approaching their cloud strategy with the monolithic mindset of the last 20 years. A successful strategy is to develop a layered data architecture that hooks decentralised data aggregation levels into a centralised data delivery framework, allowing all parts of the organisation to access data appropriate to their clearance and requirements.
Wider Experimentation with Machine Learning and Artificial Intelligence
Machine Learning (ML) and Artificial Intelligence (AI) have been buzzwords for a long time now. Many R&D-focused organisations have productionised ML and AI implementations, but broader adoption of the technology has been slow.
In the coming years, many more organisations will start experimenting with ML and will find new use cases in which the technology adds value. In data analytics, this will require new skill sets in BI departments: Data Engineers will need advanced knowledge of modern analytics technologies such as Hadoop and Spark, and of a range of Machine Learning algorithms.
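A typical first experiment is deliberately small: a quick model on a question the business already cares about, evaluated on held-out data before anything is productionised. The sketch below uses scikit-learn on synthetic data purely to illustrate the shape of such an experiment; none of the features or results come from a real dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for customer behaviour data (purely illustrative)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))  # e.g. usage and tenure features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Hold-out AUC: {auc:.3f}")  # a quick check of whether the idea has signal
```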
The spectrum of ML training approaches is diverse and the rate of innovation in this field is still high. As a result, any investment in this space needs to be tightly embedded in the long-term data strategy so that the value added is clearly identified before a new project is started.