Creating Data Products in a Data Mesh, Data Lake or Lakehouse for Use In Analytics
Most companies today store data and run applications in a hybrid, multi-cloud environment. Analytical systems tend to be centralised and siloed: data warehouses and data marts for BI, Hadoop or cloud-storage data lakes for data science, and stand-alone streaming systems for real-time analysis. These centralised systems rely on data engineers and data scientists working within each silo to ingest data from many different sources, then clean and integrate it for use in a specific analytical system or in machine learning models. This centralised, siloed approach has many problems: multiple tools to prepare and integrate data, reinvention of data integration pipelines in each silo, and centralised data engineering teams with a poor understanding of source data that cannot keep pace with business demand for new data. In addition, master data is often not well managed.
To address these issues, new data architectures have emerged that aim to accelerate the creation of data for use in multiple analytical workloads. Data Mesh is a decentralised data architecture with domain-oriented data ownership and decentralised, self-service data engineering that creates a mesh of data products serving multiple analytical systems. Data Lakes can serve the same purpose and, when integrated with Data Warehouses or Lakehouses, allow lower-latency data products to be created once and used in streaming analytics, business intelligence, data science and other analytical workloads.
This 2-day class examines the strengths and weaknesses of data lakes, data mesh and data lakehouses, and shows how multiple domain-oriented teams can use common data infrastructure software to create trusted, compliant, reusable data products in a Data Mesh or Data Lake for use in data warehouses, data lakehouses and data science to drive value. The objective is to shorten time to value while ensuring that data is correctly governed in a decentralised environment. The class also looks at the organisational implications of these architectures and at how to create sharable data products for master data management AND for use in multiple analytical workloads. Technologies discussed include data catalogs, self-service data integration, Data Fabric, DataOps, data warehouse automation, data marketplaces and data governance platforms.
This seminar is intended for business data analysts, data architects, chief data officers, master data management professionals, data scientists, ETL developers, and data governance professionals. It assumes a basic understanding of data management principles and data architecture, plus a reasonable understanding of data cleansing, data integration, data catalogs, data lakes and data governance.
Attendees will learn about:
- Strengths and weaknesses of centralised data architectures used in analytics
- The problems caused in existing analytical systems by a hybrid, multi-cloud data landscape
- What are a Data Mesh, a Data Lake and a Data Lakehouse, and what benefits do they offer?
- What are the principles, requirements, and challenges of implementing these approaches?
- How to organise to create data products in a decentralised environment so you avoid chaos
- The critical importance of a data catalog in understanding what data is available as a service
- How business glossaries can help ensure data products are understood and semantically linked
- An operating model for effective federated data governance
- What common data infrastructure software is required to operate and govern a Data Mesh, a Data Lake or a Data Lakehouse?
- An implementation methodology to produce ready-made, trusted, reusable data products
- Collaborative domain-oriented development of modular and distributed DataOps pipelines to create data products
- How a data catalog and automation software can be used to generate DataOps pipelines
- Managing data quality, privacy, access security, versioning, and the lifecycle of data products
- Publishing semantically linked data products in a data marketplace for others to consume and use
- Consuming data products in an MDM system
- Consuming and assembling data products in multiple analytical systems to shorten time to value
MODULE 1: WHAT IS A DATA MESH, A DATA LAKE AND A LAKEHOUSE? WHY USE THEM?
This session looks at the challenges facing companies trying to become data driven and at the strengths and weaknesses of the centralised data architectures currently used in analytics. It then introduces Data Lakes, Data Mesh and the Data Lakehouse as potential ways to address current problems. It explores the pros and cons of each, explains how they work, and shows how they enable creation of trusted, reusable data products for use in multiple analytical workloads. It also asks whether combining these approaches is advantageous.
- Data complexity in a hybrid, multi-cloud environment
- The growth in new data sources
- Centralised data architectures in use in existing analytical systems
- Strengths and weaknesses of the centralised approach
- What is a Data Mesh?
- Data Mesh principles
- How does decentralised Data Mesh work?
- What is a data product?
- What types of data product can you build?
- Decentralised development of data products
- Pros and cons of Data Mesh
- What are the challenges with this decentralised approach?
- Is data management software ready for Data Mesh?
- How will Data Mesh impact your current IT organisation and data culture?
- Is federated data governance possible?
- Pros and cons of Data Lakes
- The merging of data warehouses and data lakes
- The move from just data science to multi-purpose data lakes
- What is a Data Lakehouse?
- How does a Data Lakehouse work?
- Pros and cons of a Data Lakehouse
- Can you combine Data Lakes, Lakehouses and Data Mesh and why would you do this?
- Implementation requirements to create data products
- Federated operating model
- Common business vocabulary
- Data producers and data consumers
- Architecture independence
- A unified data platform for building any pipeline to process any data
- DataOps – component-based CI/CD pipeline development
- Distributed pipeline execution
- Reusable, semantically linked data products
- Governance of a distributed data landscape
- Key technologies: Data Fabric, Data Catalogs, data classifiers, Data Marketplace, Data Warehouse Automation tools
- Vendors’ offerings in the market – Alation, AWS, BigID, Cambridge Semantics, Collibra, Global IDs, Google, IBM, Informatica, Microsoft, Oracle, Qlik, Talend, SAP, SAS, WhereScape, Zaloni
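To make the idea of a data product concrete, the sketch below models one as a small Python record carrying the attributes a domain team typically publishes alongside the data itself: a business-friendly name, an accountable owner domain, a schema, a freshness SLA and governance classifications. The field names are illustrative assumptions, not taken from any specific Data Mesh tool or standard.

```python
from dataclasses import dataclass, field

# Hypothetical descriptor for a data product; all field names are
# illustrative, not the metadata model of any real product.
@dataclass
class DataProduct:
    name: str                 # business-friendly name, ideally a glossary term
    owner_domain: str         # the domain team accountable for the product
    description: str          # what the data means, written for consumers
    schema: dict              # column name -> type
    sla_freshness_hours: int  # how stale the data is allowed to become
    tags: list = field(default_factory=list)  # governance classifications

    def is_sensitive(self) -> bool:
        """A product is sensitive if any governance tag marks it so."""
        return "PII" in self.tags or "confidential" in self.tags

# Example: a customer data product owned by the sales domain
customers = DataProduct(
    name="customer",
    owner_domain="sales",
    description="Cleansed, deduplicated customer master data",
    schema={"customer_id": "string", "name": "string", "email": "string"},
    sla_freshness_hours=24,
    tags=["PII"],
)
print(customers.is_sensitive())  # True - contains personal data
```

Publishing this descriptor in a data catalog or marketplace is what lets consumers discover the product and lets governance tooling act on its classifications.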
MODULE 2: METHODOLOGIES FOR CREATING DATA PRODUCTS
This session looks at how to produce business-ready, reusable data products for data consumers who need them to drive business value across multiple analytical use cases. It also looks at how master data products can be produced for use in master data management.
- Creating a program office
- Decentralised development of data products in a Data Mesh, Data Lake or Lakehouse
- The special and critical case of master data
- A best practice step-by-step methodology for building reusable data products
- How does structured, semi-structured and unstructured data impact the methodology?
- Applying DataOps development practices to data product development
MODULE 3: USING A BUSINESS GLOSSARY TO DEFINE DATA PRODUCTS
This session looks at how you can create common data names and definitions for your data products in a business glossary so that data consumers can understand the meaning of the data produced and made available in a Data Mesh or a Data Lake. It also looks at how business glossaries have become part of the data catalog.
- Why is a common vocabulary relevant?
- Data catalogs and the business glossary
- The Data Catalog market, e.g., Alation, Amazon Glue, Cambridge Semantics ANZO Data Catalog, Collibra Catalog, Data.world, Denodo Data Catalog, Google Data Catalog, Hitachi Vantara Lumada, IBM Watson Knowledge Catalog, Informatica Axon and EDC, Microsoft Azure Purview Data Catalog, Qlik Catalog, Zaloni Data Platform
- Roles, responsibilities, and processes needed to manage a business glossary
- Jumpstarting a business glossary with a data concept model
- Defining data products using glossary terms
- Using a catalog and glossary to ensure data products are semantically linked
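The idea of semantic linkage via a glossary can be sketched very simply: common terms carry agreed definitions, data product fields reference those terms, and two products are linked wherever they share a term. The glossary entries and product names below are invented for illustration.

```python
# Minimal sketch of a business glossary: shared terms with agreed
# definitions, plus data products whose fields reference those terms
# so products built by different domains stay semantically linked.
glossary = {
    "customer_id": "Unique identifier of a customer across all channels",
    "order_value": "Total value of an order in the base currency",
}

data_products = {
    "sales.orders":    {"fields": ["customer_id", "order_value"]},
    "support.tickets": {"fields": ["customer_id", "ticket_id"]},
}

def undefined_terms(product_name):
    """Fields with no glossary definition - a governance gap to close."""
    fields = data_products[product_name]["fields"]
    return [f for f in fields if f not in glossary]

def shared_terms(p1, p2):
    """Fields two products have in common: the semantic link between them."""
    return set(data_products[p1]["fields"]) & set(data_products[p2]["fields"])

print(shared_terms("sales.orders", "support.tickets"))  # {'customer_id'}
print(undefined_terms("support.tickets"))               # ['ticket_id']
```

In a real catalog the same check would run over registered metadata, flagging fields that lack an agreed business definition before a product is published.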
MODULE 4: STANDARDISING DEVELOPMENT AND OPERATIONS IN A DATA MESH, DATA LAKE OR LAKEHOUSE
This session looks at how to standardise the setup in each business domain to optimise development of data products in a Data Mesh, a Data Lake or a Lakehouse.
- The importance of a program office
- Implementing Data Mesh on a single cloud versus a hybrid, multi-cloud environment
- Implementing a Data Lake or Lakehouse
- Standardising the domain implementation process – ingest, process, persist, serve
- Creating zones in a Data Mesh domain, a Data Lake or Lakehouse to produce and persist data products
- Selecting Data Fabric software for building data products
- Step-by-step data product development
- Data source registration
- Automated data discovery, profiling, sensitive data detection, governance classification, lineage extraction and cataloguing
- Data ingestion
- Global and domain policy creation for federated governance of classified data
- Data product pipeline development
- Data product publishing for consumption
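As a rough illustration of the step-by-step flow above (register a source, profile and classify it automatically, then publish it for consumption), here is a minimal Python sketch. Every function, the catalog structure and the sensitive-data rule are hypothetical placeholders, not the API of any real catalog or marketplace.

```python
# Hedged sketch of the standardised domain process: registration,
# automated profiling/classification, then governed publishing.
def register_source(catalog, source):
    """Step 1: register the data source in the catalog."""
    catalog[source] = {"profiled": False, "classified": []}

def profile_and_classify(catalog, source):
    """Step 2: automated discovery - here a trivial stand-in rule
    that flags email-bearing sources as sensitive."""
    catalog[source]["profiled"] = True
    if "email" in source:
        catalog[source]["classified"].append("PII")

def publish(catalog, source, marketplace):
    """Step 3: governance gate - only profiled sources may be published."""
    if not catalog[source]["profiled"]:
        raise ValueError(f"{source} must be profiled before publishing")
    marketplace.append(source)

catalog, marketplace = {}, []
register_source(catalog, "crm.customer_email")
profile_and_classify(catalog, "crm.customer_email")
publish(catalog, "crm.customer_email", marketplace)
print(marketplace)  # ['crm.customer_email']
```

The point of the ordering is that classification happens before publication, so downstream access policies can be attached to the classifications rather than to individual datasets.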
MODULE 5: BUILDING DATAOPS PIPELINES TO CREATE MULTI-PURPOSE DATA PRODUCTS
This session looks at designing and developing modular DataOps pipelines to produce trusted data products using Data Fabric software.
- Collaborative pipeline development & orchestration to produce data products
- Designing component based DataOps pipelines to produce data products
- Using CI/CD to accelerate development, testing and deployment
- Designing in sensitive data protection in pipelines
- Processing streaming data in a pipeline
- Processing unstructured data in a pipeline using ML
- Generating data pipelines using Data Warehouse Automation tools
- Making data products available for consumption in a Data Mesh or Data Lake using a data marketplace
- The Enterprise Data Marketplace – enabling information consumers to shop for data
- Serving up trusted data products for use in multiple analytical systems and in MDM
- Consuming data products in other pipelines for use in data warehouses, lakehouses, data science sandboxes, graph analysis and MDM
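The component-based pipeline idea above can be sketched in a few lines: each step is a small, independently testable function, and a pipeline is just an ordered composition of steps, so components (including sensitive-data protection) can be reused across domains. This is a tool-agnostic illustration, not the design of any specific Data Fabric product.

```python
# Sketch of a component-based DataOps pipeline: small reusable steps
# composed into a pipeline that turns raw records into a data product.
def deduplicate(rows):
    """Reusable quality component: keep the first record per customer."""
    seen, out = set(), []
    for r in rows:
        if r["customer_id"] not in seen:
            seen.add(r["customer_id"])
            out.append(r)
    return out

def mask_email(rows):
    """Sensitive-data protection designed into the pipeline itself."""
    return [{**r, "email": "***"} for r in rows]

def run_pipeline(rows, steps):
    """A pipeline is just an ordered list of components."""
    for step in steps:  # each component is independently unit-testable
        rows = step(rows)
    return rows

raw = [
    {"customer_id": 1, "email": "a@x.com"},
    {"customer_id": 1, "email": "a@x.com"},  # duplicate record
    {"customer_id": 2, "email": "b@y.com"},
]
product = run_pipeline(raw, [deduplicate, mask_email])
print(len(product))  # 2
```

Because each component is a pure function, it fits naturally into CI/CD: components are unit-tested in isolation and pipelines are assembled, versioned and deployed like any other software artefact.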
MODULE 6: IMPLEMENTING FEDERATED DATA GOVERNANCE TO PRODUCE AND USE COMPLIANT DATA PRODUCTS
With data highly distributed across so many data stores and applications, on-premises, in multiple clouds and at the edge, many companies are struggling to govern data throughout its lifecycle. This is critically important in a Data Mesh, where federated computational data governance is a fundamental principle, data product development is decentralised, and data products are shared and consumed across the organisation. It is also paramount in a Data Lakehouse and across the whole hybrid, multi-cloud data landscape. This session looks at how this can be achieved.
- What is involved in federated data governance?
- How do you implement this across a hybrid, multi-cloud distributed data landscape?
- Understanding compliance obligations
- Types of data governance policies
- Understanding global vs. local policies when creating a Data Mesh, a Data Lake or a Data Lakehouse
- Defining sensitive data types
- Using the data catalog for automated data profiling, quality scoring and sensitive data type classification
- Defining and attaching policies to classified data in a data catalog
- Creating sharable master data products and reference data products for MDM and RDM
- Ensuring data quality in data product development
- Protecting sensitive data in data product development for data privacy compliance
- Governing data product version management
- Governing consumer access to data products containing sensitive data
- Preventing accidental oversharing of sensitive data products using DLP
- Governing data retention of data products in-line with compliance and legal holds
- Monitoring and data stewarding to ensure policy enforcement
- Data catalog and data fabric technologies to help govern data across a distributed data landscape
- Types of data governance offerings
- Alation, Ataccama, Collibra, Dataguise
- Google Cloud IAM, Data Catalog, BigQuery, Dataplex and DLP
- IBM Cloud Pak for Data, Watson Knowledge Catalog, Optim & Guardium
- Hitachi Vantara, Immuta, Imperva
- Informatica EDC and Axon
- Microsoft 365 Compliance Centre and Azure Purview
- Okera, OneTrust Data Governance Suite
- Oracle Enterprise Data Management Cloud, Privitar, SAP Data Intelligence
- Talend, TopQuadrant
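The global-versus-local policy split that underpins federated governance can be illustrated as policy-as-code: a global policy set by the program office applies everywhere, and each domain may add stricter local policies that take precedence. The policy names, classifications and actions below are invented for illustration only.

```python
# Sketch of federated computational governance: global policies apply
# to every domain; a domain's local policy, where defined, overrides.
GLOBAL_POLICIES = {"PII": "mask"}           # set by the program office

LOCAL_POLICIES = {
    "finance": {"account_number": "deny"},  # finance domain is stricter
}

def effective_action(domain, classification):
    """Local policy overrides global; the default is to allow access."""
    local = LOCAL_POLICIES.get(domain, {})
    return local.get(classification,
                     GLOBAL_POLICIES.get(classification, "allow"))

print(effective_action("sales", "PII"))               # mask  (global)
print(effective_action("finance", "account_number"))  # deny  (local)
print(effective_action("sales", "public"))            # allow (default)
```

Expressing policies against classifications rather than individual datasets is what makes governance computational: once the catalog classifies a column as PII, the masking rule follows it automatically across every domain.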
MIKE FERGUSON
Managing Director, Intelligent Business Strategies Limited
Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an independent analyst and consultant he specialises in data management and analytics. With over 38 years of IT experience, Mike has consulted for dozens of companies. He has spoken at events all over the world and written numerous articles.
Mike is Chairman of Big Data LDN – the fastest growing Big Data conference in Europe – and chairman of the CDO Exchange. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model – a Chief Architect at Teradata on the Teradata DBMS, and European Managing Director of Database Associates. He teaches popular master classes in Analytics, Big Data, Data Governance & MDM, Data Warehouse Modernisation and Data Lake operations.