Book Title:
Azure Data and AI Architect Handbook: Adopt a structured
approach to designing data and AI solutions at scale on Microsoft Azure
Book Details:
Pages: 284
Publication Date: 2023-07-31 (August 2023, in the book)
Author(s):
"It's like reading diluted & neutered sets of Microsoft Azure documentation" (i.e., no rich cross-linking to additional relevant content - and almost no hands-on examples)
Read on, for why I had that feeling...
At 284 pages (but only 245, if we exclude the Index), this book impressively attempts to cover a wide range of information that will be of interest to anyone who wishes to establish an architect-level awareness of Azure data and AI architecture capabilities.
Note: For my review – I read a PDF version of the book that I downloaded from Packt’s web site, AFTER the publication date of the book.
Three key criticisms I have - with almost the entire book:
- A significant lack of additional suggested reading links (beyond just the paltry few citations of Microsoft Azure documentation). There is a severe dearth of reference to other related material, articles, books, research papers - that would deeply enrich the reader's experience - and magnify the educational value of this book.
- With the noticeable exception of Chapter-8, there is a severe paucity of actual detailed examples in the majority of the book's pages.
- The lack of a companion GitHub repository providing hands-on examples.
This book suffers from a lack, in almost all chapters, of any in-depth, detailed discussion – of real-world examples & case studies. In Chapter-3 (Page-39), fraud detection is briefly mentioned – and would have made an EXCELLENT example / case study – on which to elaborate in that chapter.
In almost every instance – the reader would be better served by simply reading the Microsoft Azure documentation – rather than the diluted treatment given to many topics in the various chapters – most of which lack the basic courtesy of pointing the reader to the appropriate online documentation landing page, for the services discussed.
What I liked:
Chapter-3’s discussion of Kappa and Delta Lake architectures.
Chapter-6’s coverage of Data Warehousing (this is the best-written chapter in the entire book, and provides detailed examples to clearly explain concepts).
What could be improved in the next edition:
Better use of color – and consistent use of color - in diagrams.
Page-xvi, hyperlink to errata page is not enabled.
MAJOR MISS: Inclusion of a companion GitHub project for the book, to provide some hands-on exercises.
Chapter-1 (page-4): The first sentence of this book, published in July/August 2023, cites a growth prediction for a year that had already passed: "Data generation is growing at an exponential rate. 90 percent of data in the world was generated in the last 2 years, and global data creation is expected to reach 181 zettabytes in 2022". A better choice would be a projection that reaches 2030, at the very least.
Chapter-1 (Page-7): The Data Architecture reference diagram does not reflect a “Data orchestration and processing” layer – but this is called out in the bullet list enumeration of diagram elements.
Chapter-1 (Page-8): Appears to still have some internal / editor reminder note embedded in the text, re: “(Add what data ingestion services will be discussed later in the book).”
Chapter-1 (Page-9): Appears to still have some internal / editor reminder note embedded in the text, re: “(Add what data storage services will be discussed later in the book).”
Chapter-1 lacks any suggested links, additional reading – to enrich the reader’s experience.
NOTE: This criticism holds TRUE for the MAJORITY of the book's chapters.
Chapter-1 is missing a section to introduce the fundamental concepts of Data Architecture Principles
Chapter-1 would benefit from having a table to provide a comparison of the capabilities across the major Cloud Service Providers (CSPs) – i.e., Azure, AWS, GCP.
Microsoft’s choice of the acronym WAF (for Well-Architected Framework) is unfortunate, as it could easily be confused with the more common usage (Web Application Firewall). For example, on page-18, there is an [incorrect] link to the “Azure Well-Architected Framework review - Azure Application Gateway v2” documentation, which clearly refers to “WAF” in the context of a Web Application Firewall (“Be aware of Application Gateway capacity changes when enabling WAF”).
Chapter-2 (Page-18) – The hyperlink to Microsoft Azure WAF documentation page is incorrect, and not enabled.
Chapter-2 (Page-18) – There is supposed to be a link to refer the reader to the Well-Architected Framework (WAF) main page (re: “For the complete framework…”) – but the link that is provided – is to a sub-page– referring to Application Gateway concerns – “Azure Well-Architected Framework review - Azure Application Gateway v2”.
Chapter-2 (Page-23) - The section on cost optimization discussion – would be better placed near the end of the book, in a dedicated chapter for that topic.
Chapter-2 (page-23) - The advice to “Whenever possible, look for cloud-native offerings to offload your workloads.” seems incongruent with the section’s focus on cost optimization. If you don’t have significant variability in your scalability requirements, and you have sufficient compute power in an existing data center, you may be able to manage some CPU/memory-intensive workloads more efficiently on your existing data center hardware.
Chapter-2 would greatly benefit from some illustrative worked examples showing how costs vary with different deployment choices for a few simple Data Architecture examples, instead of just saying that costs can vary across regions, that network ingress/egress can increase costs, or that hosting in different regions can increase latencies. In particular, it should cite some actual examples from the barely mentioned Azure calculator and the Total Cost of Ownership (TCO) calculator.
Chapter-2 (page-27) - The very brief discussion of “Using data partitioning” – would be much better if it included a discussion of the why, for each strategy mentioned.
Chapter-2 (page-29) – The enumeration of the concepts of Subscriptions, Resource groups, and Management groups – is not in the same order as the hierarchy depicted in the corresponding diagram – which introduces confusion – and needless burden on the reader to mentally CORRECT what they may have thought was safe to infer from the ordering of the list. Rule #1: Make learning EASY for the reader
Chapter-2 (page-29) - The book still refers to the old name ("Azure Active Directory (AAD)"). It should be updated to reflect the new name ("Microsoft Entra ID"), which was announced on July 11th, 2023, BEFORE the book was published.
Chapter-2 (page-30) – “The architecture of the data management landing zone is quite extensive and may be hard to clearly visualize in this book” – supports my belief that this book should actually be closer to 450-650 pages in length.
Chapter-2 (page-30) the link to the data management landing zone is not hyperlink enabled – and when the text is copied – it mangles the link, putting parts of the URL out of their correct order.
Chapter-2 (page-31): "Services shown in color are mandatory for the landing zone, whereas services that appear in gray are optional" (re: Fig 2.2) is *very* confusing, as there don't appear to be any services colored gray. The only gray elements are the layers; the services appear only in black or reddish-orange.
Chapter-3 discusses different strategies for ingestion, but the decision criteria are often embedded in paragraphs. A decision tree or an explicit list of decision criteria would help communicate the information more visually, especially when more than two possible choices are discussed.
Chapter-3 (page-51): The term SHIRs is introduced, and is defined as self-hosted IRs. However, nowhere in the previous pages, was IR defined as an acronym. For the benefit of the reader, the full term should be defined here as Self-Hosted Integration Runtime.
Chapter-3 (page-57): The discussion of Event Hubs should include a link to the “Azure Event Hubs quotas and limits” page in the Azure documentation.
Chapter-6 (page-135): The reference to “The data vault method” – should provide the proper attribution to its creator: "The author of the third approach to the subject of the data warehouse, known as the Data Vault, is Dan Linstedt. The Data Vault is the result of 10 years of his research efforts to ensure the consistency, flexibility and scalability of the warehouse. The first results of his research in this field are five articles on this subject, which were published in 2000. In contrary to Inmon’s view, Linstedt assumes that all available data from the entire time period should be loaded into the warehouse. This is known as the “single version of the facts” approach. As with Kimball’s star schema, with the Data Vault Linstedt introduces some additional objects to organize the data warehouse structure. These objects are referred to as the hub, satellite and link". [source]
Chapter-7 (page-144): "Figure 7.6 – Power BI Premium as a superset of AAS", the light-colored font is *much* more difficult to read.
Chapter-7 should introduce the concepts of taxonomy and ontology – and provide reference to some public domain examples.
For example:
Chapter-8 (page-154): The link to the pricing for Power BI is *very* incongruent with the *complete* lack of reference to any links for other service pricing details – as well as the lack of any citation in the book to the *very important* documentation links for service-specific Quotas and Limits.
Chapter-8 itself – feels like it is VERY out-of-place, and does not feel like it belongs in an ARCHITECT book. It is written to a level of detail for a DEVELOPER, that I WISH the *PREVIOUS* 7 chapters had demonstrated.
Chapter-8 begs the question – why does it delve into the development details – when none of the previous chapters have touched on such matters?
Chapter-9 (pages 185-187): Discusses Azure Cognitive Services (re: Speech, Vision) – but doesn’t connect the dots to how this applies to Data Architecture. Further, the level of discussion barely goes beyond “brochure-ware” – and smells of a marketing ploy – not a chapter intent on teaching how to use the Azure AI services.
Chapter-9 (189-…): Begins discussing the “Azure OpenAI Service”, and though it makes a vague reference to *some* hallucination concerns, it DOES NOT cite the relevant OpenAI papers: the GPT-4 Technical Report (27 March 2023) or the GPT-4 System Card (27 March 2023), the latter of which specifically includes this explicit warning: “In particular, our usage policies prohibit the use of our models and products in the contexts of high risk government decision making (e.g, law enforcement, criminal justice, migration and asylum), or for offering legal or health advice.”
Chapter-10: Does not provide any links to the relevant standards that are cited (i.e., DCAM, DAMA DMBOK)
Chapter-11 (page-228): states “The only significant choice to make here is which version of the TLS protocol to choose: TLS 1.0, TLS 1.1, or TLS 1.2”. This ignores the fact that TLS 1.0 and TLS 1.1 are deprecated and considered vulnerable, so TLS 1.2 should be the minimum version enforced. Further, this sentence should include TLS 1.3. The appropriate NIST guidance for TLS (NIST SP 800-52 Rev. 2) should be cited for the exclusion of TLS 1.0 and TLS 1.1, and for the recommended adoption of TLS 1.2 and TLS 1.3.
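To make the TLS point concrete, here is a minimal sketch (my own illustration, not taken from the book or from the Azure documentation) of what "enforce TLS 1.2 as the floor" looks like on the client side, using Python's standard-library `ssl` module. The helper name `make_client_context` is hypothetical; the only assumption is a Python build linked against an OpenSSL version that supports setting a minimum protocol version.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses TLS 1.0 and TLS 1.1.

    Hypothetical helper for illustration only.
    """
    # create_default_context() keeps certificate and hostname verification on.
    context = ssl.create_default_context()
    # Reject anything older than TLS 1.2; TLS 1.3 is still negotiated
    # when both sides support it, because no maximum version is set here.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_context()
print(context.minimum_version.name)
```

A context built this way can be passed to `urllib.request.urlopen(url, context=context)` or to `context.wrap_socket(...)`; a server that only offers TLS 1.0 or TLS 1.1 would then fail the handshake rather than silently downgrade.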
Book's companion GitHub repository:
N/A - completely missing
*My* Additional Suggested Microsoft Documentation References:
- https://learn.microsoft.com/en-us/azure/architecture/data-guide/
- https://learn.microsoft.com/en-us/azure/architecture/data-guide/big-data/
- https://learn.microsoft.com/en-us/azure/architecture/example-scenario/data/data-warehouse/
- https://learn.microsoft.com/en-us/azure/architecture/solution-ideas/articles/enterprise-data-warehouse/
- https://learn.microsoft.com/en-us/azure/architecture/solution-ideas/articles/advanced-analytics-on-big-data/
- https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/data/enterprise-bi-adf/
- https://learn.microsoft.com/en-us/azure/architecture/example-scenario/data/small-medium-data-warehouse/
- https://learn.microsoft.com/en-us/azure/architecture/example-scenario/analytics/enterprise-bi-synapse/
- https://learn.microsoft.com/en-us/azure/architecture/example-scenario/dataplate2e/data-platform-end-to-end/
- https://learn.microsoft.com/en-us/azure/storage/common/storage-service-encryption
- “Data in Azure Storage is encrypted and
decrypted transparently using 256-bit AES encryption, one of the strongest
block ciphers available, and is FIPS 140-2 compliant.”
- https://learn.microsoft.com/en-us/windows/win32/seccng/cng-portal
- "Cryptography API: Next Generation (CNG) is the
long-term replacement for the CryptoAPI. CNG is designed to be extensible at
many levels and cryptography agnostic in behavior."
- https://en.wikipedia.org/wiki/Advanced_Encryption_Standard
- “At present, there is no known practical
attack that would allow someone without knowledge of the key to read data
encrypted by AES when correctly implemented.”
*My* Additional Suggested Background Reading:
- Building a Scalable Data Warehouse with Data Vault 2.0 (2015, by Dan Linstedt, and Michael Olschimke)
- https://www.snowflake.com/resource/5-best-practices-for-data-warehouse-development/
- https://www.astera.com/type/blog/data-warehouse-concepts/
- https://www.geeksforgeeks.org/data-warehouse-architecture/
- https://www.geeksforgeeks.org/difference-between-kimball-and-inmon/
- https://medium.com/cloudzone/inmon-vs-kimball-the-great-data-warehousing-debate-78c57f0b5e0e
- https://www.incorta.com/blog/death-of-a-star-schema-redux-moving-beyond-inmon-and-kimball
- Historically, there were two models to choose
from: Ralph Kimball’s “bottom-up” approach to mapping atomic data or Bill
Inmon’s “top-down” model. In recent years, however, the technology that
supports BI and data warehousing has evolved rapidly. Now, there is a third
option for data warehousing and BI in a post-star-schema, post-ETL world: non-dimensional
data modeling.
- https://go.incorta.com/recording-death-of-the-star-schema
- https://www.nearshore-it.eu/articles/technologies/data-warehouse-architecture/
- Data warehouses are inextricably associated with
the American computer scientist Bill Inmon, born in 1945,
who is widely considered the father of the data warehouse. In 2007, Bill Inmon
was named by Computerworld as one of the ten people who have had the most
significant impact on IT development in the past 40 years. In 1992, Inmon
defined the data warehouse as follows:
- "A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile collection of data in support of management’s
decision-making process”
- Next to Inmon, Ralph Kimball, born in 1944, is
another key figure in the field of data warehousing. Unlike Inmon’s definition
of a data warehouse, where the emphasis is on the characteristics of the
warehouse, Kimball focuses on its purpose: “a copy of transaction data
specifically structured for query and analysis.”
- The author of the third approach to the subject
of the data warehouse, known as the Data Vault, is Dan Linstedt. The Data Vault is
the result of 10 years of his research efforts to ensure the consistency,
flexibility and scalability of the warehouse. The first results of his research
in this field are five articles on this subject, which were published in 2000.
- "In contrary to Inmon’s view, Linstedt assumes
that all available data from the entire time period should be loaded into the
warehouse. This is known as the “single version of the facts” approach. As with
Kimball’s star schema, with the Data Vault Linstedt introduces some additional
objects to organize the data warehouse structure. These objects are referred to
as the hub, satellite and link."
- https://www.analytics8.com/blog/is-dimensional-data-modeling-still-relevant-in-the-modern-data-stack/
- Is dimensional data modeling still relevant in
the modern data stack?
- Yes—specifically for defining
requirements and creating a modular solution presenting data for analytics.
- In 2017, Gartner estimated that 60% of data warehouse
implementations would have only limited acceptance or fail entirely.
- https://www.gartner.com/en/newsroom/press-releases/2015-09-15-gartner-says-business-intelligence-and-analytics-leaders-must-focus-on-mindsets-and-culture-to-kick-start-advanced-analytics
- YouTube: Kimball in the context of the modern
data warehouse: what's worth keeping, and what's not
- https://www.youtube.com/watch?v=3OcS2TMXELU
- Innovative Approaches for efficiently
Warehousing Complex Data from the Web
- https://arxiv.org/abs/1701.08643
- Toward a New Approach for Modeling Dependability
of Data Warehouse System
- https://arxiv.org/abs/1311.1181
- The End of an Architectural Era for Analytical
Databases
- https://arxiv.org/abs/1209.1425
- An Approach to Handle Big Data Warehouse
Evolution (2018)
- https://arxiv.org/abs/1809.04284
- On building Information Warehouses
- https://arxiv.org/pdf/0910.2638.pdf
- A new paradigm for accelerating clinical data
science at Stanford Medicine
- https://arxiv.org/abs/2003.10534
- Abstract: "Stanford Medicine is building a new data
platform for our academic research community to do better clinical data
science. Hospitals have a large amount of patient data and researchers have
demonstrated the ability to reuse that data and AI approaches to derive novel
insights, support patient care, and improve care quality. However, the
traditional data warehouse and Honest Broker approaches that are in current
use, are not scalable. We are establishing a new secure Big Data platform that
aims to reduce time to access and analyze data. In this platform, data is
anonymized to preserve patient data privacy and made available preparatory to
Institutional Review Board (IRB) submission. Furthermore, the data is
standardized such that analysis done at Stanford can be replicated elsewhere
using the same analytical code and clinical concepts. Finally, the analytics
data warehouse integrates with a secure data science computational facility to
support large scale data analytics. The ecosystem is designed to bring the
modern data science community to highly sensitive clinical data in a secure and
collaborative big data analytics environment with a goal to enable bigger,
better and faster science."
From the Amazon Listing:
"With data’s growing importance in businesses, the need for
cloud data and AI architects has never been higher. The Azure Data and AI
Architect Handbook is designed to assist any data professional or academic
looking to advance their cloud data platform designing skills. This book will
help you understand all the individual components of an end-to-end data
architecture and how to piece them together into a scalable and robust
solution."
"You’ll begin by getting to grips with core data architecture
design concepts and Azure Data & AI services, before exploring cloud
landing zones and best practices for building up an enterprise-scale data
platform from scratch. Next, you’ll take a deep dive into various data domains
such as data engineering, business intelligence, data science, and data
governance. As you advance, you’ll cover topics ranging from learning different
methods of ingesting data into the cloud to designing the right data
warehousing solution, managing large-scale data transformations, extracting
valuable insights, and learning how to leverage cloud computing to drive
advanced analytical workloads. Finally, you’ll discover how to add data
governance, compliance, and security to solutions."
"By the end of this book, you’ll have gained the expertise
needed to become a well-rounded Azure Data & AI architect."
What you will learn
- "Design scalable and cost-effective cloud data
platforms on Microsoft Azure"
- "Explore architectural design patterns with
various use cases"
- "Determine the right data stores and data
warehouse solutions"
- "Discover best practices for data orchestration
and transformation"
- "Help end users to visualize data using
interactive dashboarding"
- "Leverage OpenAI and custom ML models for
advanced analytics"
- "Manage security, compliance, and governance for
the data estate"
Who this book is for
"This book is for anyone looking to elevate their skill set
to the level of an architect. Data engineers, data scientists, business
intelligence developers, and database administrators who want to learn how to
design end-to-end data solutions and get a bird’s-eye view of the entire data
platform will find this book useful. Although not required, basic knowledge of
databases and data engineering workloads is recommended."