Background Reading:
- https://en.wikipedia.org/wiki/Data_lake
- "A data lake is a system or repository of data stored in its natural/raw format,[1] usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). [2]"
- "A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended users or is providing little value.[3]"
- https://www.talend.com/resources/what-is-data-lake/
- "A data lake is a central storage repository that holds big data from many sources in a raw, granular format. It can store structured, semi-structured, or unstructured data, which means data can be kept in a more flexible format for future use. When storing data, a data lake associates it with identifiers and metadata tags for faster retrieval."
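The "identifiers and metadata tags" idea above can be sketched as a tiny catalog that stores raw blobs untouched while a separate tag index keeps them findable. This is a minimal illustration only, not any vendor's API; all names here (`Catalog`, `put`, `find_by_tag`) are hypothetical:

```python
# Minimal sketch of a data-lake catalog: raw objects are stored as-is,
# and a separate index of metadata tags makes them retrievable.
# All class and method names are hypothetical illustrations.

class Catalog:
    def __init__(self):
        self.objects = {}   # key -> raw bytes, stored untouched
        self.tags = {}      # key -> set of metadata tags

    def put(self, key, blob, tags):
        """Store a raw blob and index it under the given tags."""
        self.objects[key] = blob
        self.tags[key] = set(tags)

    def find_by_tag(self, tag):
        """Return the keys of all objects carrying the given tag."""
        return sorted(k for k, t in self.tags.items() if tag in t)


catalog = Catalog()
catalog.put("logs/2019-01-01.json", b'{"event": "login"}', ["logs", "json", "2019"])
catalog.put("images/cat.png", b"\x89PNG...", ["images", "binary"])
print(catalog.find_by_tag("logs"))  # -> ['logs/2019-01-01.json']
```

The point of the separation is that ingestion never blocks on schema decisions: tagging is cheap at write time, and richer structure is imposed later, at read time.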
- "Coined by James Dixon, CTO of Pentaho, the term “data lake” refers to the ad hoc nature of data in a data lake, as opposed to the clean and processed data stored in traditional data warehouse systems."
- "A data lake works on a principle called schema-on-read. This means that there is no predefined schema into which data needs to be fitted before storage. Only when the data is read during processing is it parsed and adapted into a schema as needed. This feature saves a lot of time that’s usually spent on defining a schema. This also enables data to be stored as is, in any format."
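Schema-on-read can be shown in a few lines: the same raw JSON records are stored unparsed, and two different consumers each project their own schema only at query time. A sketch under stated assumptions; the field names and records are made up:

```python
import json

# Raw records land in the lake as-is: no schema is enforced at write time.
raw_lines = [
    '{"user": "ann", "amount": "12.50", "ts": "2019-05-01"}',
    '{"user": "bob", "amount": "3.00"}',   # a missing field is fine at write time
]

def read_with_schema(lines, schema):
    """Schema-on-read: parse and coerce fields only when the data is read.
    `schema` maps field name -> converter; missing fields become None."""
    for line in lines:
        rec = json.loads(line)
        yield {field: (conv(rec[field]) if field in rec else None)
               for field, conv in schema.items()}

# Two consumers project different schemas onto the same raw data.
billing = list(read_with_schema(raw_lines, {"user": str, "amount": float}))
audit   = list(read_with_schema(raw_lines, {"user": str, "ts": str}))

print(billing[0]["amount"])  # -> 12.5
print(audit[1]["ts"])        # -> None
```

Contrast this with schema-on-write (the warehouse model), where the second record would be rejected or padded at load time instead of at read time.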
- Also see:
- https://www.talend.com/resources/definitive-guide-cloud-data-warehouses/
- https://www.talend.com/blog/2017/11/20/introducing-data-lake-quick-start-talend-amazon-web-services-cognizant/
- The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science 1st Edition
- Mastering Azure Analytics: Architecting in the Cloud with Azure Data Lake, HDInsight, and Spark 1st Edition
- Practical Enterprise Data Lake Insights: Handle Data-Driven Challenges in an Enterprise Big Data Lake
Additional Suggested Reading:
- Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing 1st Edition
- Spark: The Definitive Guide: Big Data Processing Made Simple 1st Edition
- High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark 1st Edition
- Advanced Analytics with Spark: Patterns for Learning from Data at Scale 2nd Edition
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
- Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale 1st Edition
- An excellent, in-depth treatment...
- Foundations for Architecting Data Solutions: Managing Successful Data Projects 1st Edition
- Big Data: Principles and best practices of scalable realtime data systems 1st Edition (2015)
Amazon AWS Data Lakes:
- https://aws.amazon.com/glue/
- https://aws.amazon.com/kinesis/
- https://aws.amazon.com/kinesis/data-streams/
- https://aws.amazon.com/kinesis/data-firehose/
- https://aws.amazon.com/kinesis/data-analytics/
- https://aws.amazon.com/athena/
Microsoft - Azure:
Microsoft - SQL Server 2019:
- https://www.microsoft.com/en-us/sql-server/sql-server-2019
- https://www.microsoft.com/en-us/sql-server/sql-server-2019-comparison
- https://www.microsoft.com/en-us/sql-server/sql-server-2019-features
- https://www.microsoft.com/en-us/sql-server/sql-server-2019-pricing
- https://www.microsoft.com/en-us/sql-server/sql-server-downloads
- https://docs.microsoft.com/en-us/sql/sql-server/?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/sql-server/editions-and-components-of-sql-server-version-15?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/sql-server/sql-server-version-15-release-notes?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/sql-server/what-s-new-in-sql-server-ver15?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-guide?view=sql-server-ver15
- "PolyBase enables your SQL Server instance to process Transact-SQL queries that read data from external data sources. SQL Server 2016 and higher can access external data in Hadoop and Azure Blob Storage. Starting in SQL Server 2019, you can now use PolyBase to access external data in SQL Server, Oracle, Teradata, and MongoDB."
- https://cloudblogs.microsoft.com/sqlserver/2018/09/25/introducing-microsoft-sql-server-2019-big-data-clusters/
- "SQL Server and Spark are deployed together with HDFS creating a shared data lake"
- "Data sources that can be integrated by PolyBase in SQL Server 2019"
- https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/big-data-cluster/concept-data-pool?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-data-pool-ingest-spark?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-mssql-connector?view=sql-server-ver15
Apache Projects:
- https://flink.apache.org/
- "Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale."
- https://aws.amazon.com/blogs/big-data/use-apache-flink-on-amazon-emr/
- https://flume.apache.org/
- "Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application."
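The source/channel/sink architecture described above is configured in a Java-properties file. A minimal single-agent sketch (assumed paths and names; `a1`, the log file location, and capacities are illustrative only):

```
# Hypothetical single-node Flume agent "a1": tail an application log
# through an in-memory channel and write events to the console logger.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: run `tail -F` on a log file (path is an assumption).
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: buffer up to 1000 events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log events for inspection; in practice this would be an HDFS sink.
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

In a real data-lake pipeline the logger sink would typically be replaced by an HDFS or Kafka sink, and the memory channel by a file channel when durability matters more than throughput.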
- https://hive.apache.org/
- "The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage."
- https://aws.amazon.com/emr/features/hive/
- https://hudi.incubator.apache.org/newsite-content/
- "Hudi brings stream processing to big data, providing fresh data while being an order of magnitude efficient over traditional batch processing."
- https://spark.apache.org/
- "Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming."
- https://aws.amazon.com/emr/features/spark/
Other Open Source Projects:
- https://prestodb.io/
- "Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes."
- https://aws.amazon.com/emr/features/presto/
- https://aws.amazon.com/big-data/what-is-presto/
- "Presto (or PrestoDB) is an open source, distributed SQL query engine, designed from the ground up for fast analytic queries against data of any size. It supports both non-relational sources, such as the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational data sources such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata."
- "Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. You’ll find it used by many well-known companies like Facebook, Airbnb, Netflix, Atlassian, and Nasdaq."
Video Resources: