Background Reading:
- https://en.wikipedia.org/wiki/Data_lake
- "A data lake is a system or repository of data stored in its natural/raw format,[1] usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). [2]"
- "A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended users or is providing little value.[3]"
- https://www.talend.com/resources/what-is-data-lake/
- "A data lake is a central storage repository that holds big data from many sources in a raw, granular format. It can store structured, semi-structured, or unstructured data, which means data can be kept in a more flexible format for future use. When storing data, a data lake associates it with identifiers and metadata tags for faster retrieval."
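The "identifiers and metadata tags" idea above can be sketched as a tiny catalog that stores raw blobs untouched while a separate tag index keeps them findable. This is a minimal illustration only, not any vendor's API; all names here (`Catalog`, `put`, `find_by_tag`) are hypothetical:

```python
# Minimal sketch of a data-lake catalog: raw objects are stored as-is,
# and a separate index of metadata tags makes them retrievable.
# All class and method names are hypothetical illustrations.

class Catalog:
    def __init__(self):
        self.objects = {}   # key -> raw bytes, stored untouched
        self.tags = {}      # key -> set of metadata tags

    def put(self, key, blob, tags):
        """Store a raw blob and index it under the given tags."""
        self.objects[key] = blob
        self.tags[key] = set(tags)

    def find_by_tag(self, tag):
        """Return the keys of all objects carrying the given tag."""
        return sorted(k for k, t in self.tags.items() if tag in t)


catalog = Catalog()
catalog.put("logs/2019-01-01.json", b'{"event": "login"}', ["logs", "json", "2019"])
catalog.put("images/cat.png", b"\x89PNG...", ["images", "binary"])
print(catalog.find_by_tag("logs"))  # -> ['logs/2019-01-01.json']
```

The point of the separation is that ingestion never blocks on schema decisions: tagging is cheap at write time, and richer structure is imposed later, at read time.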
- "Coined by James Dixon, CTO of Pentaho, the term “data lake” refers to the ad hoc nature of data in a data lake, as opposed to the clean and processed data stored in traditional data warehouse systems."
- "A data lake works on a principle called schema-on-read. This means that there is no predefined schema into which data needs to be fitted before storage. Only when the data is read during processing is it parsed and adapted into a schema as needed. This feature saves a lot of time that’s usually spent on defining a schema. This also enables data to be stored as is, in any format."
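Schema-on-read can be shown in a few lines: the same raw JSON records are stored unparsed, and two different consumers each project their own schema only at query time. A sketch under stated assumptions; the field names and records are made up:

```python
import json

# Raw records land in the lake as-is: no schema is enforced at write time.
raw_lines = [
    '{"user": "ann", "amount": "12.50", "ts": "2019-05-01"}',
    '{"user": "bob", "amount": "3.00"}',   # a missing field is fine at write time
]

def read_with_schema(lines, schema):
    """Schema-on-read: parse and coerce fields only when the data is read.
    `schema` maps field name -> converter; missing fields become None."""
    for line in lines:
        rec = json.loads(line)
        yield {field: (conv(rec[field]) if field in rec else None)
               for field, conv in schema.items()}

# Two consumers project different schemas onto the same raw data.
billing = list(read_with_schema(raw_lines, {"user": str, "amount": float}))
audit   = list(read_with_schema(raw_lines, {"user": str, "ts": str}))

print(billing[0]["amount"])  # -> 12.5
print(audit[1]["ts"])        # -> None
```

Contrast this with schema-on-write (the warehouse model), where the second record would be rejected or padded at load time instead of at read time.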
- Also see:
- https://www.talend.com/resources/definitive-guide-cloud-data-warehouses/
- https://www.talend.com/blog/2017/11/20/introducing-data-lake-quick-start-talend-amazon-web-services-cognizant/
- The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science 1st Edition
- Mastering Azure Analytics: Architecting in the Cloud with Azure Data Lake, HDInsight, and Spark 1st Edition
- Practical Enterprise Data Lake Insights: Handle Data-Driven Challenges in an Enterprise Big Data Lake
Additional Suggested Reading:
- Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing 1st Edition
- Spark: The Definitive Guide: Big Data Processing Made Simple 1st Edition
- High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark 1st Edition
- Advanced Analytics with Spark: Patterns for Learning from Data at Scale 2nd Edition
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
- Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale 1st Edition
- An excellent, in-depth treatment...
- Foundations for Architecting Data Solutions: Managing Successful Data Projects 1st Edition
- Big Data: Principles and best practices of scalable realtime data systems 1st Edition (2015)
Amazon AWS Data Lakes:
- https://aws.amazon.com/glue/
- https://aws.amazon.com/kinesis/
- https://aws.amazon.com/kinesis/data-streams/
- https://aws.amazon.com/kinesis/data-firehose/
- https://aws.amazon.com/kinesis/data-analytics/
- https://aws.amazon.com/athena/
Microsoft - Azure:
Microsoft - SQL Server 2019:
- https://www.microsoft.com/en-us/sql-server/sql-server-2019
- https://www.microsoft.com/en-us/sql-server/sql-server-2019-comparison
- https://www.microsoft.com/en-us/sql-server/sql-server-2019-features
- https://www.microsoft.com/en-us/sql-server/sql-server-2019-pricing
- https://www.microsoft.com/en-us/sql-server/sql-server-downloads
- https://docs.microsoft.com/en-us/sql/sql-server/?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/sql-server/editions-and-components-of-sql-server-version-15?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/sql-server/sql-server-version-15-release-notes?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/sql-server/what-s-new-in-sql-server-ver15?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-guide?view=sql-server-ver15
- "PolyBase enables your SQL Server instance to process Transact-SQL queries that read data from external data sources. SQL Server 2016 and higher can access external data in Hadoop and Azure Blob Storage. Starting in SQL Server 2019, you can now use PolyBase to access external data in SQL Server, Oracle, Teradata, and MongoDB."
- https://cloudblogs.microsoft.com/sqlserver/2018/09/25/introducing-microsoft-sql-server-2019-big-data-clusters/
- "SQL Server and Spark are deployed together with HDFS creating a shared data lake"
- "Data sources that can be integrated by PolyBase in SQL Server 2019"
- https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/big-data-cluster/concept-data-pool?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-data-pool-ingest-spark?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-mssql-connector?view=sql-server-ver15
Apache Projects:
- https://flink.apache.org/
- "Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale."
- https://aws.amazon.com/blogs/big-data/use-apache-flink-on-amazon-emr/
- https://flume.apache.org/
- "Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application."
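The source/channel/sink architecture described above is configured in a Java-properties file. A minimal single-agent sketch (assumed paths and names; `a1`, the log file location, and capacities are illustrative only):

```
# Hypothetical single-node Flume agent "a1": tail an application log
# through an in-memory channel and write events to the console logger.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: run `tail -F` on a log file (path is an assumption).
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: buffer up to 1000 events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log events for inspection; in practice this would be an HDFS sink.
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

In a real data-lake pipeline the logger sink would typically be replaced by an HDFS or Kafka sink, and the memory channel by a file channel when durability matters more than throughput.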
- https://hive.apache.org/
- "The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage."
- https://aws.amazon.com/emr/features/hive/
- https://hudi.incubator.apache.org/newsite-content/
- "Hudi brings stream processing to big data, providing fresh data while being an order of magnitude efficient over traditional batch processing."
- https://spark.apache.org/
- "Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming."
- https://aws.amazon.com/emr/features/spark/
Other Open Source Projects:
- https://prestodb.io/
- "Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes."
- https://aws.amazon.com/emr/features/presto/
- https://aws.amazon.com/big-data/what-is-presto/
- "Presto (or PrestoDB) is an open source, distributed SQL query engine, designed from the ground up for fast analytic queries against data of any size. It supports both non-relational sources, such as the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational data sources such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata."
- "Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. You’ll find it used by many well-known companies like Facebook, Airbnb, Netflix, Atlassian, and Nasdaq."
Video Resources: