Snowflake Data Cloud

1. SNOWFLAKE DATA CLOUD
SHENZHEN ARCHSUMMIT 2021
Jiaqi Yan, Principal Software Engineer
© 2020 Snowflake Inc. All Rights Reserved
2. QUICK HISTORY
• Founded August 2012
• Founding Team January 2013
• 2015 – GA
• 2018 – Azure port
• 2020 – GCP port
• 2020 – “The Data Cloud” launched and IPO
• 3,500+ active customers, 2000+ employees
• 1B+ queries daily
• 300+ PB total storage (compressed), biggest table 68TN rows
• 1.2k data providers
• NPS 71, industry average 21
3. WHY SNOWFLAKE?
No Good Solution to Tackle Modern Data Challenges
• More Data: Human Generated, Structured ➔ Machine Generated, Semi-Structured
• More Users: Few Analysts ➔ Everyone
• Faster Answers: Daily or Less Frequent ➔ Real-Time, Interactive
• Diverse Data Sources: Few, mostly internal ➔ Many, both internal and external (suppliers, customers, competitors, public, …)
4. SNOWFLAKE DATA CLOUD
Data Analytics Platform Built for the Cloud Era
UNLIMITED SCALE
• All Data
• All Users
• Instant Elasticity
• One System
COLLABORATIVE
• Data Sharing
• Data Services
• Existing Content
  - Multi-region
  - Multi-cloud
• No Compromise
SIMPLICITY
• Self-Managed Service
• No Tuning Knobs
• Democratize Data Analytics
5. UNLIMITED SCALE
6. TRADITIONAL DATABASE ARCHITECTURES
Limited Scalability, Not Elastic
Shared-nothing
• Distributed Storage
• Single Cluster
• Adopted by Gamma, Teradata, Redshift, Vertica, Netezza, …
Shared-disk
• Centralized Storage
• Single Cluster
• Adopted by Oracle, Hadoop
7. SNOWFLAKE REGION ARCHITECTURE
Multi-cluster, Shared-data
[Diagram: Client(s) (ODBC, JDBC, Web UI, Python, NodeJS, Spark, …) connect over REST to Cloud Services (Authentication & access control, Infrastructure manager, Optimizer, Transaction Manager, Security, Metadata). Compute/Storage Layers: several Virtual Warehouses on top of the Cloud Object Store.]
8. STORAGE TIER
● Immutable Storage (Cloud Object Store)
  ○ Each table is automatically partitioned horizontally
  ○ Partition size is kept very small, generally 16MB
  ○ Each partition is backed by an immutable file
  ○ Partitions are columnar organized, compressed, encrypted
  ○ Partitions are the unit of change for transactions
● Semi-structured
  ○ Variant data type used to store schemaless semi-structured data
  ○ Automatic columnarization of semi-structured attributes
● Partition Metadata
  ○ Out-of-box, metadata is automatically stored for all columns/sub-columns in a partition
  ○ Leverage that metadata to perform partition pruning
  ○ Re-clustering service to improve pruning
  ○ Track all table mutations to provide full ACID support
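The partition-pruning idea on the storage tier slide can be illustrated with a minimal Python sketch. It assumes the stored metadata is a per-partition min/max value for the filtered column; the slide only says "metadata", so that is an assumption. The `Partition` class, the `prune` function and the sample values are hypothetical and do not reflect Snowflake's internal format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen mirrors the "immutable file" property
class Partition:
    file_name: str
    min_value: int  # smallest value of the filtered column in this partition
    max_value: int  # largest value of the filtered column in this partition


def prune(partitions, low, high):
    """Keep only partitions whose [min, max] range overlaps [low, high]."""
    return [p for p in partitions if p.max_value >= low and p.min_value <= high]


partitions = [
    Partition("part-0", 1, 100),
    Partition("part-1", 101, 200),
    Partition("part-2", 201, 300),
]

# A filter such as "col BETWEEN 150 AND 220" never has to read part-0.
selected = prune(partitions, 150, 220)
print([p.file_name for p in selected])
```

The better the data is clustered on the filtered column, the narrower each min/max range becomes and the more partitions can be skipped, which is presumably why the slide pairs pruning with a re-clustering service.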
9. COMPUTE TIER
● Virtual warehouse
  ○ Snowflake entity used to manage the set of compute resources used by a workload
  ○ Made of one or more compute clusters
  ○ Cluster size ranges from one to several hundred nodes
  ○ Workloads are fully isolated from each other
● Just-in-time Compute
  ○ Sub-second auto-resume when associated workload starts
  ○ Online resize up and down possible while workload runs
  ○ Auto-suspend when workload is no longer running
  ○ Snowflake charges usage by second of compute resource used ➔ FAST is free!
● Partition Cache
  ○ Node local memory and SSD storage used to cache partitions
  ○ Only columns/sub-columns which are accessed are cached
  ○ Highly available, fully stateless
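The "FAST is free" claim follows from per-second billing, and a small arithmetic sketch in Python makes the reasoning explicit. It assumes a constant price per node-second and perfectly linear speed-up when nodes are added; both are simplifications, and `RATE` and `WORK` are hypothetical numbers, not Snowflake pricing.

```python
import math

RATE = 0.001  # hypothetical price per node-second, not a real Snowflake rate
WORK = 3600   # hypothetical job size, in node-seconds


def run_time(nodes):
    """Seconds to finish the job, assuming perfectly linear speed-up."""
    return WORK / nodes


def cost(nodes, seconds, rate_per_node_second):
    """Per-second billing: pay for nodes x seconds actually used."""
    return nodes * seconds * rate_per_node_second


small = cost(4, run_time(4), RATE)  # 4 nodes for 900 seconds
large = cost(8, run_time(8), RATE)  # 8 nodes for 450 seconds
print(run_time(4), run_time(8), math.isclose(small, large))
```

Under these assumptions, doubling the cluster halves the run time at the same cost. Real workloads rarely scale perfectly, so in practice the larger cluster can cost somewhat more.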
10. CLOUD SERVICES
[Diagram: Cloud services: REST, Authentication & access control, Infrastructure manager, Optimizer, Transaction Manager, Security, Metadata]
● Control Plane of a Snowflake Region
  ○ Connection Management
  ○ Infrastructure Provisioning and Management
  ○ Metadata storage (uses FDB) & management
  ○ Query planning and optimization
  ○ Transaction management
  ○ Security management
● Self-managed
  ○ Self-upgrade of both software and hardware
  ○ Self-healing: replacement of failed servers and transparent re-execution of any failed queries
  ○ Highly available over multiple availability zones
  ○ Stateless: persistent sessions for load-balancing and transparent fail-over
11. MULTIPLE WORKLOAD TYPE
SNOWFLAKE DATA CLOUD: One Integrated Platform Supporting Multiple Workload Types
Complete SQL, ACID, Low-latency, High-concurrency, UDFs, UDTs, Data Governance, Stored Procedures, Streaming Ingest, Tasks, Table Streams, External Functions, Data Pipelines, Semi-structured Data, Unstructured Data, External Tables, Java/Scala/Python Data Frames, REST APIs, Real-time
12. SNOWFLAKE DATA CLOUD REGION
13. SNOWFLAKE DATA CLOUD (2015)
Single Data Cloud Region (AWS)
First Snowflake Region: AWS-US-WEST
[Map legend: Snowflake Region (AWS)]
14. SNOWFLAKE DATA CLOUD (2021)
22 Data Cloud Regions (10 countries, 3 clouds)
GLOBAL DATA MESH
[Map legend: Snowflake Region (AWS), Snowflake Region (Azure), Snowflake Region (GCP)]
15. BUILDING A LARGE-SCALE GLOBAL SERVICE
Lessons Learned
Way harder than anticipated…
• Customers expect at least three 9’s (99.9%) of availability, 24x7
• At large scale, anything will happen. Hence we need to proactively anticipate and defend
• Everything needs to be fully automated and fully adaptive
• As much as possible self-managed versus dev-ops automation
• Keeping up with exponential growth ➔ scaling cloud services and removing bottlenecks
• Weekly release without introducing (visible) regressions
… but so much faster development cycles
• We have built a top-notch and feature-rich platform in only a few years!
• Weekly release worldwide with a single version to maintain
• Virtuous cycle – data-driven development to identify and prioritize feature development
  • For example, we focus on improving DMLs and transaction processing since these dominate
• Snowflake platform is extensively instrumented ➔ we generate many terabytes of service data daily
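To put the availability expectation in concrete terms, the following Python sketch converts a number of nines into the downtime it allows per year. The function name is hypothetical, and the calculation assumes a 365-day year.

```python
def downtime_minutes_per_year(nines):
    """Allowed downtime for an availability of `nines` nines, e.g. 3 -> 99.9%."""
    unavailability = 10 ** -nines
    minutes_per_year = 365 * 24 * 60  # ignores leap years
    return unavailability * minutes_per_year


for n in (3, 4, 5):
    print(n, round(downtime_minutes_per_year(n), 2))
```

Three nines leaves roughly 525.6 minutes (about 8.8 hours) of downtime per year across all causes, which is a tight budget for a service that ships weekly releases.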
16. COLLABORATIVE
17. DATA COLLABORATION
Traditional Way
Data providers
  1. Export data to files
  2. Publish schema
  3. Stage files for transport
Data customers
  1. Additional infrastructure
  2. Forced to recreate data structure
  3. Delayed updates to data
• Redundant
• Inflexible
• Inefficient
• Insecure
• Expensive
18. SNOWFLAKE DATABASE SHARING
Provider Account
  CREATE SHARE SH1;
  GRANT … TO SHARE SH1 ….;
Consumer Account(s), with their own Warehouse(s)
  CREATE DATABASE DB1 FROM SHARE SH1;
  Cross-database/account join: SELECT … FROM …;
  Sharing code: Execute … SP;
19. SNOWFLAKE DATA MARKETPLACE
READY TO USE DATABASES FROM MULTIPLE PROVIDERS
• Live, ready-to-query data; no copying or moving
• Only data marketplace with personalized data
• Globally available, across clouds
Categories: Financial, Marketing, Demographic, Macroeconomic, Government, Healthcare, Business
20. CONNECTED THROUGH DATA
Covid-19 Database (from Starschema)
Industry: Adtech / Marketing, Energy / Utilities, Financial Services, Healthcare, Hospitality, Manufacturing, Media / Entertainment, Other, Retail, Technology
21. SNOWFLAKE DATABASE SHARING
Conclusion: Secure, Live, Frictionless, Personalized, Global
22. SIMPLICITY
23. WHY SIMPLICITY MATTERS
Manage Data, Not Infrastructure!
• Infrastructure: Initial Setup, Upgrading, Patching, Capacity Planning, Storage, Security
• Physical Design: Partitioning, Indexing, Ordering, Vacuuming
• Data Collaboration: Loading, Moving, Transforming, Copying, Securing
• Query Tuning: Statistic Collection, Memory Management, Parallelism, Query Plan Hinting, Workload Management
• Availability: Setup High Availability, Handle Hardware Faults, Manage Backups
24. SNOWFLAKE CLOUD DATA PLATFORM
Minimal Administration
• Infrastructure: Initial Setup, Upgrading, Patching, Capacity Planning, Storage, Security
• Physical Design: Partitioning, Indexing, Ordering, Vacuuming
• Data Collaboration: Loading, Moving, Transforming, Copying, Securing
• Query Tuning: Statistic Collection, Memory Management, Parallelism, Query Plan Hinting, Workload Management
• Availability: Replication, Backups, Re-Clustering, Account Management
Simply load/share data and run queries
25. Infrastructure
• Initial Setup: NA – Service is always on
• Upgrading, Patching: Automatic – Performed weekly by Cloud Services
• Capacity Planning: NA – Just-in-time compute
• Storage: NA – unlimited storage, spill to blob storage
• Security: Automatic – Encryption, Monitoring, ...
26. Physical Design
• Partitioning: NA – automatic at load time
• Ordering: Default/Automatic Clustering
• Indexing: Search optimization service
• Vacuuming: NA – Immutable partitions
27. Data Collaboration
• Loading, Moving, Transforming, Copying, Securing: Live and Secure Data Sharing
28. Query Tuning
• Statistic Collection: Automatic – at DML time
• Memory Management: Automatic – Cooperative Memory Brokering
• Parallelism: Automatic – Adaptive
• Query Plan Hinting: Robust adaptive execution strategy (dynamic join filters, adaptive push down and distribution methods, join skew resilience)
• Workload Management: Virtual warehouse per workload, auto-scale multi-cluster warehouse
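The slide names dynamic join filters without describing them. As a rough illustration of the general technique only, and not of Snowflake's implementation, the Python sketch below collects the join keys seen on the build side at run time and uses them to discard probe-side rows before the join. The function name and the sample tables are hypothetical.

```python
def hash_join_with_dynamic_filter(build_rows, probe_rows, key):
    """Hash join that filters the probe side with keys seen on the build side."""
    # Build phase: hash table on the (smaller) build side.
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)

    # Dynamic filter: the set of build keys, known only at run time.
    # An engine would typically push a compact form of it (for example a
    # min/max range or a Bloom filter) down to the probe-side scan.
    join_filter = set(table)
    filtered_probe = [row for row in probe_rows if row[key] in join_filter]

    # Probe phase: only rows that passed the filter reach the join.
    joined = [{**b, **p} for p in filtered_probe for b in table[p[key]]]
    return joined, len(filtered_probe)


customers = [{"cust_id": 1, "name": "a"}, {"cust_id": 2, "name": "b"}]
orders = [{"cust_id": c, "order_id": i} for i, c in enumerate([1, 3, 3, 2, 4, 1])]

rows, probed = hash_join_with_dynamic_filter(customers, orders, "cust_id")
print(len(rows), probed)
```

In this toy run half of the order rows are dropped before they reach the join. The benefit grows when the filter can be applied at scan time, where it can combine with partition metadata to skip whole partitions.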
29. Availability
• Setup High Availability: Out-of-box – Snowflake Architecture Multi-AZ; Disaster Recovery – Cross-region Replication
• Handle Hardware Faults: Automatic – Snowflake Cloud Services detects and replaces faulty hardware
• Backups: Automatic – blob storage with 11 9’s durability, Undrop, Clone as-of, time travel, Fail-Safe
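Features such as time travel and clone as-of fit naturally with the immutable partition files described on the storage tier slide. The Python sketch below is a toy model under that assumption: each table version is simply a tuple of partition file names, and changes add or drop files rather than rewriting them. The `Table` class and its methods are hypothetical and say nothing about how Snowflake actually stores versions.

```python
class Table:
    """Toy table whose versions are tuples of immutable partition file names."""

    def __init__(self, versions=None):
        self.versions = list(versions) if versions else [()]

    def commit(self, added=(), removed=()):
        # A change never rewrites a file: it records a new version that
        # drops some partition files and adds others.
        current = self.versions[-1]
        kept = tuple(f for f in current if f not in removed)
        self.versions.append(kept + tuple(added))

    def as_of(self, version):
        """Time travel: the partition files visible at an older version."""
        return self.versions[version]

    def clone_as_of(self, version):
        """Clone: copies metadata only; partition files are shared."""
        return Table(self.versions[: version + 1])


t = Table()
t.commit(added=["p1", "p2"])            # version 1
t.commit(added=["p3"], removed=["p1"])  # version 2: an update replaced p1 with p3
snapshot = t.clone_as_of(1)
print(t.as_of(2), snapshot.as_of(-1))
```

Because old files are never modified, reading an earlier version or cloning it costs only metadata, as long as the files it references have not yet been purged.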
30. CONCLUSION
31. SNOWFLAKE DATA CLOUD
Worldwide Web of Data: Simplicity, Single System, Collaborative
32. THANK YOU
