Snowflake Data Cloud
1. SNOWFLAKE DATA CLOUD
SHENZHEN ARCHSUMMIT 2021
Jiaqi Yan
Principal Software Engineer
© 2020 Snowflake Inc. All Rights Reserved
2. QUICK HISTORY
August 2012 – Founded
January 2013 – Founding Team
2015 – GA
2018 – Azure port
2020 – GCP port
2020 – “The Data Cloud” launched and IPO
3,500+ active customers, 2,000+ employees
1B+ queries daily
300+ PB total storage (compressed), biggest table 68 trillion rows
1.2k data providers
NPS of 71, versus an industry average of 21
3. WHY SNOWFLAKE?
No Good Solution to Tackle Modern Data Challenges
More Data: Human Generated, Structured ➔ Machine Generated, Semi-Structured
More Users: Few Analysts ➔ Everyone
Faster Answers: Daily or Less Frequent ➔ Real-Time, Interactive
Diverse Data Sources: Few, mostly internal ➔ Many, both internal and external (suppliers, customers, competitors, public, …)
4. SNOWFLAKE DATA CLOUD
Data Analytics Platform Built for the Cloud Era
UNLIMITED SCALE
• All Data
• All Users
• Instant Elasticity
• One System
COLLABORATIVE
• Data Sharing
• Data Services
• Existing Content
- Multi-region
- Multi-cloud
• No Compromise
SIMPLICITY
• Self-Managed Service
• No Tuning Knobs
• Democratize Data Analytics
5. UNLIMITED SCALE
6. TRADITIONAL DATABASE ARCHITECTURES
Limited Scalability, Not Elastic
Shared-nothing
• Distributed Storage
• Single Cluster
• Adopted by Gamma, Teradata, Redshift, Vertica, Netezza, …
Shared-disk
• Centralized Storage
• Single Cluster
• Adopted by Oracle, Hadoop
7. SNOWFLAKE REGION ARCHITECTURE
Multi-cluster, Shared-data
[Diagram: Client(s) (ODBC, JDBC, Web UI, Python, NodeJS, Spark, …) connect over REST to Cloud Services (authentication & access control, infrastructure manager, optimizer, transaction manager, security, metadata). Below Cloud Services sit the compute/storage layers: multiple Virtual Warehouses sharing one Cloud Object Store.]
8. STORAGE TIER
● Immutable Storage (Cloud Object Store)
○ Each table is automatically partitioned horizontally
○ Partition size is kept very small, generally 16MB
○ Each partition is backed by an immutable file
○ Partitions are columnar organized, compressed, encrypted
○ Partitions are the unit of change for transactions
● Semi-structured
○ Variant data type used to store schemaless semi-structured data
○ Automatic columnarization of semi-structured attributes
● Partition Metadata
○ Out-of-box, metadata is automatically stored for all columns/sub-columns in a partition
○ Leverage that metadata to perform partition pruning
○ Re-clustering service to improve pruning
○ Track all table mutations to provide full ACID support
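The pruning idea in the Partition Metadata bullets can be sketched as follows. This is a minimal illustration, not Snowflake's implementation: it assumes the stored metadata is a min/max range per column, and the names `PartitionMeta`, `prune`, and the `part-00x` file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionMeta:
    """Hypothetical per-partition metadata: min/max of a single column."""
    file_name: str
    min_value: int
    max_value: int


def prune(partitions, low, high):
    """Keep only partitions whose [min, max] range overlaps [low, high].

    Partitions that cannot contain a matching row are skipped without
    reading the underlying file.
    """
    return [p for p in partitions
            if p.max_value >= low and p.min_value <= high]


partitions = [
    PartitionMeta("part-001", 1, 100),
    PartitionMeta("part-002", 101, 200),
    PartitionMeta("part-003", 150, 300),
    PartitionMeta("part-004", 301, 400),
]

# Predicate equivalent to: WHERE col BETWEEN 180 AND 250
survivors = prune(partitions, 180, 250)
print([p.file_name for p in survivors])  # ['part-002', 'part-003']
```

Overlapping ranges (part-002 and part-003 above) are what a re-clustering step would try to reduce: the less the ranges overlap, the more partitions a predicate can skip.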
9. COMPUTE TIER
● Virtual warehouse
○ Snowflake entity used to manage the set of compute resources used by a workload
○ Made of one or more compute clusters
○ Cluster sizes range from one to several hundred nodes
○ Workloads are fully isolated from each other
● Just-in-time Compute
○ Sub-second auto-resume when the associated workload starts
○ Online resize up and down possible while the workload runs
○ Auto-suspend when the workload is no longer running
○ Snowflake charges per second of compute resources used
➔ FAST is free!
● Partition Cache
○ Node-local memory and SSD storage used to cache partitions
○ Only columns/sub-columns which are accessed are cached
○ Highly available, fully stateless
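The "FAST is free" point follows from per-second billing, and a small calculation makes it concrete. The sketch below rests on two assumptions that are not in the slides: a made-up price (`RATE`) and a workload that speeds up linearly with node count. Under those assumptions, doubling the cluster halves the runtime and leaves the bill unchanged.

```python
def compute_cost(nodes, runtime_seconds, rate_per_node_second):
    """Cost under per-second billing: pay only for node-seconds used."""
    return nodes * runtime_seconds * rate_per_node_second


RATE = 0.001  # hypothetical price per node-second, for illustration only

# Same workload on two cluster sizes, assuming perfectly linear speedup.
small = compute_cost(nodes=4, runtime_seconds=600, rate_per_node_second=RATE)
large = compute_cost(nodes=8, runtime_seconds=300, rate_per_node_second=RATE)

print(f"4 nodes for 10 min: {small:.2f}")
print(f"8 nodes for  5 min: {large:.2f}")
```

Real workloads rarely scale perfectly, so the larger cluster may cost somewhat more in practice. The example only shows why billing by the second removes the penalty for sizing up.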
10. CLOUD SERVICES
● Control Plane of a Snowflake Region
[Diagram: Cloud Services components: REST, authentication & access control, infrastructure manager, optimizer, transaction manager, security, metadata]
○ Connection Management
○ Infrastructure Provisioning and Management
○ Metadata storage (using FoundationDB, FDB) & management
○ Query planning and optimization
○ Transaction management
○ Security management
● Self-managed
○ Self-upgrade of both software and hardware
○ Self-healing: replacement of failed servers and transparent re-execution of any failed queries
○ Highly available over multiple availability zones
○ Stateless: persistent sessions for load-balancing and transparent fail-over
11. MULTIPLE WORKLOAD TYPES
SNOWFLAKE DATA CLOUD
One Integrated Platform Supporting Multiple Workload Types
Complete SQL
ACID
Low-latency
High-concurrency
UDFs, UDTs
Data Governance
Stored Procedures
Streaming Ingest
Tasks
Table Streams
External Functions
Data Pipelines
Semi-structured Data
Unstructured Data
External Tables
Java/Scala/Python
Data Frames
REST APIs
Real-time
12. SNOWFLAKE DATA CLOUD REGION
13. SNOWFLAKE DATA CLOUD (2015)
Single Data Cloud Region (AWS)
First Snowflake Region
AWS-US-WEST
Snowflake Region (AWS)
14. SNOWFLAKE DATA CLOUD (2021)
22 Data Cloud Regions (10 countries, 3 clouds)
GLOBAL
DATA MESH
Snowflake Region (AWS)
Snowflake Region (Azure)
Snowflake Region (GCP)
15. BUILDING A LARGE-SCALE GLOBAL SERVICE
Lessons Learned
Way harder than anticipated…
• Customers expect three or more 9’s of availability, 24x7
• At large scale, anything that can happen will happen, so we need to anticipate and defend proactively
• Everything needs to be fully automated and fully adaptive
• Prefer self-management over dev-ops automation as much as possible
• Keeping up with exponential growth ➔ scale cloud services and remove bottlenecks
• Weekly release without introducing (visible) regressions
… but so much faster development cycles
• We have built a top-notch, feature-rich platform in only a few years!
• Weekly release worldwide with a single version to maintain
• Virtuous cycle – data-driven development to identify and prioritize feature development
  • For example, we focus on improving DMLs and transaction processing, since these dominate usage
• The Snowflake platform is extensively instrumented ➔ we generate many terabytes of service data daily
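To put the availability expectation above in concrete terms, the short calculation below converts a number of nines into a yearly downtime budget. It is plain arithmetic, not anything taken from the talk, and the function name `downtime_budget_minutes` is illustrative.

```python
def downtime_budget_minutes(nines, period_days=365):
    """Allowed downtime, in minutes, for an availability of `nines` nines.

    Three nines means 99.9% availability, i.e. 0.1% unavailability.
    """
    unavailability = 10 ** (-nines)
    return period_days * 24 * 60 * unavailability


for n in (3, 4, 5):
    minutes = downtime_budget_minutes(n)
    print(f"{n} nines: {minutes:.2f} minutes of downtime per year")
```

Three nines leaves roughly 8.76 hours per year for every incident and every release combined, which is why the slides stress automation and weekly releases without visible regressions.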
16. COLLABORATIVE
17. DATA COLLABORATION
Traditional Way
Data providers
1. Export data to files
2. Publish schema
3. Stage files for transport
Data customers
1. Additional infrastructure
2. Forced to recreate data structure
3. Delayed updates to data
• Redundant
• Inflexible
• Inefficient
• Insecure
• Expensive
18. SNOWFLAKE DATABASE SHARING
Provider Account
CREATE SHARE SH1;
GRANT … TO SHARE SH1 ….;
Consumer Account(s)
CREATE DATABASE DB1 FROM SHARE SH1;
SELECT … FROM …;
Execute … SP; Cross-database/account join
Warehouse(s)
Sharing code
19. SNOWFLAKE DATA MARKETPLACE
READY TO USE DATABASES FROM MULTIPLE PROVIDERS
Live, ready-to-query data; no copying or moving
Only data marketplace with personalized data
Globally available, across clouds
Financial
Marketing
Demographic
Macroeconomic
Government
Healthcare
Business
20. CONNECTED THROUGH DATA
INDUSTRY
Covid-19 Database
(from Starschema)
Adtech / Marketing
Energy / Utilities
Financial Services
Healthcare
Hospitality
Manufacturing
Media / Entertainment
Other
Retail
Technology
21. SNOWFLAKE DATABASE SHARING
Conclusion
Secure
Live
Frictionless
Personalized
Global
22. SIMPLICITY
23. WHY SIMPLICITY MATTERS
Manage Data, Not Infrastructure!
Infrastructure: Initial Setup, Upgrading, Patching, Capacity Planning, Storage, Security
Physical Design: Partitioning, Indexing, Ordering, Vacuuming
Data Collaboration: Loading, Moving, Transforming, Copying, Securing
Query Tuning: Statistic Collection, Memory Management, Parallelism, Query Plan Hinting, Workload Management
Availability: Setup High Availability, Handle Hardware Faults, Manage Backups
24. SNOWFLAKE CLOUD DATA PLATFORM
Minimal Administration
Infrastructure: Initial Setup, Upgrading, Patching, Capacity Planning, Storage, Security
Physical Design: Partitioning, Indexing, Ordering, Vacuuming
Data Collaboration: Loading, Moving, Transforming, Copying, Securing
Query Tuning: Statistic Collection, Memory Management, Parallelism, Query Plan Hinting, Workload Management
Availability: Replication, Backups, Re-Clustering, Account Management
Simply load/share data and run queries
25. Infrastructure
Initial Setup: NA – service is always on
Upgrading, Patching: Automatic – performed weekly by Cloud Services
Capacity Planning: NA – just-in-time compute
Storage: NA – unlimited storage, spill to blob storage
Security: Automatic – encryption, monitoring, ...
26. Physical Design
Partitioning: NA – automatic at load time
Ordering: Default/Automatic Clustering
Indexing: Search optimization service
Vacuuming: NA – immutable partitions
27. Data Collaboration
Loading, Moving, Transforming, Copying, Securing: Live and Secure Data Sharing
28. Query Tuning
Statistic Collection: Automatic – at DML time
Memory Management: Automatic – Cooperative Memory Brokering
Parallelism: Automatic – Adaptive
Query Plan Hinting: Robust adaptive execution strategy (dynamic join filters, adaptive push down and distribution methods, join skew resilience)
Workload Management: Virtual warehouse per workload; auto-scale multi-cluster warehouse
29. Availability
Setup High Availability: Out-of-box – Snowflake architecture is multi-AZ; Disaster Recovery – cross-region replication
Handle Hardware Faults: Automatic – Snowflake Cloud Services detects and replaces faulty hardware
Backups: Automatic – blob storage with 11 9’s durability, Undrop, Clone as-of, time travel, Fail-Safe
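Immutable partition files are what make features such as time travel and clone as-of cheap. The toy model below is a sketch under stated assumptions, not Snowflake's implementation: it treats a table version as an immutable set of partition file names, and the `Table` class and its methods are hypothetical.

```python
class Table:
    """Toy table whose versions are immutable sets of partition file names."""

    def __init__(self):
        self._versions = [frozenset()]  # version 0: empty table

    def commit(self, added=(), removed=()):
        """Record a new version; existing files are never modified in place."""
        current = self._versions[-1]
        self._versions.append((current - frozenset(removed)) | frozenset(added))
        return len(self._versions) - 1  # new version number

    def partitions(self, as_of=None):
        """Partition set at a given version (latest when as_of is None)."""
        version = len(self._versions) - 1 if as_of is None else as_of
        return self._versions[version]

    def clone(self, as_of=None):
        """Zero-copy clone: a new table that references the same files."""
        copy = Table()
        copy._versions = [self.partitions(as_of)]
        return copy


t = Table()
v1 = t.commit(added=["p1", "p2"])            # initial load
v2 = t.commit(added=["p3"], removed=["p1"])  # an update rewrites p1 as p3

print(sorted(t.partitions()))                # latest version
print(sorted(t.partitions(as_of=v1)))        # time travel to v1
print(sorted(t.clone(as_of=v1).partitions()))  # clone as-of v1
```

Because a commit only adds a new set of file references, reading an older version or cloning it copies no data. Old files only need to be retained for as long as some version still points at them.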
30. CONCLUSION
31. SNOWFLAKE DATA CLOUD
Worldwide Web of Data
Simplicity
Single System
Collaborative
32. THANK YOU