Elasticsearch Sizing and Capacity Planning
1. Elasticsearch Sizing and Capacity Planning
Dave Moore, Principal Solutions Architect
November 2019
2. Let’s make sizing simple!
Dave Moore
Principal Solutions Architect
3. Housekeeping & Logistics
• Attendees are automatically muted when joining Zoom
• Q+A will be at the end of the webinar
• Ask questions for us in the Zoom chat during the webinar
◦ Chat settings to: All panelists and attendees
◦ Ask more questions on our discuss forum: discuss.elastic.co
• Recording will be available after the webinar and emailed to all registrants
4. Why Elastic?
5. Why Elastic?
SCALE: Distributed by design
SPEED: Find matches in milliseconds
RELEVANCE: Get highly relevant results
6. Search is a Foundation
7. Elastic Stack
SOLUTIONS
Elastic Stack
• Kibana: Visualize & Manage
• Elasticsearch: Store, Search, & Analyze
• Beats & Logstash: Ingest
SaaS
• Elastic Cloud
SELF-MANAGED
• Elastic Cloud Enterprise
• Standalone
8. Elastic Deployment Models
Elastic Managed + Orchestration: Elastic Cloud (Elasticsearch Service). The official fully managed Elastic Stack solution. Available on AWS, GCP, and Azure.
Self-Managed + Orchestration: Elastic Cloud Enterprise. Orchestrate the Elastic Stack on your infrastructure. Deploy anywhere.
Self-Managed: Download and administer the Elastic Stack on your infrastructure. Deploy anywhere.
9. Webinar Overview
10. Overview
Let’s master the art of capacity planning for Elasticsearch.
Elasticsearch is the heart of the Elastic Stack.
Any production deployment of the Elastic Stack should be guided by capacity planning for Elasticsearch.
Whether you use it for logs, metrics, traces, or search, and whether you run it yourself or in our cloud,
you need to plan the infrastructure and configuration of Elasticsearch to ensure the health and
performance of your deployment.
11. Overview
Let’s master the art of capacity planning for Elasticsearch.
Webinar Goals
Capacity planning is about estimating the type and amount of resources required to operate
an Elasticsearch deployment. By the end of this webinar you will know:
• Basic computing resources
• Architecture, behaviors, and resource demands of Elasticsearch
• Methodologies to estimate the requirements of an Elasticsearch deployment
12. Preface
Computing Resources
13. The Four Basic Computing Resources
Storage: Where data persists (e.g. words in a book)
Memory: Where data is buffered (e.g. words you read)
Compute: Where data is processed (e.g. analyzing the words)
Network: Where data is transferred (e.g. speaking the words)
Computing Resources
14. Storage Resources
Storage: Where data persists (e.g. words in a book)
Storage Media
• Solid State Drives (SSDs) offer the best performance for “hot” workloads.
• Hard Disk Drives (HDDs) are economical for “warm” and “frozen” storage.
• RAID 0 can improve performance. RAID is optional, as Elasticsearch defaults to N+1 shard replication. Standard performant RAID configurations are acceptable for hardware-level high availability (e.g. RAID 1/10/50).
Storage Attachment
Recommendations
• Direct Attached Storage (DAS)
• Storage Area Network (SAN)
• Hyperconverged
(Recommended minimum ~3Gb/s, 250Mb/s)
Avoid
• Network Attached Storage (NAS), e.g. SMB, NFS, AFP. Network protocol overhead, latency, and costly storage abstraction layers make NAS a poor choice for Elasticsearch.
15. Memory Resources
Memory: Where data is buffered (e.g. words you read)
How Elasticsearch Uses Memory
JVM Heap
Stores metadata about the cluster, indices, shards, segments, and fielddata. This should be 50% of available RAM, up to a maximum of ~30GB, to avoid long garbage collection pauses.
OS Cache
Elasticsearch will use the remainder of available memory to cache data, improving performance dramatically by avoiding disk reads during full-text search, aggregations on doc values, and sorts.
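The heap rule above can be sketched as a quick calculation. This is a sketch only; `recommended_heap_gb` is a hypothetical helper name, and the 50%/30GB figures are the rules of thumb from this slide:

```python
def recommended_heap_gb(ram_gb: float) -> float:
    """Give the JVM heap 50% of available RAM, capped at ~30GB to avoid
    long garbage collection pauses; the remaining RAM is left to the
    OS cache for full-text search, doc values, and sorts."""
    return min(ram_gb / 2, 30.0)

print(recommended_heap_gb(64))  # 30.0 (leaving 34GB for the OS cache)
print(recommended_heap_gb(16))  # 8.0
```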
16. Compute Resources
Compute: Where data is processed (e.g. analyzing the words)
How Elasticsearch Uses Compute
Elasticsearch processes data in many ways that can be computationally expensive. Elasticsearch nodes have thread pools and thread queues that utilize the available compute resources. The quantity and performance of CPU cores govern the average speed and peak throughput of data operations in Elasticsearch.
17. Network Resources
Network: Where data is transferred (e.g. speaking the words)
How Elasticsearch Uses Network
Bandwidth is rarely a resource that constrains Elasticsearch. For very large deployments, the amount of data transfer for ingest, search, or replication between nodes can cause network saturation. In these cases, network connectivity can be upgraded to higher speeds, or the Elastic deployment can be split into two or more clusters and then searched as a single logical unit using cross-cluster search (CCS).
18. Elasticsearch Architecture
19. Terminology
Cluster: A group of nodes that work together to operate Elasticsearch.
Node: A Java process that runs the Elasticsearch software.
Index: A group of shards that form a logical data store.
Shard: A Lucene index that stores and processes a portion of an Elasticsearch index.
Segment: A Lucene segment that immutably stores a portion of a Lucene index.
Document: A record that is submitted to and retrieved from an Elasticsearch index.
Elasticsearch Architecture
20. Architecture
[Diagram: An Elasticsearch cluster served by the Elastic Stack. Beats and Logstash send data in; Kibana visualizes it. The cluster contains master, coordinator, ingest, machine learning, and data nodes. Each data node stores primary and replica shards (e.g. Index 1 Shard 1, Index 2 Replica 1) under /var/lib/elasticsearch/data, and each shard is composed of Lucene segments.]
21. Nodes
Role              Description                                   Storage  Memory   Compute  Network
Data              Indexes, stores, and searches data            Extreme  High     High     Medium
Master            Manages cluster state                         Low      Low      Low      Low
Ingest            Transforms inbound data                       Low      Medium   High     Medium
Machine Learning  Processes machine learning models             Low      Extreme  Extreme  Medium
Coordinator       Delegates requests and merges search results  Low      Medium   Medium   Medium
22. Elasticsearch Data Operations
23. The Four Basic Data Operations
There are four basic data operations in Elasticsearch. Each operation has its own resource demands.
Every use case makes use of these operations, but they will favor some operations over others.
Index: Processing a document and storing it in an index for future retrieval.
Delete: Removing a document from an index.
Update: Removing a document and indexing a replacement document.
Search: Retrieving one or more documents or aggregates from one or more indices.
Elasticsearch Data Operations
24. Index Operations: Data Processing Flow
[Diagram: A client PUTs a document to a coordinator node. If an ingest pipeline applies, the document first passes through an ingest node. The coordinator routes the document to the data node holding the primary shard, where text fields are parsed and analyzed (compute), the document is buffered in memory, flushed to a Lucene segment, and committed to storage under /var/lib/elasticsearch/data. The operation is then replicated over the network to the data node holding the replica shard, which repeats the same steps.]
25. Delete Operations: Data Processing Flow
[Diagram: A client sends a DELETE to a coordinator node, which routes it to the data node holding the primary shard. The document is marked as deleted in its segment, the change is buffered, flushed, and committed to storage, and the delete is replicated to the data node holding the replica shard.]
26. Update Operations: Data Processing Flow
Documents are immutable in Elasticsearch. When Elasticsearch updates a document, it deletes the original document and indexes the new, updated document. The two operations are performed atomically in each Lucene shard. This incurs the costs of a delete and an index operation, except that it does not invoke any ingest pipelines.
Update = Delete + (Index - Ingest Pipeline)
27. Search Operations
“Search” is a generic term for information retrieval. Elasticsearch has various retrieval capabilities,
including but not limited to full-text searches, range searches, scripted searches, and aggregations. Search
speed and throughput are affected by many variables including the configurations of the cluster, indices,
queries, and hardware. Realistic capacity planning depends on empirical testing after applying the best
practices for optimizing those configurations.
Elasticsearch executes searches in phases known informally as scatter, search, gather, and merge.
28. Search Operations: Data Processing Flow
[Diagram: A client GET arrives at a coordinator node, which routes the query to the data nodes holding the relevant shards (scatter). Each data node parses the query, analyzes any full-text terms, and searches the Lucene segments of its shards (search), drawing on memory and storage. The per-shard results return to the coordinator over the network (gather), which merges them into a single response (merge).]
29. Use Cases
There are a few conventional usage patterns of Elasticsearch. Each favors one of the basic operations.
Index Heavy: Use cases that favor index operations (Logging, Metrics, Security, APM)
Search Heavy: Use cases that favor search operations (App Search, Site Search, Analytics)
Update Heavy: Use cases that favor update operations (Caching, Systems of Record)
Hybrid: Use cases that favor multiple operations (Transactions Search)
We will review the sizing methodologies for these use cases later in the webinar.
30. Elasticsearch Indexing Behaviors
31. Overview
The following processes are applied to data on ingest.
JSON Conversion: Data can be larger or smaller on disk due to the format it is stored in.
Indexing: Data can be processed and indexed in various structures.
Compression: Data can be compressed for greater storage efficiency.
Replication: Data can be copied for greater fault tolerance and search throughput.
Elasticsearch Indexing Behaviors
32. JSON Conversion
A Verbose Syntax
Elasticsearch stores the original document in the _source field in JSON format. JSON is more verbose than common delimited formats such as CSV, because each value is paired with the name of the field. The size of a log record from a delimited file could double or more. By contrast, JSON is less verbose than some formats such as XML.
Original (47 bytes):
2018-02-14T12:30:45 192.168.1.1 200 /index.html
JSON (89 bytes):
{
  "timestamp": "2018-02-14T12:30:45",
  "ip": "192.168.1.1",
  "response": 200,
  "url": "/index.html"
}
It’s Optional
Logging use cases require the _source field to return the source of truth for an event. Metrics use cases can discard the _source field because analysis is always done on aggregations of indexed fields, with no single record being important to look at.
33. Indexing
Data Structures
Elasticsearch indexes values in various data structures. Each data type has its own storage characteristics.
Many Ways to Index
Some values can be indexed in multiple ways. String values are often indexed twice: once as a keyword for aggregations and once as text for full-text search. Values prone to error and ambiguity, such as names and addresses, can be indexed in multiple ways to support different search strategies.
Original (4 values):
2018-02-14T12:30:45 192.168.1.1 200 /index.html
Indexed (6 values):
date     2018-02-14T12:30:45
keyword  192.168.1.1
text     1:2 168:1 192:1
integer  200
keyword  /index.html
text     index:1 html:1
34. Compression
Elasticsearch can compress data using one of two different compression algorithms: LZ4 (the default) and
DEFLATE, which saves up to 15% additional space at the expense of added compute time compared to
LZ4. Typically Elasticsearch can compress data by 20 – 30%.
35. Shard Replication
Storage
Elasticsearch can replicate shards once or multiple times across data nodes to improve fault tolerance
and search throughput. Each replica shard is a full copy of its primary shard.
Index and Search Throughput
Logging and metrics use cases typically have one replica shard, which is the minimum to ensure fault
tolerance while minimizing the number of writes. Search use cases often have more replica shards to
increase search throughput.
36. Complete Example
What you sent:
2018-02-14T12:30:45 192.168.1.1 200 /index.html
What was stored (on the primary and on each of the n replicas, then compressed):
_source: {"timestamp":"2018-02-14T12:30:45","ip":"192.168.1.1","response":200,"url":"/index.html"}
Indexed values: 2018-02-14T12:30:45 | 192.168.1.1 | 1:2 168:1 192:1 | 200 | /index.html | index:1 html:1
37. Elasticsearch Sizing Methodologies
38. Sizing Methodologies
There are two basic sizing methodologies that span the major use cases of Elasticsearch.
Volume: Estimating the storage and memory resources required to store the expected amount of data and shards for each tier of the cluster.
Throughput: Estimating the memory, compute, and network resources required to process the expected operations at the expected latencies and throughput, for each operation and for each tier of the cluster.
Elasticsearch Sizing Methodologies
39. Volume Sizing: Data Volume
Discovery Questions
• How much raw data (GB) will you index per day?
• How many days will you retain the data?
• What is the net expansion factor of the data? (JSON factor * Indexing factor * Compression factor)
• How many replica shards will you enforce?
• How much memory will you allocate per data node?
• What will be your memory:data ratio?
Constants
• Reserve +15% to stay under the disk watermarks.
• Reserve +5% for margin of error and background activities.
• Reserve the equivalent of a data node to handle failure.
Total Data (GB) = Raw data (GB) per day * Number of days retained * Net expansion factor * (Number of replicas + 1)
Total Storage (GB) = Total Data (GB) * (1 + 0.15 disk watermark threshold + 0.05 margin of error)
Total Data Nodes = ROUNDUP(Total Storage (GB) / Memory per data node / Memory:data ratio) + 1 data node for failover capacity
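The three formulas above translate directly into code. A minimal sketch (function and parameter names are mine, not from the deck; the memory:data ratio is passed as the data side of the ratio, e.g. 30 for 1:30):

```python
import math

def data_volume_sizing(raw_gb_per_day, days_retained, expansion_factor,
                       replicas, memory_per_node_gb, memory_data_ratio):
    # Total Data = raw * retention * net expansion * (replicas + 1)
    total_data_gb = raw_gb_per_day * days_retained * expansion_factor * (replicas + 1)
    # Add +15% disk watermark headroom and +5% margin of error
    total_storage_gb = total_data_gb * (1 + 0.15 + 0.05)
    # Spread storage across nodes, then add one node of failover capacity
    data_nodes = math.ceil(total_storage_gb / memory_per_node_gb / memory_data_ratio) + 1
    return total_data_gb, total_storage_gb, data_nodes

# 100GB/day raw, 30 days retained, 1.5x net expansion, 1 replica,
# on 64GB data nodes at a 1:30 memory:data ratio:
data, storage, nodes = data_volume_sizing(100, 30, 1.5, 1, 64, 30)
print(round(data), round(storage), nodes)  # 9000 10800 7
```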
40. Volume Sizing: Shard Volume
Discovery Questions
• How many index patterns will you create?
• How many primary and replica shards will you configure?
• At what time interval will you rotate the indices, if at all?
• How long will you retain the indices?
• How much memory will you allocate per data node?
Constants
• Do not exceed 20 shards per GB of JVM heap.
• Do not exceed 50GB per shard.
Tip: Collapse small daily indices into weekly or monthly indices to reduce shard count. Split large (>50GB) daily indices into hourly indices or increase the number of primary shards.
Total Shards = Number of index patterns * Number of primaries * (Number of replicas + 1) * Total intervals of retention
Total Data Nodes = ROUNDUP(Total Shards / (20 * Memory per data node))
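Those two formulas can likewise be sketched in code (names are mine; the 20-shard limit is applied per GB of node memory, as the slide's formula does):

```python
import math

def shard_volume_sizing(index_patterns, primaries, replicas,
                        retention_intervals, memory_per_node_gb):
    # Every rotation interval produces a full set of primaries and replicas
    total_shards = index_patterns * primaries * (replicas + 1) * retention_intervals
    # Apply the 20-shards-per-GB guideline to node memory
    data_nodes = math.ceil(total_shards / (20 * memory_per_node_gb))
    return total_shards, data_nodes

# 10 index patterns, 1 primary + 1 replica, 30 daily indices retained, 64GB nodes:
print(shard_volume_sizing(10, 1, 1, 30, 64))  # (600, 1)
```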
41. Throughput Sizing: Search Operations
Search use cases have targets for search response time and search throughput in addition to the
storage capacity. These targets can demand more memory and compute resources.
Too many variables affect search response time to predict how any given capacity plan will affect it. But by
empirically testing search response time and estimating the expected search throughput, we can estimate
the required resources of the cluster to meet those demands.
42. Throughput Sizing: Search Operations
Discovery Questions
• What is your peak number of searches per second?
• What is your average search response time in milliseconds?
• How many cores and threads per core are on your data nodes?
Theory of the Approach
Rather than determine how resources will affect search speed, treat search speed as a constant by measuring it on your planned hardware. Then determine how many cores are needed in the cluster to process the expected peak search throughput. Ultimately, the goal is to prevent the thread pool queues from growing faster than they are consumed. With insufficient compute resources, search requests risk being dropped.
Peak Threads = ROUNDUP(Peak searches per second * Average search response time in milliseconds / 1000 milliseconds)
Thread Pool Size = ROUNDUP((Physical cores per node * Threads per core * 3 / 2) + 1)
Total Data Nodes = ROUNDUP(Peak threads / Thread pool size)
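A sketch of the thread math above (the function name is mine; the formulas are the ones on this slide):

```python
import math

def search_throughput_sizing(peak_searches_per_sec, avg_response_ms,
                             cores_per_node, threads_per_core):
    # Threads busy at peak = arrival rate * time each search occupies a thread
    peak_threads = math.ceil(peak_searches_per_sec * avg_response_ms / 1000)
    # Search thread pool per node: (cores * threads per core * 3 / 2) + 1
    thread_pool_size = math.ceil(cores_per_node * threads_per_core * 3 / 2 + 1)
    data_nodes = math.ceil(peak_threads / thread_pool_size)
    return peak_threads, thread_pool_size, data_nodes

# 100 searches/s at 200ms average response, on 8-core, 2-thread/core nodes:
print(search_throughput_sizing(100, 200, 8, 2))  # (20, 25, 1)
```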
43. Hot, Warm, Frozen
Elasticsearch can use shard allocation awareness to allocate shards on specific hardware.
Index heavy use cases often use this to store indices on Hot, Warm, and Frozen tiers of hardware, and then schedule the migration of those indices from hot to warm to frozen to deleted or archived. This is an economical way to store lots of data while optimizing performance for more recent data. During capacity planning, each tier must be sized independently and then combined.
Tier    Goal                   Example Storage              Example Memory:Storage Ratio
Hot     Optimize for search    SSD DAS/SAN (>200Gb/s)       1:30
Warm    Optimize for storage   HDD DAS/SAN (~100Gb/s)       1:160
Frozen  Optimize for archives  Cheapest DAS/SAN (<100Gb/s)  1:1000+ (beware of recovery failures with this much data per node)
44. Dedicated Nodes
Elasticsearch nodes perform one or multiple roles. Often it makes sense to assign one role per node: you can optimize the hardware for each role and prevent nodes from competing for resources.
Master: Dedicated master nodes help ensure the stability of clusters by preventing other nodes from consuming any of their resources.
Ingest: Ingest nodes that run many pipelines or use many processors will demand extra compute resources.
Machine Learning: Machine learning nodes that run many jobs or use many splits, buckets, or complex aggregations will demand extra memory and compute resources.
Coordinator: Dedicated coordinating nodes can benefit hybrid use cases by offloading the merge phase of searches from data nodes that are constantly indexing.
45. Overall
A proper sizing takes the following steps:
1. For each applicable tier (Hot, Warm, Frozen), determine the largest of the following sizes:
   • Data volume
   • Shard volume
   • Indexing throughput
   • Search throughput
2. Combine the sizes of each tier.
3. Make decisions on any dedicated nodes: Master, Coordinator, Ingest, Machine Learning.
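Putting the methodologies together: a tier's node count is driven by whichever requirement is largest, and the tiers then add up. A toy illustration (all numbers hypothetical):

```python
def tier_data_nodes(volume_nodes, shard_nodes, index_nodes, search_nodes):
    # A tier must satisfy all four requirements at once,
    # so its size is the largest of the four estimates.
    return max(volume_nodes, shard_nodes, index_nodes, search_nodes)

hot = tier_data_nodes(volume_nodes=7, shard_nodes=1, index_nodes=3, search_nodes=2)
warm = tier_data_nodes(volume_nodes=4, shard_nodes=2, index_nodes=1, search_nodes=1)
print(hot, warm, hot + warm)  # 7 4 11
```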
46. Additional Resources
47. Empowering Your People
Elastic Training
• Immersive Learning: Lab-based exercises and knowledge checks to help master new skills
• Solution-based Curriculum: Real-world examples and common use cases
• Performance-based Certification: Apply practical knowledge to real-world use cases, in real time
• Experienced Instructors: Expertly trained and deeply rooted in everything Elastic
Levels: FOUNDATION, ADVANCED
Specializations: LOGGING, METRICS, APM, SEARCH, SECURITY, ANALYTICS, DATA SCIENCE
48. Elastic Consulting Services
Accelerating Your Project Success
• Phase-based Packages: Align to project milestones at any stage in your journey
• Flexible Scoping: Shift resources as your requirements change
• Expert Advisors: Understand your specific use cases
• Global Capability: Provide expert, trusted services worldwide
• Project Guidance: Ensure your goals are met and accelerate timelines
49. Q+A
Additional Resources
• Forums: https://discuss.elastic.co
• Cloud: https://www.elastic.co/products/elasticsearch/service
• Cloud Hardware: https://www.elastic.co/guide/en/cloud/current/ec-reference-hardware.html
• Products + Solutions: https://www.elastic.co/products