Elasticsearch Sizing and Capacity Planning
1. Elasticsearch Sizing and Capacity Planning
Dave Moore, Principal Solutions Architect
November 2019
2. Let’s make sizing simple!
Dave Moore
Principal Solutions Architect
3. Housekeeping & Logistics
• Attendees are automatically muted when joining Zoom
• Q+A will be at the end of the webinar
• Ask questions for us in the Zoom chat during the webinar
◦ Chat settings to: All panelists and attendees
◦ Ask more questions on our discuss forum: discuss.elastic.co
• Recording will be available after the webinar and emailed to all registrants
4. Why Elastic?
5. Why Elastic?
SCALE: Distributed by design
SPEED: Find matches in milliseconds
RELEVANCE: Get highly relevant results
6. Search is a Foundation
7. Elastic Stack
SOLUTIONS
Elastic Stack
• Kibana: Visualize & Manage
• Elasticsearch: Store, Search, & Analyze
• Beats & Logstash: Ingest
SaaS
• Elastic Cloud
SELF-MANAGED
• Elastic Cloud Enterprise
• Standalone
8. Elastic Deployment Models
Elastic Managed + Orchestration: Elastic Cloud (Elasticsearch Service). The official fully managed Elastic Stack solution. Available on AWS, GCP, and Azure.
Self-Managed + Orchestration: Elastic Cloud Enterprise. Orchestrate the Elastic Stack on your infrastructure. Deploy anywhere.
Self-Managed: Download and administer the Elastic Stack on your infrastructure. Deploy anywhere.
9. Webinar Overview
10. Overview
Let’s master the art of capacity planning for Elasticsearch.
Elasticsearch is the heart of the Elastic Stack.
Any production deployment of the Elastic Stack should be guided by capacity planning for Elasticsearch.
Whether you use it for logs, metrics, traces, or search, and whether you run it yourself or in our cloud,
you need to plan the infrastructure and configuration of Elasticsearch to ensure the health and
performance of your deployment.
11. Overview
Let’s master the art of capacity planning for Elasticsearch.
Webinar Goals
Capacity planning is about estimating the type and amount of resources required to operate
an Elasticsearch deployment. By the end of this webinar you will know:
• Basic computing resources
• Architecture, behaviors, and resource demands of Elasticsearch
• Methodologies to estimate the requirements of an Elasticsearch deployment
12. Preface
Computing Resources
13. The Four Basic Computing Resources
Storage: Where data persists (e.g. words in a book)
Memory: Where data is buffered (e.g. words you read)
Compute: Where data is processed (e.g. analyzing the words)
Network: Where data is transferred (e.g. speaking the words)
Computing Resources
14. Storage Resources
Storage: Where data persists (e.g. words in a book)
Storage Media
• Solid State Drives (SSDs) offer the best performance for “hot” workloads.
• Hard Disk Drives (HDDs) are economical for “warm” and “frozen” storage.
• RAID 0 can improve performance. RAID is optional, as Elasticsearch defaults to N+1 shard replication. Standard performant RAID configurations are acceptable for hardware-level high availability (e.g. RAID 1/10/50).
Storage Attachment
Recommendations
• Direct Attached Storage (DAS)
• Storage Area Network (SAN)
• Hyperconverged
(Recommended minimum ~3Gb/s, 250Mb/s)
Avoid
• Network Attached Storage (NAS), e.g. SMB, NFS, AFP. Network protocol overhead, latency, and costly storage abstraction layers make NAS a poor choice for Elasticsearch.
15. Memory Resources
Memory: Where data is buffered (e.g. words you read)
How Elasticsearch Uses Memory
JVM Heap
Stores metadata about the cluster, indices, shards, segments, and fielddata. This should be 50% of available RAM, up to a maximum of ~30GB, to avoid long garbage collection pauses.
OS Cache
Elasticsearch will use the remainder of available memory to cache data, improving performance dramatically by avoiding disk reads during full-text search, aggregations on doc values, and sorts.
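The heap rule above can be sketched as a quick calculation. This is a sketch only; `recommended_heap_gb` is a hypothetical helper name, and the 50%/30GB figures are the rules of thumb from this slide:

```python
def recommended_heap_gb(ram_gb: float) -> float:
    """Give the JVM heap 50% of available RAM, capped at ~30GB to avoid
    long garbage collection pauses; the remaining RAM is left to the
    OS cache for full-text search, doc values, and sorts."""
    return min(ram_gb / 2, 30.0)

print(recommended_heap_gb(64))  # 30.0 (leaving 34GB for the OS cache)
print(recommended_heap_gb(16))  # 8.0
```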
16. Compute Resources
Compute: Where data is processed (e.g. analyzing the words)
How Elasticsearch Uses Compute
Elasticsearch processes data in many ways that can be computationally expensive. Elasticsearch nodes have thread pools and thread queues that utilize the available compute resources. The quantity and performance of CPU cores govern the average speed and peak throughput of data operations in Elasticsearch.
17. Network Resources
Network: Where data is transferred (e.g. speaking the words)
How Elasticsearch Uses Network
Bandwidth is rarely a resource that constrains Elasticsearch. For very large deployments, the amount of data transfer for ingest, search, or replication between nodes can cause network saturation. In these cases, network connectivity can be upgraded to higher speeds, or the Elastic deployment can be split into two or more clusters and then searched as a single logical unit using cross-cluster search (CCS).
18. Elasticsearch Architecture
19. Terminology
Cluster: A group of nodes that work together to operate Elasticsearch.
Node: A Java process that runs the Elasticsearch software.
Index: A group of shards that form a logical data store.
Shard: A Lucene index that stores and processes a portion of an Elasticsearch index.
Segment: A Lucene segment that immutably stores a portion of a Lucene index.
Document: A record that is submitted to and retrieved from an Elasticsearch index.
Elasticsearch Architecture
20. Architecture
[Diagram: An Elasticsearch cluster served by the Elastic Stack. Beats and Logstash send data in; Kibana visualizes it. The cluster contains master, coordinator, ingest, machine learning, and data nodes. Each data node stores primary and replica shards (e.g. Index 1 Shard 1, Index 2 Replica 1) under /var/lib/elasticsearch/data, and each shard is composed of Lucene segments.]
21. Nodes
Role              Description                                   Storage  Memory   Compute  Network
Data              Indexes, stores, and searches data            Extreme  High     High     Medium
Master            Manages cluster state                         Low      Low      Low      Low
Ingest            Transforms inbound data                       Low      Medium   High     Medium
Machine Learning  Processes machine learning models             Low      Extreme  Extreme  Medium
Coordinator       Delegates requests and merges search results  Low      Medium   Medium   Medium
22. Elasticsearch Data Operations
23. The Four Basic Data Operations
There are four basic data operations in Elasticsearch. Each operation has its own resource demands.
Every use case makes use of these operations, but they will favor some operations over others.
Index: Processing a document and storing it in an index for future retrieval.
Delete: Removing a document from an index.
Update: Removing a document and indexing a replacement document.
Search: Retrieving one or more documents or aggregates from one or more indices.
Elasticsearch Data Operations
24. Index Operations: Data Processing Flow
[Diagram: A client PUTs a document to a coordinator node. If an ingest pipeline applies, the document first passes through an ingest node. The coordinator routes the document to the data node holding the primary shard, where text fields are parsed and analyzed (compute), the document is buffered in memory, flushed to a Lucene segment, and committed to storage under /var/lib/elasticsearch/data. The operation is then replicated over the network to the data node holding the replica shard, which repeats the same steps.]
25. Delete Operations: Data Processing Flow
[Diagram: A client sends a DELETE to a coordinator node, which routes it to the data node holding the primary shard. The document is marked as deleted in its segment, the change is buffered, flushed, and committed to storage, and the delete is replicated to the data node holding the replica shard.]
26. Update Operations: Data Processing Flow
Documents are immutable in Elasticsearch. When Elasticsearch updates a document, it deletes the original document and indexes the new, updated document. The two operations are performed atomically in each Lucene shard. This incurs the costs of a delete and an index operation, except that it does not invoke any ingest pipelines.
Update = Delete + (Index - Ingest Pipeline)
27. Search Operations
“Search” is a generic term for information retrieval. Elasticsearch has various retrieval capabilities,
including but not limited to full-text searches, range searches, scripted searches, and aggregations. Search
speed and throughput are affected by many variables including the configurations of the cluster, indices,
queries, and hardware. Realistic capacity planning depends on empirical testing after applying the best
practices for optimizing those configurations.
Elasticsearch executes searches in phases known informally as scatter, search, gather, and merge.
28. Search Operations: Data Processing Flow
[Diagram: A client GET arrives at a coordinator node, which routes the query to the data nodes holding the relevant shards (scatter). Each data node parses the query, analyzes any full-text terms, and searches the Lucene segments of its shards (search), drawing on memory and storage. The per-shard results return to the coordinator over the network (gather), which merges them into a single response (merge).]
29. Use Cases
There are a few conventional usage patterns of Elasticsearch. Each favors one of the basic operations.
Index Heavy: Use cases that favor index operations (Logging, Metrics, Security, APM)
Search Heavy: Use cases that favor search operations (App Search, Site Search, Analytics)
Update Heavy: Use cases that favor update operations (Caching, Systems of Record)
Hybrid: Use cases that favor multiple operations (Transactions Search)
We will review the sizing methodologies for these use cases later in the webinar.
30. Elasticsearch Indexing Behaviors
31. Overview
The following processes are applied to data on ingest.
JSON Conversion: Data can be larger or smaller on disk due to the format it is stored in.
Indexing: Data can be processed and indexed in various structures.
Compression: Data can be compressed for greater storage efficiency.
Replication: Data can be copied for greater fault tolerance and search throughput.
Elasticsearch Indexing Behaviors
32. JSON Conversion
A Verbose Syntax
Elasticsearch stores the original document in the _source field in JSON format. JSON is more verbose than common delimited formats such as CSV, because each value is paired with the name of the field. The size of a log record from a delimited file could double or more. By contrast, JSON is less verbose than some formats such as XML.
Original (47 bytes):
2018-02-14T12:30:45 192.168.1.1 200 /index.html
JSON (89 bytes):
{
  "timestamp": "2018-02-14T12:30:45",
  "ip": "192.168.1.1",
  "response": 200,
  "url": "/index.html"
}
It’s Optional
Logging use cases require the _source field to return the source of truth for an event. Metrics use cases can discard the _source field because analysis is always done on aggregations of indexed fields, with no single record being important to look at.
33. Indexing
Data Structures
Elasticsearch indexes values in various data structures. Each data type has its own storage characteristics.
Many Ways to Index
Some values can be indexed in multiple ways. String values are often indexed twice: once as a keyword for aggregations and once as text for full-text search. Values prone to error and ambiguity, such as names and addresses, can be indexed in multiple ways to support different search strategies.
Original (4 values):
2018-02-14T12:30:45 192.168.1.1 200 /index.html
Indexed (6 values):
date     2018-02-14T12:30:45
keyword  192.168.1.1
text     1:2 168:1 192:1
integer  200
keyword  /index.html
text     index:1 html:1
34. Compression
Elasticsearch can compress data using one of two different compression algorithms: LZ4 (the default) and
DEFLATE, which saves up to 15% additional space at the expense of added compute time compared to
LZ4. Typically Elasticsearch can compress data by 20 – 30%.
35. Shard Replication
Storage
Elasticsearch can replicate shards once or multiple times across data nodes to improve fault tolerance
and search throughput. Each replica shard is a full copy of its primary shard.
Index and Search Throughput
Logging and metrics use cases typically have one replica shard, which is the minimum to ensure fault
tolerance while minimizing the number of writes. Search use cases often have more replica shards to
increase search throughput.
36. Complete Example
What you sent:
2018-02-14T12:30:45 192.168.1.1 200 /index.html
What was stored (on the primary and on each of the n replicas, then compressed):
_source: {"timestamp":"2018-02-14T12:30:45","ip":"192.168.1.1","response":200,"url":"/index.html"}
Indexed values: 2018-02-14T12:30:45 | 192.168.1.1 | 1:2 168:1 192:1 | 200 | /index.html | index:1 html:1
37. Elasticsearch Sizing Methodologies
38. Sizing Methodologies
There are two basic sizing methodologies that span the major use cases of Elasticsearch.
Volume: Estimating the storage and memory resources required to store the expected amount of data and shards for each tier of the cluster.
Throughput: Estimating the memory, compute, and network resources required to process the expected operations at the expected latencies and throughput, for each operation and for each tier of the cluster.
Elasticsearch Sizing Methodologies
39. Volume Sizing: Data Volume
Discovery Questions
• How much raw data (GB) will you index per day?
• How many days will you retain the data?
• What is the net expansion factor of the data? (JSON factor * Indexing factor * Compression factor)
• How many replica shards will you enforce?
• How much memory will you allocate per data node?
• What will be your memory:data ratio?
Constants
• Reserve +15% to stay under the disk watermarks.
• Reserve +5% for margin of error and background activities.
• Reserve the equivalent of a data node to handle failure.
Total Data (GB) = Raw data (GB) per day * Number of days retained * Net expansion factor * (Number of replicas + 1)
Total Storage (GB) = Total Data (GB) * (1 + 0.15 disk watermark threshold + 0.05 margin of error)
Total Data Nodes = ROUNDUP(Total Storage (GB) / Memory per data node / Memory:data ratio) + 1 data node for failover capacity
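The three formulas above translate directly into code. A minimal sketch (function and parameter names are mine, not from the deck; the memory:data ratio is passed as the data side of the ratio, e.g. 30 for 1:30):

```python
import math

def data_volume_sizing(raw_gb_per_day, days_retained, expansion_factor,
                       replicas, memory_per_node_gb, memory_data_ratio):
    # Total Data = raw * retention * net expansion * (replicas + 1)
    total_data_gb = raw_gb_per_day * days_retained * expansion_factor * (replicas + 1)
    # Add +15% disk watermark headroom and +5% margin of error
    total_storage_gb = total_data_gb * (1 + 0.15 + 0.05)
    # Spread storage across nodes, then add one node of failover capacity
    data_nodes = math.ceil(total_storage_gb / memory_per_node_gb / memory_data_ratio) + 1
    return total_data_gb, total_storage_gb, data_nodes

# 100GB/day raw, 30 days retained, 1.5x net expansion, 1 replica,
# on 64GB data nodes at a 1:30 memory:data ratio:
data, storage, nodes = data_volume_sizing(100, 30, 1.5, 1, 64, 30)
print(round(data), round(storage), nodes)  # 9000 10800 7
```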
40. Volume Sizing: Shard Volume
Discovery Questions
• How many index patterns will you create?
• How many primary and replica shards will you configure?
• At what time interval will you rotate the indices, if at all?
• How long will you retain the indices?
• How much memory will you allocate per data node?
Constants
• Do not exceed 20 shards per GB of JVM heap.
• Do not exceed 50GB per shard.
Tip: Collapse small daily indices into weekly or monthly indices to reduce shard count. Split large (>50GB) daily indices into hourly indices or increase the number of primary shards.
Total Shards = Number of index patterns * Number of primaries * (Number of replicas + 1) * Total intervals of retention
Total Data Nodes = ROUNDUP(Total Shards / (20 * Memory per data node))
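Those two formulas can likewise be sketched in code (names are mine; the 20-shard limit is applied per GB of node memory, as the slide's formula does):

```python
import math

def shard_volume_sizing(index_patterns, primaries, replicas,
                        retention_intervals, memory_per_node_gb):
    # Every rotation interval produces a full set of primaries and replicas
    total_shards = index_patterns * primaries * (replicas + 1) * retention_intervals
    # Apply the 20-shards-per-GB guideline to node memory
    data_nodes = math.ceil(total_shards / (20 * memory_per_node_gb))
    return total_shards, data_nodes

# 10 index patterns, 1 primary + 1 replica, 30 daily indices retained, 64GB nodes:
print(shard_volume_sizing(10, 1, 1, 30, 64))  # (600, 1)
```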
41. Throughput Sizing: Search Operations
Search use cases have targets for search response time and search throughput in addition to the
storage capacity. These targets can demand more memory and compute resources.
Too many variables affect search response time to predict how any given capacity plan will affect it. But by
empirically testing search response time and estimating the expected search throughput, we can estimate
the required resources of the cluster to meet those demands.
42. Throughput Sizing: Search Operations
Discovery Questions
• What is your peak number of searches per second?
• What is your average search response time in milliseconds?
• How many cores and threads per core are on your data nodes?
Theory of the Approach
Rather than determine how resources will affect search speed, treat search speed as a constant by measuring it on your planned hardware. Then determine how many cores are needed in the cluster to process the expected peak search throughput. Ultimately, the goal is to prevent the thread pool queues from growing faster than they are consumed. With insufficient compute resources, search requests risk being dropped.
Peak Threads = ROUNDUP(Peak searches per second * Average search response time in milliseconds / 1000 milliseconds)
Thread Pool Size = ROUNDUP((Physical cores per node * Threads per core * 3 / 2) + 1)
Total Data Nodes = ROUNDUP(Peak threads / Thread pool size)
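A sketch of the thread math above (the function name is mine; the formulas are the ones on this slide):

```python
import math

def search_throughput_sizing(peak_searches_per_sec, avg_response_ms,
                             cores_per_node, threads_per_core):
    # Threads busy at peak = arrival rate * time each search occupies a thread
    peak_threads = math.ceil(peak_searches_per_sec * avg_response_ms / 1000)
    # Search thread pool per node: (cores * threads per core * 3 / 2) + 1
    thread_pool_size = math.ceil(cores_per_node * threads_per_core * 3 / 2 + 1)
    data_nodes = math.ceil(peak_threads / thread_pool_size)
    return peak_threads, thread_pool_size, data_nodes

# 100 searches/s at 200ms average response, on 8-core, 2-thread/core nodes:
print(search_throughput_sizing(100, 200, 8, 2))  # (20, 25, 1)
```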
43. Hot, Warm, Frozen
Elasticsearch can use shard allocation awareness to allocate shards on specific hardware.
Index heavy use cases often use this to store indices on Hot, Warm, and Frozen tiers of hardware, and then schedule the migration of those indices from hot to warm to frozen to deleted or archived. This is an economical way to store lots of data while optimizing performance for more recent data. During capacity planning, each tier must be sized independently and then combined.
Tier    Goal                   Example Storage              Example Memory:Storage Ratio
Hot     Optimize for search    SSD DAS/SAN (>200Gb/s)       1:30
Warm    Optimize for storage   HDD DAS/SAN (~100Gb/s)       1:160
Frozen  Optimize for archives  Cheapest DAS/SAN (<100Gb/s)  1:1000+ (beware of recovery failures with this much data per node)
44. Dedicated Nodes
Elasticsearch nodes perform one or multiple roles. Often it makes sense to assign one role per node: you can optimize the hardware for each role and prevent nodes from competing for resources.
Master: Dedicated master nodes help ensure the stability of clusters by preventing other nodes from consuming any of their resources.
Ingest: Ingest nodes that run many pipelines or use many processors will demand extra compute resources.
Machine Learning: Machine learning nodes that run many jobs or use many splits, buckets, or complex aggregations will demand extra memory and compute resources.
Coordinator: Dedicated coordinating nodes can benefit hybrid use cases by offloading the merge phase of searches from data nodes that are constantly indexing.
45. Overall
A proper sizing takes the following steps:
1. For each applicable tier (Hot, Warm, Frozen), determine the largest of the following sizes:
   • Data volume
   • Shard volume
   • Indexing throughput
   • Search throughput
2. Combine the sizes of each tier.
3. Make decisions on any dedicated nodes: Master, Coordinator, Ingest, Machine Learning.
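Putting the methodologies together: a tier's node count is driven by whichever requirement is largest, and the tiers then add up. A toy illustration (all numbers hypothetical):

```python
def tier_data_nodes(volume_nodes, shard_nodes, index_nodes, search_nodes):
    # A tier must satisfy all four requirements at once,
    # so its size is the largest of the four estimates.
    return max(volume_nodes, shard_nodes, index_nodes, search_nodes)

hot = tier_data_nodes(volume_nodes=7, shard_nodes=1, index_nodes=3, search_nodes=2)
warm = tier_data_nodes(volume_nodes=4, shard_nodes=2, index_nodes=1, search_nodes=1)
print(hot, warm, hot + warm)  # 7 4 11
```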
46. Additional Resources
47. Empowering Your People
Elastic Training
• Immersive Learning: Lab-based exercises and knowledge checks to help master new skills
• Solution-based Curriculum: Real-world examples and common use cases
• Performance-based Certification: Apply practical knowledge to real-world use cases, in real time
• Experienced Instructors: Expertly trained and deeply rooted in everything Elastic
Levels: FOUNDATION, ADVANCED
Specializations: LOGGING, METRICS, APM, SEARCH, SECURITY, ANALYTICS, DATA SCIENCE
48. Elastic Consulting Services
Accelerating Your Project Success
• Phase-based Packages: Align to project milestones at any stage in your journey
• Flexible Scoping: Shift resources as your requirements change
• Expert Advisors: Understand your specific use cases
• Global Capability: Provide expert, trusted services worldwide
• Project Guidance: Ensure your goals are met and accelerate timelines
49. Q+A
Additional Resources
• Forums: https://discuss.elastic.co
• Cloud: https://www.elastic.co/products/elasticsearch/service
• Cloud Hardware: https://www.elastic.co/guide/en/cloud/current/ec-reference-hardware.html
• Products + Solutions: https://www.elastic.co/products