The Kubernetes Storage Layer
2. The Kubernetes Storage Layer:
Peeling The Onion Minus The Tears
Madhav Jivrajani, VMware
3. $ whoami
● Work @ VMware
● Do work in API Machinery, Scalability, Architecture and ContribEx
● TL for SIG ContribEx and GitHub Admin of the project
4. Before We Start…
5. 🚨 Help migrate Prow jobs to community clusters!
See https://github.com/kubernetes/test-infra/issues/29722 for details.
6. Prelude
A 50,000 ft. view of how the Kubernetes “machine” works.
31. List + Watch
“Kubernetes is a declarative, event-driven system.”
33. List
“Kubernetes is a declarative, event-driven system.”
We specify intent.
❯ kubectl apply -f 3-replica-deployment.yaml
36. List
● We need to start somewhere: in order to take actions, we need to know what the “current state” looks like.
● To do this, we perform a LIST operation.
❯ kubectl get --raw '/api/v1/namespaces/default/pods'
{
    "kind": "PodList",
    "apiVersion": "v1",
    "metadata": {
        "resourceVersion": "1452",
        ...
    },
    "items": [...] // all pods
}
37. List
● In order to get the “current state”, we perform a LIST operation.
● Responses can get huge, so sometimes we paginate.
❯ kubectl get --raw '/api/v1/namespaces/default/pods?limit=100'
{
    "kind": "PodList",
    "apiVersion": "v1",
    "metadata": {
        "resourceVersion": "1452",
        "continue": "ENCODED_CONTINUE_TOKEN",
        ...
    },
    "items": [...] // pod0-pod99
}
38. List
● In order to get the “current state”, we perform a LIST operation.
● Responses can get huge, so sometimes we paginate.
● We can continue doing this till we get the entire “current state” (full list).
❯ kubectl get --raw '/api/v1/namespaces/default/pods?limit=100&continue=ENCODED_CONTINUE_TOKEN'
{
    "kind": "PodList",
    "apiVersion": "v1",
    "metadata": {
        "resourceVersion": "1452",
        "continue": "ENCODED_CONTINUE_TOKEN_2",
        ...
    },
    "items": [...] // pod100-pod199
}
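The limit/continue loop can be sketched in Go. This is only an illustration under stated assumptions: `fakeAPIServer`, `listPage`, `listAll` and `newFakeServer` are hypothetical names, the server is an in-memory stand-in, and the continue token is a plain offset (real continue tokens are opaque and must not be interpreted by clients).

```go
package main

import (
	"fmt"
	"strconv"
)

// podList mirrors only the fields of a PodList response that matter here.
type podList struct {
	ResourceVersion string
	Continue        string // empty when there are no more pages
	Items           []string
}

// fakeAPIServer is a hypothetical in-memory stand-in for the API Server.
type fakeAPIServer struct {
	resourceVersion string
	pods            []string
}

// listPage returns at most limit pods, starting at the offset encoded in
// continueToken. Using an offset as the token is an assumption of this sketch.
func (s *fakeAPIServer) listPage(limit int, continueToken string) (podList, error) {
	if limit <= 0 {
		limit = len(s.pods) // no limit: return everything in one page
	}
	start := 0
	if continueToken != "" {
		n, err := strconv.Atoi(continueToken)
		if err != nil || n < 0 || n > len(s.pods) {
			return podList{}, fmt.Errorf("invalid continue token %q", continueToken)
		}
		start = n
	}
	end := start + limit
	if end > len(s.pods) {
		end = len(s.pods)
	}
	page := podList{ResourceVersion: s.resourceVersion, Items: s.pods[start:end]}
	if end < len(s.pods) {
		page.Continue = strconv.Itoa(end)
	}
	return page, nil
}

// listAll keeps following the continue token until the full list is read.
func listAll(s *fakeAPIServer, limit int) ([]string, string, error) {
	var all []string
	token, rv := "", ""
	for {
		page, err := s.listPage(limit, token)
		if err != nil {
			return nil, "", err
		}
		all = append(all, page.Items...)
		rv = page.ResourceVersion
		if page.Continue == "" {
			return all, rv, nil
		}
		token = page.Continue
	}
}

// newFakeServer creates a server holding n pods named pod0..pod(n-1).
func newFakeServer(n int) *fakeAPIServer {
	s := &fakeAPIServer{resourceVersion: "1452"}
	for i := 0; i < n; i++ {
		s.pods = append(s.pods, "pod"+strconv.Itoa(i))
	}
	return s
}

func main() {
	pods, rv, err := listAll(newFakeServer(250), 100)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(pods), rv)
}
```

The resourceVersion returned with the list is what a client would later hand to a WATCH.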
43. Watch
“Kubernetes is a declarative, event-
driven system.”
https://www.mgasch.com/2018/08/k8sevents/
46. Watch
● I have my state of the world from LIST. Now I need to know as and when events happen that modify this state so that I can take corrective action.
● WATCH for changes. The API Server gives us a stream of notifications on a single connection that we can “react” to.
❯ kubectl get --raw '/api/v1/namespaces/default/pods?watch=1&resourceVersion=1452'
{
    "type": "MODIFIED",
    "object": {
        "kind": "Pod", "apiVersion": "v1",
        "metadata": {"resourceVersion":"1650", ...}, ...}
}
...
{
    "type": "DELETED",
    "object": {
        "kind": "Pod", "apiVersion": "v1",
        "metadata": {"resourceVersion":"1734", ...}, ...}
}
50. resourceVersion
● Opaque string representing “internal version” of an object.
● One big, global, logical clock.
● resourceVersion is backed by etcd’s store revisions* – which provide a global ordering.
● Increases monotonically whenever any change to the state of the world happens.
● Gives you a global order of events that happen in the system.
● Most importantly - they enable optimistic concurrency control.
51. resourceVersion
https://sched.co/1R2m8
62. The Kubernetes Storage Layer - Past
If you had a controller, the more replicas it had, the less scalable etcd became.
64. The Kubernetes Storage Layer - Present
As with any problem in Computer Science, we solve this one too with a layer (or layers) of indirection.
66. The Kubernetes Storage Layer - Present
Zooming in…
84. The Kubernetes Storage Layer - Present
● The store component is meant to reflect the state of etcd.
● A Cacher per object type is created at API Server start-up time.
● The caching layer can be disabled altogether (--watch-cache=false).
● The caching layer can be disabled on a per-object-type (GroupResource) basis (--watch-cache-sizes) by setting the size to 0; all non-zero values are equivalent.
85. The Kubernetes Storage Layer - Present
How do different requests interact with our
present storage layer?
86. The Kubernetes Storage Layer - Present
Interlude – resourceVersion semantics
93. resourceVersion semantics
● In each type of CRUD request, you can pass a resourceVersion parameter.
● The interpretation of this parameter translates into data consistency guarantees.
● Knowing how behaviour changes with resourceVersion interpretation can be crucial to scalability in some cases.
For any GET request (Get(), GetList(), Watch()):
resourceVersion = “”     Most recent data
resourceVersion = “0”    Any data (arbitrarily stale)
resourceVersion = “n”    Data at n
94. resourceVersion semantics
“Most recent data” is ensured by doing a quorum read in etcd (a round of raft happens, and you get a linearizable read).
96. resourceVersion semantics
There is also resourceVersionMatch, which complements resourceVersion in how it is interpreted. You should always provide it if you specify a resourceVersion in a LIST request.
● resourceVersionMatch=NotOlderThan
● resourceVersionMatch=Exact
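As a small illustration, the request path for such a LIST can be assembled with the Go standard library. `listPodsPath` is a hypothetical helper written for this sketch; `resourceVersion` and `resourceVersionMatch` are the query parameters named on this slide, and the resulting path is the kind of argument passed to `kubectl get --raw` earlier in the deck.

```go
package main

import (
	"fmt"
	"net/url"
)

// listPodsPath builds the request path for a LIST of pods in a namespace.
// Empty arguments are left out of the query string.
func listPodsPath(namespace, resourceVersion, resourceVersionMatch string) string {
	q := url.Values{}
	if resourceVersion != "" {
		q.Set("resourceVersion", resourceVersion)
	}
	if resourceVersionMatch != "" {
		q.Set("resourceVersionMatch", resourceVersionMatch)
	}
	u := url.URL{
		Path:     "/api/v1/namespaces/" + namespace + "/pods",
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(listPodsPath("default", "", ""))                 // most recent data
	fmt.Println(listPodsPath("default", "1452", "NotOlderThan")) // at least as fresh as 1452
	fmt.Println(listPodsPath("default", "1452", "Exact"))        // exactly at 1452
}
```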
97. resourceVersion semantics
This still isn’t the full picture! Please see:
https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions
98. Request Behaviour
The best way to see how the different layers of the Kubernetes Storage Layer come into play, and what their scalability aspects are, is to look at how different types of requests are served.
101. Request Behaviour
Create()
● A Create() request goes straight to etcd.
● The created object gets populated in the watchCache asynchronously, because the Cacher also has a WATCH open on etcd.
104. Request Behaviour
Delete()
● A Delete() request tries to delete the version of the object that exists in the watchCache: it performs a read (GetByKey) on the watchCache before going to etcd.
● As usual, the changes are propagated back via the WATCH on etcd.
107. Request Behaviour
GuaranteedUpdate()
● Similar to Delete(), we try and update the version
of the object that exists in the watchCache.
● As usual, the changes are propagated back via the
WATCH on etcd.
109. Request Behaviour
Get()
If resourceVersion = “”
Request goes straight to etcd, served after a quorum
read (linearizable).
110. Request Behaviour
Get()
If resourceVersion = “0”
Request returns after performing a read on the
watchCache (which in turn queries the store), no
concern for freshness of data. Request doesn’t reach
etcd.
112. Request Behaviour
Get()
If resourceVersion = “n”; n != “0”
● We first wait for the cache to become as fresh as n.
○ Waiting has a timeout of ~3 seconds.
● Once that happens, the read happens on the
watchCache (which queries the underlying store)
to return the result.
114. Request Behaviour
GetList()
func shouldDelegateList(...) bool {
    consistentReadFromStorage := resourceVersion == ""
    hasContinuation := len(pred.Continue) > 0
    hasLimit := pred.Limit > 0 && resourceVersion != "0"
    unsupportedMatch := match != "" && match != metav1.ResourceVersionMatchNotOlderThan
    return consistentReadFromStorage || hasContinuation || hasLimit || unsupportedMatch
}
116. Request Behaviour
GetList()
func shouldDelegateList(...) bool {
    consistentReadFromStorage := resourceVersion == ""
    ...
    return consistentReadFromStorage || hasContinuation || hasLimit || unsupportedMatch
}
● Request goes straight to etcd and is served as a linearizable read.
118. Request Behaviour
GetList()
func shouldDelegateList(...) bool {
    ...
    hasContinuation := len(pred.Continue) > 0
    ...
    return consistentReadFromStorage || hasContinuation || hasLimit || unsupportedMatch
}
● If the LIST is a paginated one, no matter what resourceVersion you give, the request is going to be served from etcd.
● watchCache does not support pagination yet.
122. Request Behaviour
GetList()
func shouldDelegateList(...) bool {
    ...
    hasLimit := pred.Limit > 0 && resourceVersion != "0"
    ...
    return consistentReadFromStorage || hasContinuation || hasLimit || unsupportedMatch
}
● If we have a limit set on our LIST with a resourceVersion other than “0”, we send it to etcd.
● It doesn’t matter whether we have consistent data in the cache or not: we cannot support a continue from this limit later anyway.
123. Request Behaviour
GetList()
● If no limit is set, we can serve the LIST from the watchCache itself.
124. Request Behaviour
GetList()
● But… if we set a limit and set resourceVersion to “0”, we essentially ignore the limit and list from the cache anyway. Why?
125. Request Behaviour
GetList()
● Well… resourceVersion=“0” has “Any data” semantics, so serving from the cache makes sense.
127. Request Behaviour
GetList()
● More importantly, it lets us serve from the cache the LISTs whose responses we know have a good chance of being massive, i.e. initial lists, thus reducing the load on etcd.
128. Request Behaviour
GetList()
● Example: a ~large cluster can have O(1000) nodes, each node having O(100) pods, so if a kubelet or a StatefulSet controller were to perform a list on the pods…
129. Request Behaviour
GetList()
● Clients that support List/Watch functionality (client-go reflectors) make sure to set resourceVersion to “0” when performing the first list.
131. Request Behaviour
GetList()
func shouldDelegateList(...) bool {
    ...
    unsupportedMatch := match != "" && match != metav1.ResourceVersionMatchNotOlderThan
    return consistentReadFromStorage || hasContinuation || hasLimit || unsupportedMatch
}
● The watchCache only supports NotOlderThan, so if that is set, we serve the list from the watchCache.
132. Request Behaviour
GetList()
● If not, we serve the list from etcd, honouring exact semantics.
133. Request Behaviour
GetList()
func shouldDelegateList(...) bool {
    ...
    return consistentReadFromStorage || hasContinuation || hasLimit || unsupportedMatch
}
● The only time we serve a list from the watchCache is when we specify a non-empty resourceVersion
● AND it is not a paginated list (no continue, and no limit unless resourceVersion is “0”)
● AND resourceVersionMatch is either unset or NotOlderThan.
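To make the decision table concrete, here is a runnable restatement of the snippet from the slides. The boolean logic is copied from the slides; the `listOptions` struct and the string literal "NotOlderThan" (standing in for `metav1.ResourceVersionMatchNotOlderThan`) are simplifications assumed for this sketch so that it has no dependencies.

```go
package main

import "fmt"

// listOptions is a simplified, hypothetical stand-in for the arguments the
// real function receives.
type listOptions struct {
	ResourceVersion      string
	ResourceVersionMatch string // "", "NotOlderThan" or "Exact"
	Limit                int64
	Continue             string
}

// shouldDelegateList reports whether a LIST must be sent to etcd instead of
// being served from the watchCache.
func shouldDelegateList(o listOptions) bool {
	consistentReadFromStorage := o.ResourceVersion == ""
	hasContinuation := len(o.Continue) > 0
	hasLimit := o.Limit > 0 && o.ResourceVersion != "0"
	unsupportedMatch := o.ResourceVersionMatch != "" && o.ResourceVersionMatch != "NotOlderThan"
	return consistentReadFromStorage || hasContinuation || hasLimit || unsupportedMatch
}

func main() {
	cases := []struct {
		name string
		opts listOptions
	}{
		{"no resourceVersion", listOptions{}},
		{"rv=0", listOptions{ResourceVersion: "0"}},
		{"rv=0 with limit", listOptions{ResourceVersion: "0", Limit: 500}},
		{"rv=1452 with limit", listOptions{ResourceVersion: "1452", Limit: 500}},
		{"rv=1452 NotOlderThan", listOptions{ResourceVersion: "1452", ResourceVersionMatch: "NotOlderThan"}},
		{"rv=1452 Exact", listOptions{ResourceVersion: "1452", ResourceVersionMatch: "Exact"}},
		{"continue token", listOptions{ResourceVersion: "0", Continue: "ENCODED_CONTINUE_TOKEN"}},
	}
	for _, c := range cases {
		target := "watchCache"
		if shouldDelegateList(c.opts) {
			target = "etcd"
		}
		fmt.Printf("%-22s -> %s\n", c.name, target)
	}
}
```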
137. Request Behaviour
GetList()
There are a few gotchas to keep in mind here!
● When you need consistent LISTs and the request goes to etcd, the API Server can see spikes of unbounded memory growth, depending on response sizes.
● Data needs to be fetched from etcd and unmarshalled, conversions take place, and the response is prepared.
● Sometimes paginating responses will not help either, if each chunk is itself large.
141. Request Behaviour
GetList()
● KEP-3157 proposes, for informers, streaming data
from watchCache rather than paging in etcd.
● Predictable memory footprint irrespective of LIST
response sizes and consistency requirements.
● Handles the lack of pagination in watchCache.
This is set to be in Alpha as of Kubernetes v1.28, please
try it out and provide feedback!
(Feature Gate: WatchList)
https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/3157-watch-list
142. Request Behaviour
GetList()
Another gotcha - time travelling, stale reads from
watchCache!
https://github.com/kubernetes/kubernetes/issues/59848
144. Request Behaviour
GetList()
Another gotcha - time travelling, stale reads from watchCache!
● If you have an HA setup with watchCache enabled, the cache of one API Server can be far behind that of another.
● Since informers/reflectors default to resourceVersion=“0” for their first LIST for scalability reasons, and these LISTs are served from the watchCache, we can get “data from the past”.
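A toy model shows how a client can end up "in the past". All names here (`apiServer`, `listFromCache`, `wentBackInTime`) are hypothetical, and comparing resourceVersions as integers is an assumption made only for illustration, since clients are meant to treat them as opaque.

```go
package main

import "fmt"

// apiServer is a hypothetical stand-in for one API Server replica; cacheRV is
// the resourceVersion its watchCache has caught up to.
type apiServer struct {
	name    string
	cacheRV uint64
}

// listFromCache models a LIST with resourceVersion="0": it returns whatever
// state the replica's cache currently holds, however stale.
func (s apiServer) listFromCache() uint64 { return s.cacheRV }

// wentBackInTime reports whether a fresh LIST is older than state the client
// had already observed.
func wentBackInTime(lastObserved, listed uint64) bool {
	return listed < lastObserved
}

func main() {
	fresh := apiServer{name: "apiserver-a", cacheRV: 1734}
	lagging := apiServer{name: "apiserver-b", cacheRV: 1650}

	// The client first lists from the up-to-date replica...
	lastObserved := fresh.listFromCache()
	// ...then restarts and happens to reach the lagging replica.
	relisted := lagging.listFromCache()

	fmt.Println("time travel:", wentBackInTime(lastObserved, relisted))
}
```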
145. Request Behaviour
GetList()
Another gotcha - time travelling, stale reads from
watchCache!
Externally to Kubernetes - there are a few tools that
have come from collaboration between industry and
academia that can help automatically detect such issues
(and more) if your controllers are susceptible to them:
● sieve: https://github.com/sieve-project/sieve
● acto: https://github.com/xlab-uiuc/acto
148. Request Behaviour
GetList()
Another gotcha - time travelling, stale reads from
watchCache!
Within Kubernetes –
● There are a couple of KEPs that are attempting to
solve this in a scoped manner:
○ KEP-3157: Watch List
○ KEP-2340: Consistent Reads From Cache
This is in Alpha since Kubernetes v1.28, please try it out and
provide feedback!
(Feature Gate: ConsistentListFromCache)
https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/2340-Consistent-reads-from-cache
149. Request Behaviour
GetList()
You get some nice performance benefits from both these KEPs!
For KEP-3157: Watch List (http://perf-dash.k8s.io)
151. Request Behaviour
GetList()
You get some nice performance benefits from both these KEPs!
For KEP-2340: Consistent Reads From Cache (https://github.com/kubernetes/test-infra/pull/30094)
153. Request Behaviour
Watch()
If resourceVersion = “”, we delegate the request to
etcd as always.
157. Request Behaviour
Watch()
Otherwise, we serve it from the watchCache.
● To do so, we first set up a cacheWatcher, which is responsible for serving a Watch request.
● Each cacheWatcher statically allocates an input buffer, the size of which is determined by heuristics based on what we’ve seen in our scale testing.
● As soon as the buffer becomes full, we terminate the Watch, and clients re-establish one from the last observed resourceVersion.
160. Request Behaviour
Watch()
Otherwise, we serve it from the watchCache.
● Essentially, the cost of keeping up with Watch events is that of establishing a Watch connection.
● However, a slow client, a slow server, or just a storm of rapid updates can cause the buffer to become full, necessitating a new connection.
https://github.com/kubernetes/kubernetes/issues/121438
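The buffer behaviour can be modelled with a bounded Go channel and a non-blocking send. This is a simplified sketch, not the actual cacheWatcher: the type and method names are reused only for readability, the buffer size is arbitrary, and the sketch terminates the watcher the moment the buffer is full, as described on the slides.

```go
package main

import "fmt"

type event struct{ ResourceVersion uint64 }

// cacheWatcher is a simplified, hypothetical model of the per-Watch component.
type cacheWatcher struct {
	input   chan event // statically sized input buffer
	stopped bool
}

func newCacheWatcher(bufferSize int) *cacheWatcher {
	return &cacheWatcher{input: make(chan event, bufferSize)}
}

// add tries to enqueue ev without blocking. If the buffer is full, the
// watcher is terminated and the client has to start a new Watch.
func (w *cacheWatcher) add(ev event) bool {
	if w.stopped {
		return false
	}
	select {
	case w.input <- ev:
		return true
	default:
		w.stopped = true
		close(w.input)
		return false
	}
}

func main() {
	w := newCacheWatcher(2)
	// A storm of updates arrives while the client is not reading.
	for rv := uint64(1); rv <= 5; rv++ {
		if !w.add(event{ResourceVersion: rv}) {
			break
		}
	}
	// The slow client drains what was buffered and remembers where to resume.
	var lastSeen uint64
	for ev := range w.input {
		lastSeen = ev.ResourceVersion
	}
	fmt.Println("terminated:", w.stopped, "re-establish Watch from:", lastSeen)
}
```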
161. Conclusion
• The List + Watch pattern is central to how the Kubernetes machine works, and helps enable the controller pattern.
• Different requests interact differently with each of the layers, depending on the type of request and the value of the resourceVersion (and resourceVersionMatch) specified.
• Specifying resourceVersion and resourceVersionMatch lets you make the tradeoff between data consistency and latency, which has a major impact on the scalability of your cluster.
• Unless you have strict consistency requirements, trust the watchCache, but beware of time travel queries!
162. References
• [Design Proposal] New storage layer design
• Cacher Source Code
• etcd3 storage layer source code
• shouldDelegateList
• [Kubernetes Enhancement Proposal] Consistent Reads From Cache
• [Kubernetes Enhancement Proposal] Watch List
• Sieve: Automatic Reliability Testing for Kubernetes Controllers and Operators
• Acto: Push-Button End-to-End Testing of Kubernetes Operators/Controllers
163. Thank you!
Twitter (X?): @MadhavJivrajani
Kubernetes/CNCF Slack: @madhav