Reservoir sampling

Reservoir sampling is a technique for selecting a fair random sample when you don't know the size of the set you're sampling from. By the end of this essay you will know:

  • When you would need reservoir sampling.
  • The mathematics behind how it works, using only basic operations: subtraction, multiplication, and division. No math notation, I promise.
  • A simple way to implement reservoir sampling if you want to use it.

A picture of a cartoon husky called "Doe"

Before you scroll! This post has been sponsored by the wonderful folks at ittybit, and their API for working with videos, images, and audio. If you need to store, encode, or get intelligence from the media files in your app, check them out!

# Sampling when you know the size

In front of you are 10 playing cards and I ask you to pick 3 at random. How do you do it?

The first technique that might come to mind from your childhood is to mix them all up in the middle. Then you can straighten them out and pick the first 3. You can see this happen below by clicking "Shuffle."

Every time you click "Shuffle," the chart below tracks what the first 3 cards were.

At first you'll notice some cards are selected more than others, but if you keep going it will even out. All cards have an equal chance of being selected. This makes it "fair."
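If you can't click along, the same experiment can be sketched in a few lines of Python. This is a minimal sketch, not the code behind the chart: the function name `pick_by_shuffling`, the deck of numbers 1 to 10, and the 10,000 repetitions are all assumptions made for illustration.

```python
import random
from collections import Counter


def pick_by_shuffling(cards, k=3):
    """Shuffle a copy of the deck and return the first k cards."""
    deck = list(cards)    # copy, so the caller's deck is left untouched
    random.shuffle(deck)  # mix them all up
    return deck[:k]       # straighten them out and take the first k


# Hypothetical deck: the numbers 1 to 10 stand in for the 10 cards.
cards = list(range(1, 11))

# Repeat the shuffle many times and count how often each card is picked.
tally = Counter()
for _ in range(10_000):
    tally.update(pick_by_shuffling(cards))

for card in cards:
    print(card, tally[card])
```

With this many repetitions, each card's count should land close to 3,000, which is 3 picks out of 10 cards over 10,000 shuffles. That is the "evening out" described above.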

Click "Shuffle 100 times" until the chart evens out. You can reset the chart if you'd like to start over.

This method works fine with 10 cards, but what if you had 1 million cards? Mixing those up won't be easy. Instead, we could use a random number generator to pick 3 indices. These would be our 3 chosen cards.
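Picking indices might look something like the sketch below. It assumes the 3 indices must be distinct, since you can't pick the same card twice, and it uses Python's standard `random.sample` to guarantee that. The function name `pick_indices` is made up for this example.

```python
import random


def pick_indices(n, k=3):
    """Return k distinct random indices in the range 0..n-1."""
    # Sampling from a range avoids building a list of n items in memory.
    return random.sample(range(n), k)


# 1 million cards, 3 chosen positions.
chosen = pick_indices(1_000_000)
print(chosen)
```

Note that this only works because we passed in `n`, the number of cards. It relies on knowing the size up front.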
