Reservoir Based Sampling
I ran into a programming problem where a very large stream of data was coming through and I needed a decent random sample of values from it. I didn't want to load the whole stream into memory, so I opted for reservoir-based sampling. As gregable.com puts it, "Reservoir Sampling is an algorithm for sampling elements from a stream of data." It makes a single pass over the stream and keeps only a fixed-size sample in memory, so I was able to return a set of random values efficiently and without much code. Implementing it with Java 8 Streams gave me a reusable, generic sampler.
The results are below.
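Here is a minimal sketch of what such a sampler can look like, assuming the classic "Algorithm R" variant of reservoir sampling. The class name `ReservoirSampler`, its methods, and the `collector` factory are hypothetical names chosen for illustration, and the sketch assumes a sequential stream; treat it as one possible shape rather than a definitive implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Collector;

// Hypothetical generic sampler: keeps at most `capacity` items in memory.
final class ReservoirSampler<T> {
    private final int capacity;
    private final Random random;
    private final List<T> reservoir;
    private long seen = 0; // number of items observed so far

    ReservoirSampler(int capacity, Random random) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.random = random;
        this.reservoir = new ArrayList<>(capacity);
    }

    // Algorithm R: the first `capacity` items fill the reservoir; after that,
    // the n-th item replaces a random slot with probability capacity / n.
    void accept(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(item);
            return;
        }
        // Approximately uniform index in [0, seen); java.util.Random on Java 8
        // has no bounded nextLong, so this scales a double instead.
        long j = (long) (random.nextDouble() * seen);
        if (j < capacity) {
            reservoir.set((int) j, item);
        }
    }

    // Returns a copy of the current sample.
    List<T> sample() {
        return new ArrayList<>(reservoir);
    }

    // Collector for use with Stream.collect. Assumption: sequential streams
    // only, so the combiner refuses to merge partial reservoirs.
    static <T> Collector<T, ReservoirSampler<T>, List<T>> collector(int capacity, Random random) {
        return Collector.of(
                () -> new ReservoirSampler<T>(capacity, random),
                ReservoirSampler::accept,
                (a, b) -> {
                    throw new UnsupportedOperationException("parallel streams are not supported");
                },
                ReservoirSampler::sample);
    }
}
```

With this sketch, sampling ten values from a stream would be something like `stream.collect(ReservoirSampler.collector(10, new Random()))`. Wrapping the logic in a `Collector` keeps the sampler reusable for any element type, and passing in the `Random` makes runs reproducible when a fixed seed is supplied.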
References
- Algorithms Every Data Scientist Should Know: Reservoir Sampling
- Reservoir Sampling
- Reservoir Sampling