Public Review for
Probabilistic Lossy Counting: An efficient algorithm for finding heavy hitters
Xenofontas Dimitropoulos, Paul Hurley, Andreas Kind

Imagine that you see a large number of individual transactions (such as Amazon book sales), and you want to calculate today's top sellers. Or imagine that you are monitoring network traffic and you want to know which hosts or subnets are responsible for most of the traffic. This is the problem of finding heavy hitters in a stream of elements.

A straightforward method to tackle this problem is to store each element identifier with a corresponding counter that tracks the number of occurrences of that element. You then sort the elements according to their counters and read off the most frequent ones. However, in many real scenarios, this simple solution is not efficient or even computationally feasible. For instance, consider the case of tracking the pairs of IP addresses that generate the most traffic over some time period. Keeping a counter for every possible pair requires 16,384 PBytes of memory and a lot of time to sort and scan that memory array, which often makes the problem computationally infeasible. For these reasons, during recent years, techniques for computing heavy hitters using limited memory