Streaming Algorithms Reference

This section contains reference of the data sketches.

Symbols in the package CL-DATA-STRUCTURES.STREAMING-ALGORITHMS:
BLOOM-FILTER
Lambda List: (RANGE &KEY HASH-FN WIDTH DEPTH KEY HASHES DATA-SKETCH)
Arguments:
  • RANGE, Input for the creation of the bloom filter.
  • SPACE, What is the bloom vector size?
  • HASH-FN, Hashing function. SXHASH will do for strings.
  • COUNT, How many bits are used for each item?
  • KEY, Function used to extract value for hashing.
  • HASHES, Optional hashes vector. Needs to be supplied in order to ensure that the same hash values are generated between different filters.
  • DATA-SKETCH, Instead of the bits and the hash-fn, the user can pass a data-sketch argument.

Description: Creates bloom filter out of elements in the range. Bloom filter is a memory efficient data structure allowing to check if an item is absent from the range. If AT returns nil, the item is certainly absent. If AT returns T item either present or not.
Returns: Instance of the fundamental-data-sketch class. Use cl-ds:at to check if element is present. False positives are possible, false negatives are not possible.
Exceptional situations:
  • Will signal a TYPE-ERROR if either SPACE or COUNT is not integer.
  • Will signal a CL-DS:ARGUMENT-VALUE-OUT-OF-BOUNDS if either SPACE or COUNT is not above 0.
  • Will signal a TYPE-ERROR if HASH-FN is not funcallable.
  • Will signal a TYPE-ERROR if HASHES is not (OR NULL (SIMPLE-ARRAY FIXNUM (* 2)).

APPROXIMATED-COUNTS
Lambda List: (RANGE &KEY HASH-FN DEPTH WIDTH KEY HASHES DATA-SKETCH)
Arguments:
  • RANGE, Object to aggregate.
  • HASH-FN, Hashing function. SXHASH will do for strings.
  • SPACE, Positive integer. Size of the counters array
  • COUNT, Number of hashing functions used.
  • DATA-SKETCH, Instead of the bits and the hash-fn, the user can pass a data-sketch argument.

Description: Calculates estimated counts using Min-Count sketch algorithm. This requires only a constant amount of memory.
Returns: Instance of the fundamental-data-sketch class. Use CL-DS:AT to extract count estimate for element from it.
Exceptional situations:
  • Will signal a TYPE-ERROR when either COUNT or SPACE is not integer.
  • Will signal a TYPE-ERROR when HASH-FN is not funcallable.
  • Will signal a TYPE-ERROR when HASHES is not either NIL or (SIMPLE-ARRAY FIXNUM (*)).
  • Will signal a CL-DS:ARGUMENT-VALUE-OUT-OF-BOUNDS if either COUNT or SPACE is not above zero.

Notes:
  • Quality of the estimate directly depends on DEPTH and WIDTH.
  • Sensitive to a hash function. Large avalanche factor is very helpful.

APPROXIMATED-TOP-K
Lambda List: (RANGE K &KEY HASH-FN WIDTH DEPTH KEY HASHES TEST DATA-SKETCH)
Arguments:
  • RANGE, Input for scanning.
  • K, Determines the maximum size of the result.
  • HASHES, Optional hashes vector. Needs to be supplied in order to ensure that the same hash values are generated between different filters.
  • TEST, Test to check elements equality.
  • HASH-FN, Hashing function, defaults to sxhash.
  • DEPTH, Depth for count min sketch (APPROXIMATED-COUNTS). Higher value increases confidence.
  • WIDTH, Width for count min sketch (APPROXIMATED-COUNTS). Higher value decreases error.

Description: Attempts to find the top most common elements in a stream. Uses count-min sketch to accomplish that, so it will need only constant memory.
Returns: Range of pairs. CAR of pair is the object, CDR of pair is estimated count. Range is ordered, with the most frequent element at zero and at most it has K values.
APPROXIMATED-SET-CARDINALITY
Lambda List: (RANGE &KEY HASH-FN KEY DATA-SKETCH)
Arguments:
  • RANGE, Object to aggregate.
  • HASH-FN, Hashing function. SXHASH will do for strings.
  • DATA-SKETCH, Instead of the bits and the hash-fn, the user can pass a data-sketch argument.
  • KEY, A function used to extract value from each element.

Examples:
(LET ((DATA
       (CL-DATA-STRUCTURES:XPR (:I 0)
         (WHEN (< I 500000)
           (CL-DATA-STRUCTURES:SEND-RECUR (RANDOM 99999999999) :I (1+ I))))))
  (PROVE.TEST:OK
   (< 490000
      (CL-DATA-STRUCTURES:VALUE
       (CL-DATA-STRUCTURES.STREAMING-ALGORITHMS:APPROXIMATED-SET-CARDINALITY
        DATA :HASH-FN #'CL-DATA-STRUCTURES.UTILS:HASH-INTEGER))
      510000)))

Description: Calculates the estimated set cardinality using the HyperLogLog algorithm. This requires only a constant (and modest) amount of memory.
Returns: Instance of the fundamental-data-sketch class. Use CL-DS:VALUE to extract estimate from it.
Exceptional situations:
  • Will signal a TYPE-ERROR if HASH-FN is not funcallable.

Notes:
  • This algorithm gives a solid estimate for large sets, not so good for small sets.
  • Sensitive to a hash function. Large avalanche factor is very helpful. Needs all 64 bits so sxhash won't be fine.
  • Can be used to (for instance) estimate number of keys before creating a hash table. A good estimate of size minimizes rehashing and therefore reduces both memory allocation and time required to fill the hash table.

UNION
Lambda List: (FIRST-SKETCH &REST MORE-SKETCHES)
Description: Creates new data-sketch from the provided. Can be used to join sketches built on different data chunks.
CLEAN-SKETCH
Lambda List: (FUNCTION &REST ARGUMENTS &KEY HASHES HASH-FN DEPTH WIDTH &ALLOW-OTHER-KEYS)
Description: Creates a new, empty data-sketch that would be produced by the function. New data-sketch can be cloned and passed as :data-sketch. This allows to keep compatibility between results of call to the streaming function.
FUNDAMENTAL-DATA-SKETCH
Inheritance
Description: The base class of all data sketches. Instances of this class can be passed to streaming algorihms as initial states, cloned and combined into unions.