Streaming Algorithms Reference

This section contains reference of the data sketches.

Symbols in the package CL-DATA-STRUCTURES.STREAMING-ALGORITHMS:
BLOOM-FILTER
Lambda List: (RANGE &KEY HASH-FN WIDTH DEPTH KEY HASHES DATA-SKETCH)
Arguments:
  • RANGE, Input for the creation of the bloom filter.
  • WIDTH, What is the inner size?
  • HEIGHT, What is the inner size?
  • HASH-FN, Hashing function. SXHASH will do for strings.
  • COUNT, How many bits are used for each item?
  • KEY, Function used to extract value for hashing.
  • HASHES, Optional hashes vector. Needs to be supplied in order to ensure that the same hash values are generated between different filters.
  • DATA-SKETCH, Instead of the bits and the hash-fn, the user can pass a data-sketch argument.

Description: Creates bloom filter out of elements in the range. Bloom filter is a memory efficient data structure allowing to check if an item is absent from the range. If AT returns nil, the item is certainly absent. If AT returns T item either present or not.
Returns: Instance of the fundamental-data-sketch class. Use cl-ds:at to check if element is present. False positives are possible, false negatives are not possible.
Exceptional situations:
  • Will signal a TYPE-ERROR if either SPACE or COUNT is not integer.
  • Will signal a CL-DS:ARGUMENT-VALUE-OUT-OF-BOUNDS if either width, heigh, or COUNT is not above 0.
  • Will signal a TYPE-ERROR if HASH-FN is not funcallable.
  • Will signal a TYPE-ERROR if HASHES is not (OR NULL (SIMPLE-ARRAY FIXNUM (* 2)).

APPROXIMATED-COUNTS
Lambda List: (RANGE &KEY HASH-FN DEPTH WIDTH KEY HASHES DATA-SKETCH)
Arguments:
  • RANGE, Object to aggregate.
  • HASH-FN, Hashing function. SXHASH will do for strings.
  • SPACE, Positive integer. Size of the counters array
  • COUNT, Number of hashing functions used.
  • DATA-SKETCH, Instead of the bits and the hash-fn, the user can pass a data-sketch argument.

Description: Calculates estimated counts using Min-Count sketch algorithm. This requires only a constant amount of memory.
Returns: Instance of the fundamental-data-sketch class. Use CL-DS:AT to extract count estimate for element from it.
Exceptional situations:
  • Will signal a TYPE-ERROR when either COUNT or SPACE is not integer.
  • Will signal a TYPE-ERROR when HASH-FN is not funcallable.
  • Will signal a TYPE-ERROR when HASHES is not either NIL or (SIMPLE-ARRAY FIXNUM (*)).
  • Will signal a CL-DS:ARGUMENT-VALUE-OUT-OF-BOUNDS if either COUNT or SPACE is not above zero.

Notes:
  • Quality of the estimate directly depends on DEPTH and WIDTH.
  • Sensitive to a hash function. Large avalanche factor is very helpful.

APPROXIMATED-TOP-K
Lambda List: (RANGE K &KEY HASH-FN WIDTH DEPTH KEY HASHES TEST DATA-SKETCH)
Arguments:
  • RANGE, Input for scanning.
  • K, Determines the maximum size of the result.
  • HASHES, Optional hashes vector. Needs to be supplied in order to ensure that the same hash values are generated between different filters.
  • TEST, Test to check elements equality.
  • HASH-FN, Hashing function, defaults to sxhash.
  • DEPTH, Depth for count min sketch (APPROXIMATED-COUNTS). Higher value increases confidence.
  • WIDTH, Width for count min sketch (APPROXIMATED-COUNTS). Higher value decreases error.

Description: Attempts to find the top most common elements in a stream. Uses count-min sketch to accomplish that, so it will need only constant memory.
Returns: Range of pairs. CAR of pair is the object, CDR of pair is estimated count. Range is ordered, with the most frequent element at zero and at most it has K values.
APPROXIMATED-SET-CARDINALITY
Lambda List: (RANGE &KEY HASH-FN KEY DATA-SKETCH)
Arguments:
  • RANGE, Object to aggregate.
  • HASH-FN, Hashing function. SXHASH will do for strings.
  • DATA-SKETCH, Instead of the bits and the hash-fn, the user can pass a data-sketch argument.
  • KEY, A function used to extract value from each element.

Examples:
(LET ((DATA
       (CL-DATA-STRUCTURES:XPR (:I 0)
         (WHEN (< I 500000)
           (CL-DATA-STRUCTURES:SEND-RECUR (RANDOM 99999999999) :I (1+ I))))))
  (PROVE.TEST:OK
   (< 490000
      (CL-DATA-STRUCTURES:VALUE
       (CL-DATA-STRUCTURES.STREAMING-ALGORITHMS:APPROXIMATED-SET-CARDINALITY
        DATA :HASH-FN #'CL-DATA-STRUCTURES.UTILS:HASH-INTEGER))
      510000)))

Description: Calculates the estimated set cardinality using the HyperLogLog algorithm. This requires only a constant (and modest) amount of memory.
Returns: Instance of the fundamental-data-sketch class. Use CL-DS:VALUE to extract estimate from it.
Exceptional situations:
  • Will signal a TYPE-ERROR if HASH-FN is not funcallable.

Notes:
  • This algorithm gives a solid estimate for large sets, not so good for small sets.
  • Sensitive to a hash function. Large avalanche factor is very helpful. Needs all 64 bits so sxhash won't be fine.
  • Can be used to (for instance) estimate number of keys before creating a hash table. A good estimate of size minimizes rehashing and therefore reduces both memory allocation and time required to fill the hash table.

APPROXIMATED-HISTOGRAM
Lambda List: (RANGE &KEY KEY MAXIMAL-BINS-COUNT DATA-SKETCH)
Arguments:
  • RANGE, Input for the creation of the histogram.
  • MAXIMAL-BINS-COUNT, The maximal number of bins allowed for the histogram to create.
  • KEY, Function used to extract value for the histogram.
  • DATA-SKETCH, Instead of maximal-bins-count, the user can pass a data-sketch argument.

Description: Creates approximated histogram, as described in A Streaming Parallel Decision Tree Algorithm article.
Returns: Instance of APPROXIMATED-HISTOGRAM. Use cl-ds:at to check for quantile value and obtain approximation of statistical summaries using dedicated functions with the APPROXIMATED-HISTOGRAM prefix.
APPROXIMATED-HISTOGRAM-QUANTILE
Lambda List: (HISTOGRAM QUANTILE &REST OTHER-QUANTILES &AUX (ALL-QUANTILES (CONS QUANTILE OTHER-QUANTILES)))
Examples:
(LET ((Q
       (APPROXIMATED-HISTOGRAM-QUANTILE
        (APPROXIMATED-HISTOGRAM
         (CL-DATA-STRUCTURES.ALGORITHMS:ON-EACH
          (CL-DATA-STRUCTURES:IOTA-RANGE :FROM 1 :TO 100)
          (ALEXANDRIA:RCURRY #'COERCE 'DOUBLE-FLOAT)))
        0.5 0.95)))
  (PROVE.TEST:OK (< 49 (FIRST Q) 51))
  (PROVE.TEST:OK (< 94 (SECOND Q) 96)))

APPROXIMATED-HISTOGRAM-MEAN
Lambda List: (HISTOGRAM)
Description: Estimates mean of the distribution.
APPROXIMATED-HISTOGRAM-MEDIAN
Lambda List: (HISTOGRAM)
Description: Estimate median of the approximated distribution.
APPROXIMATED-HISTOGRAM-VARIANCE
Lambda List: (HISTOGRAM)
Description: Estimates variance of the distribution.
APPROXIMATED-HISTOGRAM-SUM
Lambda List: (HISTOGRAM)
Description: Estimates total sum of all elements passed to the histogram.
APPROXIMATED-HISTOGRAM-ADD
Lambda List: (HISTOGRAM VALUE)
Description: Adds a single element to the distribution.
APPROXIMATED-HISTOGRAM-COUNT-LOWER
Lambda List: (HISTOGRAM VALUE &REST OTHER-VALUES &AUX (ALL-VALUES (CONS VALUE OTHER-VALUES)))
Description: How many elements with value lower then the specified there is?
APPROXIMATED-HISTOGRAM-RANK-ORDER
Lambda List: (HISTOGRAM VALUE &REST OTHER-VALUES)
Description: What is the percantage of element in the histogram with value lower then the specified there is?
APPROXIMATED-HISTOGRAM-TRUNCATED-MEAN
Lambda List: (HISTOGRAM FRACTION)
Description: Estimate mean of the approximated distribution. Truncates outliers by the fraction.
UNION
Lambda List: (FIRST-SKETCH &REST MORE-SKETCHES)
Description: Creates new data-sketch from the provided. Can be used to join sketches built on different data chunks.
CLEAN-SKETCH
Lambda List: (FUNCTION &REST ARGUMENTS &KEY HASHES HASH-FN DEPTH WIDTH MAXIMAL-BINS-COUNT &ALLOW-OTHER-KEYS)
Description: Creates a new, empty data-sketch that would be produced by the function. New data-sketch can be cloned and passed as :data-sketch. This allows to keep compatibility between results of call to the streaming function.
FUNDAMENTAL-DATA-SKETCH
Inheritance
Description: The base class of all data sketches. Instances of this class can be passed to streaming algorihms as initial states, cloned and combined into unions.