Streaming Algorithms Reference
This section contains reference of the data sketches.
Symbols in the package CL-DATA-STRUCTURES.STREAMING-ALGORITHMS:
BLOOM-FILTER
Lambda List:
(RANGE &KEY HASH-FN WIDTH DEPTH KEY HASHES DATA-SKETCH)
Arguments:
- RANGE, Input for the creation of the bloom filter.
- WIDTH, What is the inner size?
- HEIGHT, What is the inner size?
- HASH-FN, Hashing function. SXHASH will do for strings.
- COUNT, How many bits are used for each item?
- KEY, Function used to extract value for hashing.
- HASHES, Optional hashes vector. Needs to be supplied in order to ensure that the same hash values are generated between different filters.
- DATA-SKETCH, Instead of the bits and the hash-fn, the user can pass a data-sketch argument.
Description: Creates bloom filter out of elements in the range. Bloom filter is a memory efficient data structure allowing to check if an item is absent from the range. If AT returns nil, the item is certainly absent. If AT returns T item either present or not.
Returns: Instance of the fundamental-data-sketch class. Use cl-ds:at to check if element is present. False positives are possible, false negatives are not possible.
Exceptional situations:
- Will signal a TYPE-ERROR if either SPACE or COUNT is not integer.
- Will signal a CL-DS:ARGUMENT-VALUE-OUT-OF-BOUNDS if either width, heigh, or COUNT is not above 0.
- Will signal a TYPE-ERROR if HASH-FN is not funcallable.
- Will signal a TYPE-ERROR if HASHES is not (OR NULL (SIMPLE-ARRAY FIXNUM (* 2)).
APPROXIMATED-COUNTS
Lambda List:
(RANGE &KEY HASH-FN DEPTH WIDTH KEY HASHES DATA-SKETCH)
Arguments:
- RANGE, Object to aggregate.
- HASH-FN, Hashing function. SXHASH will do for strings.
- SPACE, Positive integer. Size of the counters array
- COUNT, Number of hashing functions used.
- DATA-SKETCH, Instead of the bits and the hash-fn, the user can pass a data-sketch argument.
Description: Calculates estimated counts using Min-Count sketch algorithm. This requires only a constant amount of memory.
Returns: Instance of the fundamental-data-sketch class. Use CL-DS:AT to extract count estimate for element from it.
Exceptional situations:
- Will signal a TYPE-ERROR when either COUNT or SPACE is not integer.
- Will signal a TYPE-ERROR when HASH-FN is not funcallable.
- Will signal a TYPE-ERROR when HASHES is not either NIL or (SIMPLE-ARRAY FIXNUM (*)).
- Will signal a CL-DS:ARGUMENT-VALUE-OUT-OF-BOUNDS if either COUNT or SPACE is not above zero.
Notes:
- Quality of the estimate directly depends on DEPTH and WIDTH.
- Sensitive to a hash function. Large avalanche factor is very helpful.
APPROXIMATED-TOP-K
Lambda List:
(RANGE K &KEY HASH-FN WIDTH DEPTH KEY HASHES TEST DATA-SKETCH)
Arguments:
- RANGE, Input for scanning.
- K, Determines the maximum size of the result.
- HASHES, Optional hashes vector. Needs to be supplied in order to ensure that the same hash values are generated between different filters.
- TEST, Test to check elements equality.
- HASH-FN, Hashing function, defaults to sxhash.
- DEPTH, Depth for count min sketch (APPROXIMATED-COUNTS). Higher value increases confidence.
- WIDTH, Width for count min sketch (APPROXIMATED-COUNTS). Higher value decreases error.
Description: Attempts to find the top most common elements in a stream. Uses count-min sketch to accomplish that, so it will need only constant memory.
Returns: Range of pairs. CAR of pair is the object, CDR of pair is estimated count. Range is ordered, with the most frequent element at zero and at most it has K values.
APPROXIMATED-SET-CARDINALITY
Lambda List:
(RANGE &KEY HASH-FN KEY DATA-SKETCH)
Arguments:
- RANGE, Object to aggregate.
- HASH-FN, Hashing function. SXHASH will do for strings.
- DATA-SKETCH, Instead of the bits and the hash-fn, the user can pass a data-sketch argument.
- KEY, A function used to extract value from each element.
Examples:
(LET ((DATA
(CL-DATA-STRUCTURES:XPR (:I 0)
(WHEN (< I 500000)
(CL-DATA-STRUCTURES:SEND-RECUR (RANDOM 99999999999) :I (1+ I))))))
(PROVE.TEST:OK
(< 490000
(CL-DATA-STRUCTURES:VALUE
(CL-DATA-STRUCTURES.STREAMING-ALGORITHMS:APPROXIMATED-SET-CARDINALITY
DATA :HASH-FN #'CL-DATA-STRUCTURES.UTILS:HASH-INTEGER))
510000)))
Description: Calculates the estimated set cardinality using the HyperLogLog algorithm. This requires only a constant (and modest) amount of memory.
Returns: Instance of the fundamental-data-sketch class. Use CL-DS:VALUE to extract estimate from it.
Exceptional situations:
- Will signal a TYPE-ERROR if HASH-FN is not funcallable.
Notes:
- This algorithm gives a solid estimate for large sets, not so good for small sets.
- Sensitive to a hash function. Large avalanche factor is very helpful. Needs all 64 bits so sxhash won't be fine.
- Can be used to (for instance) estimate number of keys before creating a hash table. A good estimate of size minimizes rehashing and therefore reduces both memory allocation and time required to fill the hash table.
APPROXIMATED-HISTOGRAM
Lambda List:
(RANGE &KEY KEY MAXIMAL-BINS-COUNT DATA-SKETCH)
Arguments:
- RANGE, Input for the creation of the histogram.
- MAXIMAL-BINS-COUNT, The maximal number of bins allowed for the histogram to create.
- KEY, Function used to extract value for the histogram.
- DATA-SKETCH, Instead of maximal-bins-count, the user can pass a data-sketch argument.
Description: Creates approximated histogram, as described in A Streaming Parallel Decision Tree Algorithm article.
Returns: Instance of APPROXIMATED-HISTOGRAM. Use cl-ds:at to check for quantile value and obtain approximation of statistical summaries using dedicated functions with the APPROXIMATED-HISTOGRAM prefix.
APPROXIMATED-HISTOGRAM-QUANTILE
Lambda List:
(HISTOGRAM QUANTILE &REST OTHER-QUANTILES &AUX
(ALL-QUANTILES (CONS QUANTILE OTHER-QUANTILES)))
Examples:
(LET ((Q
(APPROXIMATED-HISTOGRAM-QUANTILE
(APPROXIMATED-HISTOGRAM
(CL-DATA-STRUCTURES.ALGORITHMS:ON-EACH
(CL-DATA-STRUCTURES:IOTA-RANGE :FROM 1 :TO 100)
(ALEXANDRIA:RCURRY #'COERCE 'DOUBLE-FLOAT)))
0.5 0.95)))
(PROVE.TEST:OK (< 49 (FIRST Q) 51))
(PROVE.TEST:OK (< 94 (SECOND Q) 96)))
APPROXIMATED-HISTOGRAM-MEAN
Lambda List:
(HISTOGRAM)
Description:
Estimates mean of the distribution.
APPROXIMATED-HISTOGRAM-MEDIAN
Lambda List:
(HISTOGRAM)
Description:
Estimate median of the approximated distribution.
APPROXIMATED-HISTOGRAM-VARIANCE
Lambda List:
(HISTOGRAM)
Description:
Estimates variance of the distribution.
APPROXIMATED-HISTOGRAM-SUM
Lambda List:
(HISTOGRAM)
Description:
Estimates total sum of all elements passed to the histogram.
APPROXIMATED-HISTOGRAM-ADD
Lambda List:
(HISTOGRAM VALUE)
Description:
Adds a single element to the distribution.
APPROXIMATED-HISTOGRAM-COUNT-LOWER
Lambda List:
(HISTOGRAM VALUE &REST OTHER-VALUES &AUX (ALL-VALUES (CONS VALUE OTHER-VALUES)))
Description:
How many elements with value lower then the specified there is?
APPROXIMATED-HISTOGRAM-RANK-ORDER
Lambda List:
(HISTOGRAM VALUE &REST OTHER-VALUES)
Description:
What is the percantage of element in the histogram with value lower then the specified there is?
APPROXIMATED-HISTOGRAM-TRUNCATED-MEAN
Lambda List:
(HISTOGRAM FRACTION)
Description:
Estimate mean of the approximated distribution. Truncates outliers by the fraction.
UNION
Lambda List:
(FIRST-SKETCH &REST MORE-SKETCHES)
Description:
Creates new data-sketch from the provided. Can be used to join sketches built on different data chunks.
CLEAN-SKETCH
Lambda List:
(FUNCTION &REST ARGUMENTS &KEY HASHES HASH-FN DEPTH WIDTH MAXIMAL-BINS-COUNT
&ALLOW-OTHER-KEYS)
Description:
Creates a new, empty data-sketch that would be produced by the function. New data-sketch can be cloned and passed as :data-sketch. This allows to keep compatibility between results of call to the streaming function.
FUNDAMENTAL-DATA-SKETCH
Description: The base class of all data sketches. Instances of this class can be passed to streaming algorihms as initial states, cloned and combined into unions.