Design of cl-data-structures aggregation protocol
Table of Contents
Background
The very first post I have published on this blog was describing what are cl-data-structures ranges. In this post, I want to show how the introduction of ranges allowed to build universal aggregation functions on top of them. Hopefully, this short article can be beneficial as a demonstration of my approach to design and particular choices in the cl-data-structures.
What is aggregation function?
It is not so different from the SQL, really. Pretty much everything that is supposed to transform the whole range of the values into a single value can be considered to be an aggregation function. Yup, that's right, here comes your MAX, MIN, AVERAGE, and so other functions. The usefulness of such functions shouldn't need explanation, but some of the design considerations are less obvious.
First of it is important to note that aggregation functions don't have to work in the very same way on every data. In fact, the major fraction of usefulness of such functions in SQL stems from the fact that aggregation can be controlled finely. SQL statement "select baz, MAX(foo) from bar GROUP BY baz" is a classic example demonstrating this angle, but it is worth noting that idea can go further than that.
Let's consider for instance calculating Hodges-Lehmann estimator. To obtain the value of this estimator we will have to:
- Obtain a series containing every possible pair of observations in the data set.
- Calculate average for each of the pairs.
- Obtain median out of average.
Hodges-Lehmann estimator is simple way to calculate the expected value of the parameter in the sample, even in the presence of outliers that would skew ordinary mean (mean for instance should not be used to describe expected salary in the country, but it is here and at this point I start to think that this malicious and intentional act). But what is the algorithmic complexity of the algorithm? Well, the first step clearly needs both quadratic time and space. Next one is linear in time. Next, we need to sort all values and find median: this would be log-linear by using optimal general purpose sorting algorithm or perhaps linear if we are feeling fancy enough for the bucket sort. If we want to just stick to the CL:SORT function we are ending up with worse then quadratic time. That is truly brutal for the large datasets.
But there is a solution: just run all your calculations on a manageable sample. Obtained value can obviously diverge from the value calculated on a full sample. To counteract this, we can repeat the procedure multiple times and select (for instance) median of all sample values. This procedure is called bootstrapping, and it is crucial when trying to calculate statistical parameters by employing algorithms like the one before. Yet, we can't write code like "SELECT Hodges-Lehmann(foo) from bar BOOTSTRAP SAMPLES 1000 SIZE 10000" in the SQL. But wouldn't it be a neat feature?
Also, let's not forget that some aggregation functions require more than just one step. Examples include everything that requires sorting, which naturally includes the MEDIAN and everything that includes the median as an intermediate step.
Mindset: Solve et coagula
Fundamentally any software architecture is just an exercise of skillfully dividing and combining parts of the problem in order to realize various goals. Therefore since we have at least a vague understanding of goals and the problem itself, let's consider how to divide the problem in order to achieve the desired result.
- We need a representation of the values that are about to be aggregated. This clearly should be just a range as it fits the description perfectly.
- We need the ability to invoke aggregation. A usual way to do so is to simply call a function with the proper argument. We will stick to it because I think that this is a type of the interface programmer would expect and want to use.
- W need means to control aggregation as described in the first paragraph. This is the least obvious part, and therefore it probably should be divided into more subproblems. Let's focus on this specific part for the moment.
Control flow
As Alan Kay would like to point out, the critical part of the object-oriented programming is communication, and since (as already pointed out) essence of all software architectures is combining and dividing it stands to reason that one should think how to control the aggregation by the means of the messages.
So how to construct communication protocol that can somehow express the aggregation process? At the very top of the abstraction stack, we have the function called by the user. This function should just act as an entry point for the internal machinery and therefore it makes sense to delegate everything from this function call into a new generic function: apply-aggregation-function. This function will accept the original range passed to the aggregation function as well as the object representing the function.
This perhaps requires some explanation to folks that are unaware of multiple dispatch object-oriented programming. Generic functions in Common Lisp act as declarations for methods. Methods in lisp are selected at the runtime, just like virtual methods in C++. The difference is how methods are selected. In C++ effective method is selected just based on the singular implicit this argument. In lisp, an effective method is selected on a combination of all arguments passed to the function. In the context of this article, it is important to remember that this allows both range class and aggregation function class to contribute to the apply-aggregation-function class.
Everything that happens calling apply-aggregation-function can be decomposed into the following abstract operations:
- make-state
- aggregate
- state-result
Make-state function shall construct a mutable state of an otherwise immutable function object. Its purpose is to hold any variables needed for the aggregation. It is worth noting that this way we will be able to construct independent aggregation states multiple times, so GROUP BY can be completely agnostic of the concrete function it is working with. An aggregate function will accept both the function representing object, the state constructed by make-state and a single element from the range. State-result is called to extract the final return value of the state for the user.
It is important to keep in mind that this approach allows us to always construct a fresh empty state for the aggregation function.
However, this is not sufficient for multipass aggregators. Here we have to additionally represent stages and therefore protocol becomes somewhat more complicated. We need to augment this already established set with the following functions:
- multi-aggregation-stages
- initialize-stage
Function multi-aggregation-stages called on the arguments on the aggregation-function and list of arguments passed to it by the user will return a list of the mutable-stages. Each stage is a mutable object on its own right and will hold its own state as a slot in the instance. Therefore it makes no sense to call make-state with multi-stage-aggregation-function.
Differences in handling those two function classes are strongly pronounced. We don't like that, and therefore we will combine both into one. We will introduce a new level of masking differences between those two approaches. It will be built around a new data type called aggregator and will consist of the following protocol:
- construct-aggregator
- apply-aggregation-function-with-aggregator
- expects-content-p
- pass-to-aggregation
- begin-aggregation
- end-aggregation
- extract-result
- aggregator-finished-p
An aggregator is a mutable object that will hold both function and either stages or state while presenting a uniform interface. Construct-aggregator accepts function representing the object and therefore it is possible to construct the desired version of the aggregator based on the class of the aggregation function representing object. The once constructed aggregator will be passed to apply-aggregation-function-with-aggregator function where it essentially becomes a state machine. We can modify it by calls to begin-aggregation; end-aggregation; pass-to-aggregation, query by using functions expects-content-p and aggregator-finished-p. We will extract result out of the aggregator by calling (wait for it…) extract-result, the same function will be used to obtain intermediate result out of the aggregation stage.
Things start to come together. Iterating over the data is missing but even so, we already see how aggregator will drive this process. We will simply keep passing all elements from the range into the aggregator with pass-to-aggregation until aggregator-finished-p will return T. We will also have to call begin-aggregation and end-aggregation around passing data to ensure that internal states of the aggregation algorithm can be initialized properly. The actual code is a very simple implementation of this idea.
(defmethod apply-aggregation-function (range (function aggregation-function) &rest all &key key &allow-other-keys) (let ((aggregator (construct-aggregator range key function nil all))) (apply #'apply-aggregation-function-with-aggregator aggregator range function all))) (defmethod apply-aggregation-function-with-aggregator ((aggregator fundamental-aggregator) range (function aggregation-function) &rest all &key &allow-other-keys) (declare (ignore all)) (iterate ;; calling begin-aggregation when aggregator-finished-p is any error, so we will check it here... (until (aggregator-finished-p aggregator)) (begin-aggregation aggregator) ;; and here, to ensure that begin-aggregation did not change aggregator-finished-p result. (until (aggregator-finished-p aggregator)) ;; sometimes the last stage just need call to end-aggregation (when (cl-ds.alg.meta:expects-content-p aggregator) (cl-ds:across range (lambda (x) (pass-to-aggregation aggregator x)))) (end-aggregation aggregator)) (extract-result aggregator))
At this point system is composed out of three distinctive layers.
- Ranges and accross function.
- Aggregation functions and states.
- Aggregator.
Once again I want to point out that whole design boils down truly to separating and combining. Function, state of the function and iteration were separated from each other and combined together with a more convenient way in the aggregator.
We didn't yet arrive at the complete and final design but the pieces are really there.
Control in the GROUP-BY level
Construct-aggregator accepts range for a reason. Although normally aggregator shouldn't care about range once it is constructed we still need a separate to the aggregation function way to control part of the aggregation. The answer is a proxy range, like the CL-DS:FORWARD-GROUP-BY-PROXY. This range does not affect in any way, shape or form data underneath, and exists purely to construct different aggregator.
Group by aggregator will simply check at each element in the range if the grouping value was already seen. If it was not, the new aggregator will be constructed just like it would be from the range beneath the proxy range and placed in the hash table. Next, we will simply pass the value to the sub-aggregator. Extracting result boils down to calling extract-result for each created aggregator and then returning it in the form of the range.
(defclass group-by-aggregator (cl-ds.alg.meta:fundamental-aggregator) ((%groups :initarg :groups :type hash-table :reader read-groups) (%outer-fn :initarg :outer-fn :reader read-outer-fn) (%group-by-key :initarg :group-by-key :reader read-key))) (defmethod cl-ds.alg.meta:pass-to-aggregation ((aggregator group-by-aggregator) element) ;; this code uses metabang-bind that provides bind macro. ;; This macro roles with-slots, with-accessors, destructuring-bind, multiple-value-bind ;; flet and some into one macro. It is pretty dope. (bind (((:slots %group-by-key %groups %outer-fn) aggregator) (selected (~>> element (funcall %group-by-key))) (group (gethash selected %groups))) (when (null group) (setf group (funcall %outer-fn) (gethash selected %groups) group) (cl-ds.alg.meta:begin-aggregation group)) (cl-ds.alg.meta:pass-to-aggregation group element))) (defmethod cl-ds.alg.meta:extract-result ((aggregator group-by-aggregator)) (bind (((:slots %key %groups %outer-fn) aggregator) (groups (copy-hash-table %groups))) (maphash (lambda (key aggregator) (setf (gethash key groups) (cl-ds.alg.meta:extract-result aggregator))) %groups) (make-hash-table-range groups))) (defmethod cl-ds.alg.meta:begin-aggregation ((aggregator group-by-aggregator)) (iterate (for (key value) in-hashtable (read-groups aggregator)) (begin-aggregation value))) (defmethod cl-ds.alg.meta:end-aggregation ((aggregator group-by-aggregator)) ;; iterate is just as dope as metabang-bind (iterate (for (key value) in-hashtable (read-groups aggregator)) (end-aggregation value)))
And that's all. Stay tuned for more information on this.