Define a reusable and lazy pipeline

The problem
I have the idea of a pipeline: it is defined once, it can be reused, and when called it changes its state but does not evaluate eagerly, because it could be "fed" from an infinite source of data.
For example:
- all integers that are multiples of 13 and have more ones than zeroes in their binary representation
- indices of bytes read from /dev/random that are ASCII characters
Perhaps there are two separate issues I’m thinking about here:
- Does the pipeline create intermediate collections when processing items? For example, in Scala every intermediate operation (each map and filter) creates a collection
- When exactly is the pipeline execution triggered?
- Can the pipeline computation be reused with a different collection? Does it need a collection at all?
In Scala
Note: see this insightful comment on SO: https://stackoverflow.com/questions/31664918/does-scala-has-intermediate-terminal-ops-as-java8-has
In the collections library, Scala makes a distinction between strict and lazy evaluation. The mechanism for lazy evaluation is called a view. By default, collections evaluate eagerly at every step.
Let’s create an effectful operation that doubles its input and prints it:
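A minimal sketch of such a helper (the name double is my choice):

```scala
def double(x: Int): Int = {
  val r = x * 2
  println(s"double($x) = $r")  // the side effect lets us see when evaluation happens
  r
}
```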
Now, each map or filter on a List evaluates eagerly and immediately produces a new list, and the map “traverses” all elements:
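A sketch, assuming an effectful double helper like the one described above:

```scala
def double(x: Int): Int = { println(s"double($x)"); x * 2 }

val xs = List(1, 2, 3, 4)
val doubled = xs.map(double)        // prints double(1) … double(4) right here
val big     = doubled.filter(_ > 4) // a second full traversal, another intermediate List
```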
Iterator
We can also create an iterator. Note how the first value gets through the map, even though we haven’t explicitly triggered it. However, the first element is not actually missing: we will get it eventually when we call next or use other terminal operations that cause traversal and evaluation of the pipeline operations:
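A sketch of the iterator version, with the same assumed double helper:

```scala
def double(x: Int): Int = { println(s"double($x)"); x * 2 }

val it = List(1, 2, 3, 4).iterator.map(double)
// in a REPL session the first element may already be printed at this point
it.next() // forces evaluation and returns the first mapped element
```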
Reusability - Iterator is not reusable
The iterator approach allows us to build a pipeline and evaluate elements when needed. However, we are not able to reuse the pipeline: once the iterator is exhausted, we’re done. We need to create a new iterator, repeating the pipeline code. It seems the pipeline needs to have a collection attached to work on. Or, in other words, there is no such thing as a pipeline here, I guess.
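The exhaustion can be sketched like this (again with an assumed effectful double helper):

```scala
def double(x: Int): Int = { println(s"double($x)"); x * 2 }

val it = List(1, 2, 3).iterator.map(double)
it.toList  // consumes everything: List(2, 4, 6)
it.hasNext // false: the iterator is spent, we must rebuild the whole pipeline to run it again
```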
Being lazy - scala views
Views make the pipeline lazy - but this happens only if we keep the whole pipeline together and don’t store e.g. a map result in a val (then we get no laziness, but eager creation of a strict collection). This is the part that was very confusing to me :)
If I create a view and try to map, I am forcing the evaluation! As you can see, all elements of the list get evaluated and printed, and xv holds a SeqView of already-computed values:
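A sketch of the session; note that whether the elements get printed at this point depends on the Scala version and on the REPL displaying the value, so treat the behavior as illustrative:

```scala
def double(x: Int): Int = { println(s"double($x)"); x * 2 }

val xs = List(1, 2, 3, 4)
val xv = xs.view.map(double)
// in the REPL session described here, double(...) fired for every element,
// and xv ended up holding already-computed values
```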
Lesson: keep the entire chain in one expression if what I need is the laziness.
View and reusability
Unlike an iterator, a view is reusable - it holds a reference to the original collection, and it reconstructs the pipeline on each terminal operation, doing minimal, lazy work on each call (starting fresh from the beginning, processing only what is necessary) - the map alone is not triggered:
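A sketch of reusing a view, with an assumed effectful double helper:

```scala
def double(x: Int): Int = { println(s"double($x)"); x * 2 }

val xs = List(1, 2, 3, 4)
val v  = xs.view
v.map(double).take(2).toList // runs double only for the elements actually needed
v.map(double).take(2).toList // works again: the view re-reads the source list
```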
The answer to reusability is: yes, views can be reused. They are still attached to the source collection - you cannot define a pipeline and apply it to a different collection.
LazyList
Docs: LazyList
LazyLists can be infinite, are lazy, and their values are memoized. What does that mean? The pipeline is evaluated in such a way that elements that were previously accessed will not be evaluated again.
The use case for LazyList is infinite collections which we don’t want to evaluate eagerly (like in the beautiful Fibonacci example); for finite collections they don’t give any benefit over views.
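Roughly the Fibonacci example from the LazyList documentation:

```scala
// an infinite, lazy, memoized sequence of Fibonacci numbers
lazy val fibs: LazyList[BigInt] =
  BigInt(0) #:: BigInt(1) #:: fibs.zip(fibs.tail).map { case (a, b) => a + b }

fibs.take(8).toList // List(0, 1, 1, 2, 3, 5, 8, 13)
```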
Conclusions for Scala
For lazy, efficient pipelines it is a good idea to use a view, and remember to keep the chain *in one expression*.
Reusability of a pipeline across different collections is not possible: the best I can think of is to create a function that takes a collection or an iterator. Similar to Java Streams, we can talk about terminal and intermediate operations, but intermediate does not mean lazy in this case, because the Scala collections API is strict by default and will evaluate everything unless you opt out using views (now think about Rust - it is lazy by default).
At least this is my understanding.
Python
In Python, mapping and filtering operations leave you with lazy iterator objects which by themselves do not trigger anything - nothing is evaluated until consumed.
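A sketch (helper names are mine):

```python
def double(x):
    print(f"double({x})")
    return x * 2

m = map(double, [1, 2, 3, 4])    # a lazy map object; prints nothing yet
f = filter(lambda x: x > 4, m)   # a lazy filter object; still nothing
print(type(m), type(f))          # <class 'map'> <class 'filter'>
```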
Evaluation happens element by element, only when something consumes those iterators. One needs to explicitly create a list out of the iterators created by map or filter.
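For example (the helper is my own):

```python
def double(x):
    print(f"double({x})")
    return x * 2

# list() consumes the whole chain; double() prints for all four inputs
result = list(filter(lambda x: x > 4, map(double, [1, 2, 3, 4])))
print(result)  # [6, 8]
```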
Note that all elements get processed - map and filter don’t do any short-circuiting…
I remember that Python has itertools module, perhaps I can find some answers there?
Built-in map and filter suffer from the reusability problem: they produce just single-use iterators:
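A sketch of the single-use behavior:

```python
m = map(lambda x: x * 2, [1, 2, 3])
first = list(m)   # consumes the iterator
second = list(m)  # already exhausted
print(first, second)  # [2, 4, 6] []
```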
The Fix - wrap it
If we wrap the source in a function, then each time the function is called a new iterator is created, allowing the pipeline to be reused:
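A sketch of such a wrapper (names are my own):

```python
def pipeline(xs):
    # each call builds a fresh pair of lazy iterators over xs
    return filter(lambda x: x > 4, map(lambda x: x * 2, xs))

a = list(pipeline([1, 2, 3, 4]))  # [6, 8]
b = list(pipeline([5, 6]))        # [10, 12]
```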
If the pipeline is complex (or we want to keep complex state, e.g. when traversing a tree), and we want multiple conditions, early exit, or laziness, then we should use generators, together with itertools.islice (Python docs), which is an equivalent of take:
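A sketch of a generator-based pipeline over an infinite source (the generator is my own example):

```python
from itertools import count, islice

def doubled_evens(src):
    # a generator: lazy, can keep state, and can use conditions or early exit
    for x in src:
        if x % 2 == 0:
            yield x * 2

# islice is the take-equivalent: pull just 3 elements from an infinite source
taken = list(islice(doubled_evens(count(1)), 3))
print(taken)  # [4, 8, 12]
```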
Python - summary
Python’s map/filter are lazy but single-use. The language, as far as I know, does not have a “pipeline” construct - the most similar thing I can think of is a function which can be used to start new iterators when needed: a callable is the reusable abstraction. It cannot, however, be independent of the source: you always need a collection or iterator to iterate over.
I once came quite close to the idea of a structure of computations that is independent of the collection - but never actually tried it. It is time to dive into…
Clojure
… clojure’s transducers.
Perhaps the closest concept to what I am searching for is the transducer, which I encountered in Clojure. It seems like a definition or composition of operations which does not require a collection and can exist by itself. The documentation states that operations such as map, filter or take (for others, see also the cheatsheet) create transducers:
What the docs say is:
Most sequence functions included in Clojure have an arity that produces a transducer. This arity omits the input collection; the inputs will be supplied by the process applying the transducer. Note: this reduced arity is not currying or partial application.
Deriving from the documentation example, I can create the xf transducer, which is a composition of other transducers (comp composes right-to-left, so the mental model for the final transducer is that it executes left-to-right: first filtering, then mapping, and finally taking two elements). The final reducing operation is + here; the transduced sequence before addition can be retrieved by using the (into [] ...) form:
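Something along these lines (the source vector is my choice, picked so that the later numbers in the text work out):

```clojure
(def xf (comp (filter odd?) (map inc) (take 2)))

(transduce xf + [3 4 5 6 7 8]) ;; => 10  (3 and 5 pass the filter, inc to 4 and 6, sum)
(into [] xf [3 4 5 6 7 8])     ;; => [4 6]
```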
What happens when I define xf such that it prints an element very early, say before filter? Would it then print everything? At the time the transducer is defined it certainly will not. I expect it to fire for all vec elements, however. I also expect that if I place the print after take, it will print two filtered and inc-ed values, 4 and 6:
Let’s check. Define a: it takes a value, prints it and returns it.
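A sketch:

```clojure
(defn a [x]
  (println "a:" x) ;; side effect to observe when the element flows through
  x)
```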
The transducer xf is now composed with (map a) as the first element of the pipeline. Wow, see what happened: not only did it not call map on each element of the sequence - it executed a exactly 3 times, the minimal number of calls needed to actually retrieve the two elements required by the last element of the pipeline…:
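A sketch, with the same assumed source vector as before:

```clojure
(def xf (comp (map a) (filter odd?) (map inc) (take 2)))

(into [] xf [3 4 5 6 7 8])
;; a: 3
;; a: 4
;; a: 5
;; => [4 6]
```

a fires only for 3, 4 and 5: once take has its two elements, the reduction stops and the rest of the vector is never touched.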
Interesting.
In Rust
This behavior is similar to what we get in Rust - there are iterators everywhere :) For example, the program below constructs an infinite collection (wait! it is an iterator!) and traverses it:
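A sketch (the concrete operations are my choice):

```rust
fn main() {
    // (1..) is an infinite iterator; nothing runs until a consumer pulls values
    let result: Vec<i32> = (1..)
        .map(|x| x * 2)
        .filter(|x| x % 3 == 0)
        .take(3)
        .collect();
    println!("{result:?}"); // [6, 12, 18]
}
```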
So the example with a simple collection is as follows:
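A sketch, with a println inside the map so we can see how many elements are actually processed:

```rust
fn main() {
    let v = vec![1, 2, 3, 4, 5, 6];
    let result: Vec<i32> = v
        .iter()
        .map(|x| {
            println!("mapping {x}");
            x * 2
        })
        .take(3)
        .collect();
    println!("{result:?}");
    // mapping 1
    // mapping 2
    // mapping 3
    // [2, 4, 6]
}
```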
which behaves nicely - it does not go through the whole collection and gives exactly what I expected: three elements.
Making the pipeline more generic can be done by making the collection’s element type generic, or by creating a function that takes an Iterator<Item = i32> - which allows us to use the pipeline on both a vec and a set (note the non-deterministic output):
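A sketch of such a function (the pipeline body is my choice):

```rust
use std::collections::HashSet;

// the pipeline is defined once, over anything that yields i32
fn pipeline(src: impl Iterator<Item = i32>) -> Vec<i32> {
    src.map(|x| x * 2).filter(|x| *x > 4).take(3).collect()
}

fn main() {
    let v = vec![1, 2, 3, 4, 5];
    let s: HashSet<i32> = v.iter().copied().collect();
    println!("{:?}", pipeline(v.into_iter())); // [6, 8, 10]
    println!("{:?}", pipeline(s.into_iter())); // same values, order varies run to run
}
```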
or by also returning an iterator - in this case I need to force collection (note the call to .collect::<Vec<i32>>()) to be able to print the results, as the iterator itself does not implement Display:
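A sketch (the duplicate values in the vec are there to show them vanishing in the set):

```rust
use std::collections::HashSet;

// return the pipeline itself as a lazy iterator
fn pipeline(src: impl Iterator<Item = i32>) -> impl Iterator<Item = i32> {
    src.map(|x| x * 2).filter(|x| *x > 4).take(3)
}

fn main() {
    let v = vec![1, 2, 2, 3, 4, 5];
    let s: HashSet<i32> = v.iter().copied().collect();
    // collect before printing: the iterator does not implement Display
    println!("{:?}", pipeline(v.into_iter()).collect::<Vec<i32>>());
    println!("{:?}", pipeline(s.into_iter()).collect::<Vec<i32>>());
}
```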
Obviously, duplicates disappeared on HashSet construction, and the set’s output order changes on each execution of the program.
Conclusions
Let’s go back to the initial question: can I define a pipeline once, keep it lazy, and reuse it with different inputs?
This question has two separate parts: laziness and reusability, and different languages answer them in a different way.
Python is lazy: map and filter return iterators which are evaluated lazily, element by element, when they are consumed (by terminal operations, like list()), and those iterators are single-use: once exhausted, they cannot be re-run. To reuse, you can define a function; to short-circuit, you can use itertools.islice.
Scala defaults to strict evaluation - every map and filter on a List creates a full intermediate collection. You can opt out using .view, which defers all operations until a terminal operation (like foreach) processes what’s needed. A view-based pipeline can be reused: it holds a reference to the original collection and rebuilds the lazy chain on each terminal call - but it needs to be a single expression! (Confusing part: if you assign an intermediate result to a val, it forces evaluation at that point and breaks laziness.)
Rust makes laziness the default - iterators evaluate element by element, no intermediate collections are created, and the pipelines short-circuit nicely with take or take_while. If you define a pipeline inside a function that takes an impl Iterator, it is (somewhat) independent from the input: you can pass a vec, a set, or an infinite range. Very elegant.
Clojure has the concept of a transducer - a pipeline definition that is completely decoupled from the source and from the collection of results: you define complex transducers by combining simpler transducers using comp, and then you can apply the result to different sinks (transduce, into). It evaluates lazily and minimally - only as many elements are processed as needed. Of all four languages I experimented with, Clojure’s transducers come closest to my idea of an independent pipeline.
Depending on what you need, you can choose what you want! What I needed for sure was the above journey - it let me put the concept of pipelined computations in order in my head. Jumping between languages and using all of them at the same time sometimes causes the concepts to leak between areas of my brain, causing a bit of confusion :D
Happy hacking!