Commit cf5908e7 authored by Gregor Moenke's avatar Gregor Moenke
add generator session

parent 1e270474
%% Cell type:markdown id: tags:
## Jumping right in: Generator Expressions
Let's say we need a lot of integers for later iteration; creating them as a list reserves all the needed space in RAM immediately:
(*Each list entry is a 64-bit (8-byte) reference in CPython, hence a list of one million integers weighs in at around 8 MB..*)
%% Cell type:code id: tags:
``` python
import sys
N = int(1e6)
```
%% Cell type:code id: tags:
``` python
pointless_list = [i for i in range(N)]
print(f'we have {sys.getsizeof(pointless_list) / 1e6}MB')
```
%% Cell type:markdown id: tags:
Akin to list comprehensions, generator expressions can be constructed *inline*:
%% Cell type:code id: tags:
``` python
pointless_generator = (i for i in range(N)) # note the round brackets '()' instead of '[]'
print(f'we have {sys.getsizeof(pointless_generator) / 1e6}MB')
```
%% Cell type:markdown id: tags:
### Generators have a tiny Memory Footprint
Note the huge size difference (and try different values of `N`)!
*(Note that the actual memory consumption on your machine can still differ from what `sys.getsizeof` reports, yet it will never be smaller.. please don't crash your browser)*
Let's iterate over both objects and sum up all the values:
%% Cell type:code id: tags:
``` python
sum(pointless_list) == sum(pointless_generator)
```
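%% Cell type:markdown id: tags:
To double-check that the generator's size really is independent of `N`, here is a small sketch (the exact byte count reported by `sys.getsizeof` is a CPython implementation detail):
%% Cell type:code id: tags:
``` python
import sys
# two generator expressions over vastly different ranges..
gen_small = (i for i in range(10))
gen_big = (i for i in range(10**8))
# ..still occupy the same (tiny) amount of memory
print(sys.getsizeof(gen_small) == sys.getsizeof(gen_big))
```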
%% Cell type:markdown id: tags:
### Generators are lazy
So both objects obviously encode the same sequence of integers, but where does this huge difference in RAM consumption come from? A generator **merely stores the recipe** for producing its elements; nothing is evaluated/executed yet. This is often called **lazy execution**. Maybe this is best shown by using a function:
%% Cell type:code id: tags:
``` python
def expensive_routine(num):
    print(f'Heavy RAM/CPU usage {num}')
    return num
```
%% Cell type:code id: tags:
``` python
list_of_results = [expensive_routine(i) for i in range(5)]
print(list_of_results)
```
%% Cell type:code id: tags:
``` python
generator_of_results = (expensive_routine(i) for i in range(5))
print(generator_of_results)
```
%% Cell type:markdown id: tags:
We see no output after constructing the generator, indicating that indeed nothing was executed yet! To actually trigger the execution of the function we need to iterate over the generator:
%% Cell type:code id: tags:
``` python
for result in generator_of_results:
    print(result)
```
%% Cell type:markdown id: tags:
### Syntactic Overkill
When supplied as the sole function argument, we can leave out the enclosing `()`:
%% Cell type:code id: tags:
``` python
sum(x**2 for x in range(3))
```
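%% Cell type:markdown id: tags:
Laziness also pays off here: functions like `any()` short-circuit, so with a generator argument only as many elements are produced as actually needed. A small sketch:
%% Cell type:code id: tags:
``` python
# any() stops pulling from the generator as soon as it sees a truthy value,
# so even this huge range is only consumed up to the first match (x = 3)
result = any(x**2 > 4 for x in range(10**9))
print(result)
```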
%% Cell type:markdown id: tags:
This is how the joblib example from EPUG session 5 actually worked syntactically:
%% Cell type:code id: tags:
``` python
def my_job_processor(n_jobs):
    # not sure how they pass the generator around to the workers explicitly..
    def queue(jobs):
        for result in jobs:
            print(result)
    return queue
my_job_processor(n_jobs=3)(expensive_routine(i) for i in range(3))
# compare to:
# Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10))
```
%% Cell type:markdown id: tags:
### Generators are exhaustible!
As the elements encoded in the generator are produced one-by-one, there is no way to *go back*:
%% Cell type:code id: tags:
``` python
square_gen = (x**2 for x in range(5))
for num in square_gen:
    print(num)
```
%% Cell type:code id: tags:
``` python
for num in square_gen:
    print(num)
```
%% Cell type:markdown id: tags:
After iterating over the generator once it is *exhausted*, meaning no more elements can be produced from it! Note that iterating over an exhausted generator produces no error (just no output); more on this in the following section.
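%% Cell type:markdown id: tags:
If you do need to iterate more than once, you can either materialize the generator into a list first, or split the stream with `itertools.tee`. A rough sketch:
%% Cell type:code id: tags:
``` python
import itertools
square_gen = (x**2 for x in range(5))
# tee() returns two independent iterators over the same stream
# (afterwards the original generator should not be touched anymore)
gen_a, gen_b = itertools.tee(square_gen)
a_vals, b_vals = list(gen_a), list(gen_b)
print(a_vals)
print(b_vals)
```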
%% Cell type:markdown id: tags:
### Generators are not indexable
This follows from the exhaustibility property:
%% Cell type:code id: tags:
``` python
my_gen = (i for i in range(10))
```
%% Cell type:code id: tags:
``` python
my_gen[3] # raises a TypeError: generators are not subscriptable
```
%% Cell type:markdown id: tags:
This works, but will exhaust part of the generator, so use it with utmost care (arguably it's even an anti-pattern):
%% Cell type:code id: tags:
``` python
7 in my_gen
```
%% Cell type:code id: tags:
``` python
for num in my_gen:
    print(num)
```
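%% Cell type:markdown id: tags:
If something like indexed access is really needed, `itertools.islice` can jump to a position, but note that it consumes (and discards) all elements before it. A sketch:
%% Cell type:code id: tags:
``` python
from itertools import islice
my_gen = (i for i in range(10))
# grab the element at 'index' 3; elements 0..2 are consumed and thrown away
fourth = next(islice(my_gen, 3, 4))
print(fourth)
```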
%% Cell type:markdown id: tags:
## Mini Excursus: Iterators and Iterables
These are actually fundamental building blocks of the Python programming language; they are everywhere, yet somewhat hidden *under the hood*.
Formally, an **iterator object** must have two methods: `__iter__()` and `__next__()`. The first returns the iterator instance itself and the latter yields the next element during iteration.
An **iterable** is an object which supports the iterator protocol, meaning we can construct an iterator from it using `my_iterator = iter(iterable)` and grab the next element using `next(my_iterator)`. This works on all *container-like* objects (lists, tuples, strings, dictionaries, ...) which have an `__iter__()` method to construct the iterator. Let's try it out:
%% Cell type:code id: tags:
``` python
some_string = 'ABC'
# let's get the iterator from this iterable:
my_iter = iter(some_string)
# now we can manually iterate over it using next()
print( next(my_iter) )
print( next(my_iter) )
print( next(my_iter) )
```
%% Cell type:markdown id: tags:
When we iterate too far, the `StopIteration` exception is raised:
%% Cell type:code id: tags:
``` python
print( next(my_iter) )
```
%% Cell type:markdown id: tags:
### for loops
This is how `for` loops actually work: they first construct the iterator from the object to be iterated over (the iterable) via `iter()`, then call `next()` until the `StopIteration` exception is raised and silently caught. Of course you can also supply an iterator directly to a for loop, as `iter()` then just returns that very iterator:
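%% Cell type:markdown id: tags:
The mechanism described above can be sketched by desugaring a for loop by hand:
%% Cell type:code id: tags:
``` python
# hand-rolled equivalent of: for item in 'ABC': print(item)
collected = []
iterator = iter('ABC')
while True:
    try:
        item = next(iterator)
    except StopIteration:
        break # the for loop catches this silently
    collected.append(item)
    print(item)
```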
%% Cell type:code id: tags:
``` python
some_dict = {'key-1' : 'value-1', 'key-2' : 'value-2'}
# we can loop directly over a dictionary
for key in some_dict:
    print(key, some_dict[key])
```
%% Cell type:code id: tags:
``` python
# but we can also first construct an iterator
my_dict_iter = iter(some_dict)
for key in my_dict_iter:
    print(key, some_dict[key])
```
%% Cell type:markdown id: tags:
Note that iterators can also be exhausted; using the same iterator again will produce nothing as we silently run into the `StopIteration` exception:
%% Cell type:code id: tags:
``` python
for key in my_dict_iter:
    print(key, some_dict[key])
```
%% Cell type:markdown id: tags:
Yet iterating over the same dictionary again of course works because a new iterator is constructed *under the hood* by the for loop:
%% Cell type:code id: tags:
``` python
for key in some_dict: # here a new iterator is created
    print(key, some_dict[key])
```
%% Cell type:markdown id: tags:
Let's see how this works for files:
%% Cell type:code id: tags:
``` python
with open('foo.bar', 'w') as Output:
    Output.write('Line 1\n')
    Output.write('Line 2\n')
with open('foo.bar', 'r') as Input:
    file_iter = iter(Input)
    print( next(file_iter) )
    print( next(file_iter) )
    print( next(file_iter) ) # only two lines in the file -> raises StopIteration
```
%% Cell type:markdown id: tags:
This is of course the same as doing:
%% Cell type:code id: tags:
``` python
with open('foo.bar', 'r') as Input:
    for line in Input: # here the file iterator gets created
        print(line)
```
%% Cell type:markdown id: tags:
Again, the `StopIteration` exception was silently caught by the for loop. This is why it's super elegant and efficient to loop over a file *line-by-line*: the iterator only reads one line at a time into memory. Calling `Input.readlines()` instead would put the whole file content into your RAM at once!
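%% Cell type:markdown id: tags:
To sketch this, summing up the line lengths lazily keeps only one line in memory at a time (recreating the `foo.bar` file from above so the cell is self-contained):
%% Cell type:code id: tags:
``` python
with open('foo.bar', 'w') as Output:
    Output.write('Line 1\n')
    Output.write('Line 2\n')
# the generator expression pulls the lines one at a time from the file iterator
with open('foo.bar', 'r') as Input:
    total_chars = sum(len(line) for line in Input)
print(total_chars)
```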
%% Cell type:markdown id: tags:
Maybe you have guessed it by now: **Generators are Iterators!** Iterators are more general, as every class which implements `__iter__()` and `__next__()` is an iterator, whereas generators are a bit like syntactic sugar, implementing these methods for us. We go into more detail in the next section.
So the take-home message here is:
**You never iterate over iterables (the 'data') directly; there is always an iterator constructed first, yielding element by element during iteration. This abstracts the 'data' from the iteration process, making things like lazy execution possible.**
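%% Cell type:markdown id: tags:
To make the protocol tangible, here is a minimal hand-written iterator class (a hypothetical `CountDown`, just for illustration):
%% Cell type:code id: tags:
``` python
class CountDown:
    '''Counts down from start to 1, implementing the iterator protocol.'''
    def __init__(self, start):
        self.current = start
    def __iter__(self):
        return self # iterators return themselves
    def __next__(self):
        if self.current <= 0:
            raise StopIteration # signals the end of iteration
        value = self.current
        self.current -= 1
        return value
print(list(CountDown(3)))
```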
%% Cell type:markdown id: tags:
## Building Generators
Besides writing inline expressions using `()`, we can be more explicit and expressive with Python's `yield` statement and build *generator functions* returning generators:
%% Cell type:code id: tags:
``` python
def fibonacci_numbers(nums):
    x, y = 0, 1
    for _ in range(nums):
        x, y = y, x+y
        yield y
# get the first 10 fibonacci numbers:
fib10 = fibonacci_numbers(10)
for number in fib10:
    print(number)
```
%% Cell type:markdown id: tags:
The `yield` statement pauses the internal for loop and **yields** the results element-by-element. Note that the internal variables keep their respective values in between iteration steps: the state gets remembered!
But is what is returned really an iterator? Let's find out:
%% Cell type:code id: tags:
``` python
fib5 = fibonacci_numbers(5) # the old one was exhausted anyways..
iter(fib5) == fib5 # so yeah, iter() returns our generator itself!
```
%% Cell type:code id: tags:
``` python
next(fib5) # and we can call next() on it -> it's an iterator allright :)
```
%% Cell type:markdown id: tags:
So our generator fulfills all required iterator properties, as opposed to, say, a simple list:
%% Cell type:code id: tags:
``` python
some_list = [1,2,3]
iter(some_list) == some_list # a list is an iterable, not an iterator
```
%% Cell type:markdown id: tags:
Just for completeness: it's possible to mix `yield` and `return` statements in a generator function to have more control over when to raise the `StopIteration` exception:
%% Cell type:code id: tags:
``` python
def fibonacci_numbers(smaller_than):
    x, y = 0, 1
    while True:
        x, y = y, x+y
        if y > smaller_than:
            return # this raises the correct exception of the iterator protocol
        yield y
list(fibonacci_numbers(smaller_than=1000)) # exhausts the generator by iterating till the end
```
%% Cell type:markdown id: tags:
### Infinite Stream Processing
Generators can be used to process theoretically infinite amounts of data, thanks to their *laziness*. With a slight modification we can get all Fibonacci numbers (at least one by one):
%% Cell type:code id: tags:
``` python
def all_fibonacci_numbers():
    x, y = 0, 1
    while True:
        x, y = y, x+y
        yield y
all_fibs = all_fibonacci_numbers()
# get the first 25 fibonacci numbers:
for _ in range(25):
    print( next(all_fibs) )
```
%% Cell type:markdown id: tags:
This generator will never raise the `StopIteration` exception, which does not violate the iterator definition. However, a direct for loop over it would never terminate! As the state is remembered, we can simply ask for the 26th Fibonacci number:
%% Cell type:code id: tags:
``` python
next(all_fibs) # and so onto infinity..
```
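%% Cell type:markdown id: tags:
A convenient way to take a finite chunk from such an infinite stream is `itertools.islice` (redefining the generator here so the cell is self-contained):
%% Cell type:code id: tags:
``` python
from itertools import islice

def all_fibonacci_numbers():
    x, y = 0, 1
    while True:
        x, y = y, x+y
        yield y

# take the first 5 elements of the infinite stream
first_five = list(islice(all_fibonacci_numbers(), 5))
print(first_five)
```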
%% Cell type:markdown id: tags:
### Pipelining Generators
Generators can be chained together, which allows for seamless stream processing. Let's say we want to find the first 10 Fibonacci numbers which are divisible by a certain, yet variable, number. We can't know in advance how many Fibonacci numbers we have to generate for each candidate divisor, but we can still chain the stream with another generator to process this potentially infinite input:
%% Cell type:code id: tags:
``` python
def find_divisables(numbers_to_check, divisor):
    checked = 0 # to keep track..
    for num in numbers_to_check:
        checked += 1
        if num % divisor == 0:
            print(f'Checked {checked} numbers..')
            yield num
    # only ever gets reached with finite input..
    # (note: counting via 'checked' also works when the input is a generator,
    #  where len() would raise a TypeError)
    print(f'Checked all {checked} numbers!')
# sanity check with a finite input
div_by_3 = find_divisables([1,2,9,11,12,17,18,22,23], divisor=3)
```
%% Cell type:code id: tags:
``` python
next(div_by_3)
```
%% Cell type:markdown id: tags:
So with this we can find all the Fibonacci numbers divisible by our candidate **without needing to know beforehand** how many we have to scan, potentially saving a lot of resources:
%% Cell type:code id: tags:
``` python
all_fibs = all_fibonacci_numbers() # the new generator yielding potentially all Fibonacci numbers
div_by = find_divisables(all_fibs, divisor=23) # nothing was executed yet..
```
%% Cell type:code id: tags:
``` python
next(div_by)
```
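%% Cell type:markdown id: tags:
The same pattern works for any chain of stages; here is a small self-contained pipeline sketch (with made-up helper names `integers` and `squares`):
%% Cell type:code id: tags:
``` python
from itertools import islice

def integers():
    # infinite stream: 1, 2, 3, ..
    n = 0
    while True:
        n += 1
        yield n

def squares(nums):
    # second stage: transforms the incoming stream lazily
    for n in nums:
        yield n * n

# chain the stages and take a finite slice from the infinite pipeline
first_squares = list(islice(squares(integers()), 5))
print(first_squares)
```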
%% Cell type:markdown id: tags:
## Summary
- Generators are a subclass of the ubiquitous and very *pythonic* Iterators
- They can be created either by inline expressions using `()` or with generator functions sporting the `yield` statement
- They allow for on-demand aka *lazy* execution ↔ only load into RAM what you really need at the moment
- They enable infinite stream processing
- They allow for much clearer and more readable code compared to throwing around `while` and `break` and so on..
%% Cell type:markdown id: tags:
## Resources
- https://www.programiz.com/python-programming/generator
- https://www.programiz.com/python-programming/iterator
- https://www.analyticsvidhya.com/blog/2020/05/python-iterators-and-generators/
Author: Gregor Mönke