diff --git a/meetings/2021/session_07_generators_and_iterators/Generators and Iterators in Python 3.x.ipynb b/meetings/2021/session_07_generators_and_iterators/Generators and Iterators in Python 3.x.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..0a7ea9623636a6599335af1d1a3c31799d44592c
--- /dev/null
+++ b/meetings/2021/session_07_generators_and_iterators/Generators and Iterators in Python 3.x.ipynb
@@ -0,0 +1,692 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Jumping right in: Generator Expressions\n",
+    "\n",
+    "Let's say we need a lot of integers for later iteration; solving this with a list reserves all the required space in RAM immediately:\n",
+    "\n",
+    "(*Each list entry is an 8-byte pointer, hence one million entries are around 8MB; the integer objects themselves take up even more space.. *)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "N = int(1e6)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pointless_list = [i for i in range(N)]\n",
+    "print(f'we have {sys.getsizeof(pointless_list) / 1e6}MB')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Akin to list comprehensions, generator expressions can be constructed *inline*:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pointless_generator = (i for i in range(N)) # note the round brackets '()' instead of '[]'\n",
+    "print(f'we have {sys.getsizeof(pointless_generator) / 1e6}MB')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Generators have a tiny Memory Footprint\n",
+    "\n",
+    "Note the huge size difference (and try different values of `N`)!\n",
+    "\n",
+    "*(The actual memory consumption on your machine can still differ from what `sys.getsizeof` reports, but it will never be smaller.. please don't crash your browser)*\n",
+    "\n",
+    "Let's iterate over both objects and sum up all the values:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sum(pointless_list) == sum(pointless_generator)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Generators are lazy\n",
+    "So both objects obviously encode the same sequence of integers, but where does this huge difference in RAM consumption come from? A generator **merely stores the recipe** for generating its elements; nothing is evaluated or executed yet. This is often called **lazy execution**. Maybe this is best shown using a function:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def expensive_routine(num):\n",
+    "    print(f'Heavy RAM/CPU usage {num}')\n",
+    "    return num"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "list_of_results = [expensive_routine(i) for i in range(5)]\n",
+    "print(list_of_results)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "generator_of_results = (expensive_routine(i) for i in range(5))\n",
+    "print(generator_of_results)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We see no output after constructing the generator, indicating that indeed nothing was executed yet!\n",
+    "To actually trigger the execution of the function we need to iterate over the generator:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for result in generator_of_results:\n",
+    "    print(result)"
+   ]
+  },
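+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Laziness also pays off when we only consume part of the stream. As a minimal sketch, the standard library's `itertools.islice` only pulls the requested elements, so `expensive_routine` should run just twice here:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from itertools import islice\n",
+    "\n",
+    "lazy_results = (expensive_routine(i) for i in range(5))\n",
+    "# islice lazily takes the first two elements, the remaining three calls never happen\n",
+    "print(list(islice(lazy_results, 2)))"
+   ]
+  },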
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Syntactic Overkill\n",
+    "When a generator expression is the sole argument of a function call, we can leave out the enclosing `()`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sum(x**2 for x in range(3))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This is how the joblib example from EPUG session 5 actually worked syntactically:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def my_job_processor(n_jobs):\n",
+    "    # not sure how they pass the generator around to the workers explicitly..\n",
+    "    def queue(jobs):\n",
+    "        for result in jobs:\n",
+    "            print(result)\n",
+    "    return queue\n",
+    "\n",
+    "my_job_processor(n_jobs=3)(expensive_routine(i) for i in range(3))\n",
+    "# compare to:\n",
+    "# Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Generators are exhaustible!\n",
+    "As the elements encoded in the generator are produced one by one, there is no way to *go back*:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "square_gen = (x**2 for x in range(5))\n",
+    "for num in square_gen:\n",
+    "    print(num)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for num in square_gen:\n",
+    "    print(num)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "After iterating over the generator once, it is *exhausted*, meaning no more elements can be produced from it! Note that iterating over an exhausted generator produces no error (just no output); more on this in the following section."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Generators are not indexable\n",
+    "\n",
+    "This fits the exhaustibility property: elements are produced one by one, so there is no random access and the next cell raises a `TypeError`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "my_gen = (i for i in range(10))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "my_gen[3]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Membership testing with `in` works, but it exhausts part of the generator, so use it with utmost care (it is arguably an anti-pattern):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "7 in my_gen"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for num in my_gen:\n",
+    "    print(num)  # only the elements behind the matched 7 are left"
+   ]
+  },
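+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you genuinely need to traverse the same values twice, one option is the standard library's `itertools.tee`, which splits one iterator into independent ones. A minimal sketch (note that `tee` buffers values internally, so part of the memory advantage can be lost):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from itertools import tee\n",
+    "\n",
+    "gen_a, gen_b = tee(x**2 for x in range(5))\n",
+    "print(list(gen_a))\n",
+    "print(list(gen_b))  # an independent copy, not affected by exhausting gen_a"
+   ]
+  },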
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Mini Excursus: Iterators and Iterables\n",
+    "\n",
+    "These are actually fundamental building blocks of the Python programming language; they are everywhere, yet somewhat hidden *under the hood*.\n",
+    "\n",
+    "Formally an **iterator object** must have two methods: `__iter__()` and `__next__()`; the first one returns the iterator instance itself and the latter one returns the next element during iteration.\n",
+    "\n",
+    "An **iterable** is an object we can construct an iterator from, using `my_iterator = iter(iterable)`, and then grab the next element using `next(my_iterator)`. This works on all *container-like* objects (lists, tuples, strings, dictionaries, ...) which have an `__iter__()` method to construct the iterator. Let's try it out:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "some_string = 'ABC'\n",
+    "# let's get the iterator from this iterable:\n",
+    "my_iter = iter(some_string)\n",
+    "# now we can manually iterate over it using next()\n",
+    "print( next(my_iter) )\n",
+    "print( next(my_iter) )\n",
+    "print( next(my_iter) )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When we iterate too far, the `StopIteration` exception is raised:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print( next(my_iter) )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### for loops\n",
+    "This is how `for` loops actually work: they first construct the iterator from the object to be iterated over (the iterable) via `iter()`, then call `next()` until the `StopIteration` exception is raised and silently caught. Of course you can also supply an iterator directly to a for loop, as `iter()` then just returns that very iterator:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "some_dict = {'key-1' : 'value-1', 'key-2' : 'value-2'}\n",
+    "# we can loop directly over a dictionary\n",
+    "for key in some_dict:\n",
+    "    print(key, some_dict[key])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# but we can also first construct an iterator \n",
+    "my_dict_iter = iter(some_dict)\n",
+    "for key in my_dict_iter:\n",
+    "    print(key, some_dict[key])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that iterators can also be exhausted; using the same iterator again produces nothing, as we silently run into the `StopIteration` exception:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for key in my_dict_iter:\n",
+    "    print(key, some_dict[key])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Yet iterating over the same dictionary again of course works, because a new iterator is constructed *under the hood* by the for loop:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for key in some_dict: # here a new iterator is created\n",
+    "    print(key, some_dict[key])"
+   ]
+  },
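+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To make the protocol concrete, here is a minimal hand-written iterator class (the `Countdown` class is just made up for illustration): it implements both `__iter__()` and `__next__()` and raises `StopIteration` itself when it is done:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class Countdown:\n",
+    "    # a minimal hand-written iterator: counts down from start to 1\n",
+    "\n",
+    "    def __init__(self, start):\n",
+    "        self.current = start\n",
+    "\n",
+    "    def __iter__(self):\n",
+    "        return self  # an iterator returns itself\n",
+    "\n",
+    "    def __next__(self):\n",
+    "        if self.current <= 0:\n",
+    "            raise StopIteration  # signals that the iteration is over\n",
+    "        value = self.current\n",
+    "        self.current -= 1\n",
+    "        return value\n",
+    "\n",
+    "for number in Countdown(3):\n",
+    "    print(number)"
+   ]
+  },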
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's see how this works for files:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open('foo.bar', 'w') as Output:\n",
+    "    Output.write('Line 1\\n')\n",
+    "    Output.write('Line 2\\n')\n",
+    "\n",
+    "with open('foo.bar', 'r') as Input:\n",
+    "    file_iter = iter(Input)\n",
+    "    print( next(file_iter) )\n",
+    "    print( next(file_iter) )\n",
+    "    print( next(file_iter) )  # only two lines were written, so this raises StopIteration"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This is of course the same as doing:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open('foo.bar', 'r') as Input:\n",
+    "    for line in Input: # here the file iterator gets created\n",
+    "        print(line)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Again, the `StopIteration` exception was silently caught by the for loop. This is why it's so elegant and efficient to loop over a file *line by line*: the iterator only reads one line at a time into memory. Calling `Input.readlines()` would put the entire file contents into your RAM at once!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Maybe you have guessed it by now: **Generators are Iterators!** Iterators are more general, as every class which implements `__iter__()` and `__next__()` is an iterator, whereas generators are a bit like syntactic sugar, implementing these methods for us. We go into more detail in the next section.\n",
+    "\n",
+    "So the take-home message here is:\n",
+    "\n",
+    "**You never iterate over iterables (the 'data') directly; there is always an iterator constructed first, yielding element by element during iteration. This abstracts the 'data' away from the iteration process, making things like lazy execution possible**. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Building Generators\n",
+    "\n",
+    "Besides writing inline expressions using `()`, we can be more explicit and expressive with Python's `yield` statement to build *generator functions* returning generators: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def fibonacci_numbers(nums):\n",
+    "    x, y = 0, 1\n",
+    "    for _ in range(nums):\n",
+    "        x, y = y, x+y\n",
+    "        yield y\n",
+    "\n",
+    "# get the first 10 fibonacci numbers:\n",
+    "fib10 = fibonacci_numbers(10)\n",
+    "for number in fib10:\n",
+    "    print(number)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `yield` statement pauses the internal for loop and **yields** the results element by element. Note that the internal variables keep their respective values in between iteration steps: the state gets remembered!\n",
+    "\n",
+    "Is what is returned also really an iterator though? Let's find out:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fib5 = fibonacci_numbers(5) # the old one was exhausted anyways..\n",
+    "iter(fib5) == fib5 # so yeah, iter() returns our generator itself!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "next(fib5) # and we can call next() on it -> it's an iterator alright :)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So our generator fulfills all required iterator properties, as opposed to, say, a simple list:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "some_list = [1,2,3]\n",
+    "iter(some_list) == some_list # a list is an iterable, not an iterator"
+   ]
+  },
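+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a small additional check (a sketch using the standard library's `collections.abc` types): our generator counts as an `Iterator`, while the list is only an `Iterable`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from collections.abc import Iterable, Iterator\n",
+    "\n",
+    "# every Iterator is also an Iterable, but not the other way around\n",
+    "print(isinstance(fib5, Iterator), isinstance(fib5, Iterable))\n",
+    "print(isinstance(some_list, Iterator), isinstance(some_list, Iterable))"
+   ]
+  },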
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Just for completeness, it's possible to mix `yield` and `return` statements in a generator function to have more control over when to raise the `StopIteration` exception:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def fibonacci_numbers(smaller_than):\n",
+    "    x, y = 0, 1\n",
+    "    while True:\n",
+    "        x, y = y, x+y\n",
+    "\n",
+    "        if y > smaller_than:\n",
+    "            return # returning ends the generator, raising StopIteration under the hood\n",
+    "\n",
+    "        yield y\n",
+    "\n",
+    "\n",
+    "list(fibonacci_numbers(smaller_than=1000)) # exhausts the generator by iterating till the end"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Infinite Stream Processing\n",
+    "\n",
+    "Generators can be used to process theoretically infinite amounts of data, due to their *laziness*. With a slight modification we can get all Fibonacci numbers (at least one by one):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def all_fibonacci_numbers():\n",
+    "    x, y = 0, 1\n",
+    "    while True:\n",
+    "        x, y = y, x+y\n",
+    "        yield y\n",
+    "\n",
+    "\n",
+    "all_fibs = all_fibonacci_numbers()\n",
+    "# get the first 25 fibonacci numbers:\n",
+    "for _ in range(25):\n",
+    "    print( next(all_fibs) )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This generator will never raise the `StopIteration` exception, which does not violate the iterator definition. However, a direct for loop over it will never terminate! As the state is remembered, we can simply ask for the 26th Fibonacci number:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "next(all_fibs) # and so on to infinity.."
+   ]
+  },
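+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The standard library's `itertools` module offers similar lazy building blocks. As a minimal sketch, `count()` yields an infinite stream of integers and `takewhile()` cuts a stream off once a condition fails:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from itertools import count, takewhile\n",
+    "\n",
+    "# sum of all square numbers below 100, drawn from an infinite stream of integers\n",
+    "squares = (n**2 for n in count(1))\n",
+    "sum(takewhile(lambda s: s < 100, squares))"
+   ]
+  },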
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Pipelining Generators\n",
+    "\n",
+    "Generators can be chained together; this allows for seamless stream processing. Let's say we want to find the first 10 Fibonacci numbers which are divisible by a certain, yet variable, number.\n",
+    "We can't know in advance how many Fibonacci numbers we would have to generate for each candidate, but we can still chain it with another generator to process this potentially infinite stream: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def find_divisables(numbers_to_check, divisor):\n",
+    "\n",
+    "    checked = 0 # to keep track..\n",
+    "    for num in numbers_to_check:\n",
+    "\n",
+    "        checked += 1\n",
+    "        if num % divisor == 0:\n",
+    "            print(f'Checked {checked} numbers..')\n",
+    "            yield num\n",
+    "\n",
+    "    # only ever gets printed with finite input..\n",
+    "    print(f'Checked all {len(numbers_to_check)} numbers!')\n",
+    "\n",
+    "\n",
+    "# sanity check with a finite input\n",
+    "div_by_3 = find_divisables([1,2,9,11,12,17,18,22,23], divisor=3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "next(div_by_3) "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So with this we can find the Fibonacci numbers divisible by our candidate divisor **without needing to know beforehand** how many numbers we have to scan, hence potentially saving a lot of resources:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "all_fibs = all_fibonacci_numbers() # the new generator yielding potentially all Fibonacci numbers\n",
+    "div_by = find_divisables(all_fibs, divisor=23) # nothing was executed yet.."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "next(div_by)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "\n",
+    "- Generators are a special kind of the ubiquitous and very *pythonic* Iterators\n",
+    "- They can be created either with inline expressions `()` or with generator functions sporting the `yield` statement\n",
+    "- They allow for on-demand aka *lazy* execution ↔ only load into RAM what you really need at the moment\n",
+    "- Infinite stream processing capabilities\n",
+    "- They allow for much clearer and more readable code compared to throwing `while` and `break` statements around everywhere.."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Resources\n",
+    "\n",
+    "- https://www.programiz.com/python-programming/generator\n",
+    "- https://www.programiz.com/python-programming/iterator\n",
+    "- https://www.analyticsvidhya.com/blog/2020/05/python-iterators-and-generators/\n",
+    "\n",
+    "Author: Gregor Mönke"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}