Passing Along the Cranberry Sauce (Pure Python vs Numpy vs Numba)
Speeding up your program can be difficult, but it is sometimes a worthwhile endeavor.
This week, I will explore 538's Riddler with a simulation study and then talk about scaling up the simulation.
TL;DR:
Code | Time (s) | Speedup |
---|---|---|
Pure Python | 12.994 | 1x |
Pure Python + Preallocated | 14.516 | 0.9x |
Numpy | 60.416 | 0.22x |
Numpy + Preallocated | 5.995 | 2.17x |
Numba | 1.976 | 6.58x |
Numba once compiled | 0.319 | 40.73x |
Numba + Parallelized | 0.315 | 41.25x |
Riddler
From Patrick Lopatto comes a riddle we can all be thankful for:
To celebrate Thanksgiving, you and 19 of your family members are seated at a circular table (socially distanced, of course). Everyone at the table would like a helping of cranberry sauce, which happens to be in front of you at the moment.
Instead of passing the sauce around in a circle, you pass it randomly to the person seated directly to your left or to your right. They then do the same, passing it randomly either to the person to their left or right. This continues until everyone has, at some point, received the cranberry sauce.
Of the 20 people in the circle, who has the greatest chance of being the last to receive the cranberry sauce?
Pure Python
We can start by simulating one round of cranberry passing using pure python. We create a list with one entry per person at the table. The zeroth (or first) person starts with the cranberry sauce and passes it to their left or right with `direction = random.randint(0, 1) * 2 - 1`, repeating until the sauce has reached everyone, i.e. `visited_count == size`. Since the table is circular, we can use modular arithmetic (i.e. `last_pos = (pos + direction) % size`) so that the person at `visited[0]` can pass to the person at `visited[size - 1]` in one step and vice versa.
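As a quick sanity check of the wraparound (a throwaway snippet, not part of the simulation code):

```python
size = 20
print((0 - 1) % size)   # 19: a pass from person 0 wraps around to person 19
print((19 + 1) % size)  # 0: a pass from person 19 wraps back to person 0
```

Putting it all together, one round of the simulation looks like this: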
```python
import random

def pure_python_simulation(size: int) -> int:
    """Simulate one round of cranberry sauce passing."""
    visited = [False] * size
    visited_count = 1
    visited[0] = True
    pos, last_pos = 0, 0
    while visited_count != size:
        # Pass to the left (-1) or right (+1) neighbor with equal probability.
        direction = random.randint(0, 1) * 2 - 1
        last_pos = (pos + direction) % size
        pos = last_pos
        if not visited[last_pos]:
            visited_count += 1
            visited[last_pos] = True
    return last_pos
```
As with any simulation study, we need to conduct many simulations. You can look at other posts for why the answer is \(\frac{1}{19} \approx 0.05263\); I am more interested in how quickly this approach converges to a reasonable approximation. For example, if we run 1000 rounds of the pure python simulation, the estimated probability of being the last person to receive the cranberry sauce is wrong in the first significant figure (the hundredths place) about half of the time.
```python
import sim_pure
sim_pure.main(20, 1000)
## [0.0, 0.049, 0.057, 0.051, 0.05, 0.04, 0.054, 0.067, 0.056, 0.059, 0.048, 0.057, 0.054, 0.043, 0.055, 0.05, 0.056, 0.053, 0.055, 0.046]
```
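The `main` driver inside `sim_pure.py` isn't shown above; here is a minimal sketch of what it plausibly does, assuming it simply tallies which seat receives the sauce last and returns relative frequencies (the actual source may differ):

```python
def main(size: int, rounds: int) -> list:
    """Tally how often each seat is the last to receive the sauce."""
    counts = [0] * size
    for _ in range(rounds):
        counts[pure_python_simulation(size)] += 1
    # Seat 0 starts with the sauce, so it can never be last and stays at 0.0.
    return [count / rounds for count in counts]
```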
While this example has nice structure that makes assuming equal probability of all outcomes intuitive, more complicated probability models don't have that luxury. So how many rounds of simulation do we need before the result is accurate to the first significant figure for all 20 people at the table? About 50,000, which takes our current modeling pipeline about 13 seconds to simulate. We can benchmark and profile the code with the `cProfile` package to see where any bottlenecks are occurring.
cProfile.run("sim_pure.main(20, 50000)")
57010889 function calls in 13.012 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 13.012 13.012 <string>:1(<module>)
9492906 3.599 0.000 9.336 0.000 random.py:174(randrange)
9492906 1.588 0.000 10.924 0.000 random.py:218(randint)
9492906 3.851 0.000 5.738 0.000 random.py:224(_randbelow)
1 0.010 0.010 13.012 13.012 sim_pure.py:23(main)
1 0.000 0.000 0.000 0.000 sim_pure.py:29(<listcomp>)
50000 2.078 0.000 13.002 0.000 sim_pure.py:7(simulation)
1 0.000 0.000 13.012 13.012 {built-in method builtins.exec}
9492906 0.382 0.000 0.382 0.000 {method 'bit_length' of 'int' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
18989260 1.504 0.000 1.504 0.000 {method 'getrandbits' of '_random.Random' objects}
It seems that over 70% of our processing time is spent in calls to `random`, i.e. deciding which direction to pass the cranberry sauce. This is not too surprising, as much of modern computer architecture is optimized for deterministic, vectorized processing through a number of design decisions. We can try to play to that architecture by being smarter about the cache and preallocating the random calls.
Pure Python + preallocating
Preallocating the random calls is easier than it sounds. For our case, we make `prealloc_size` random draws up front and only redraw when we run out. The optimal size of that cache depends on a large number of variables (CPU, operating system, other workloads, etc.). I quickly tried a couple of different scenarios and found that `prealloc_size=10` was optimal, which implies that we probably won't get much lift from doing this.
```python
...
i = 0
direction = [random.randint(0, 1) * 2 - 1 for _ in range(prealloc_size)]
while visited_count != size:
    try:
        last_pos = (pos + direction[i]) % size
    except IndexError:
        # Cache exhausted: refill it and retry from the start.
        i = 0
        direction = [random.randint(0, 1) * 2 - 1 for _ in range(prealloc_size)]
        last_pos = (pos + direction[i]) % size
    i += 1
...
```
Unfortunately, this actually slowed us down a bit. Let's look at numpy.
cProfile.run("sim_pure_vec.main(20, 50000, 10)")
59152505 function calls in 14.574 seconds
Numpy
Numpy is my go-to package for fast array/matrix manipulation in python. The syntax is almost identical to pure python, and converting the pure python implementation to numpy is quick. For the most part, we only need to change lists (`[]`) into numpy arrays (`np.array()`) and be a bit more diligent with the typing.
```python
import numpy as np

def simulation(size: int) -> int:
    """Simulate one round of cranberry sauce passing."""
    visited = np.zeros(size, dtype=bool)
    visited_count = 1
    visited[0] = True
    pos, last_pos = 0, 0
    while visited_count != size:
        direction = np.random.randint(2, dtype=int) * 2 - 1
        last_pos = (pos + direction) % size
        pos = last_pos
        if not visited[last_pos]:
            visited_count += 1
            visited[last_pos] = True
    return last_pos
```
Benchmarking 50,000 simulations with `cProfile` leads to a 60-second run, over 4.5x slower than pure python. That's not good... The random drawing is still slowing down the process, but there is also much more overhead from calling numpy on scalars, which slows the whole thing down further. Let's quickly try preallocating and see if that speeds up results.
cProfile.run("sim_numpy.main(20, 50000)")
113942506 function calls in 60.995 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
...
9486875 30.586 0.000 57.066 0.000 {method 'randint' of 'numpy.random.mtrand.RandomState' objects}
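To see that overhead directly, here's a quick micro-benchmark (numbers will of course vary by machine) comparing scalar draws from the stdlib and numpy, plus a single vectorized draw:

```python
import timeit

# Per-call cost of drawing one random direction, stdlib vs numpy scalar.
py_time = timeit.timeit("random.randint(0, 1)", setup="import random", number=1_000_000)
np_time = timeit.timeit("np.random.randint(2)", setup="import numpy as np", number=1_000_000)
print(f"stdlib random.randint: {py_time:.2f}s per 1M calls")
print(f"numpy scalar randint:  {np_time:.2f}s per 1M calls")

# Drawing 1M values in one vectorized call amortizes numpy's overhead.
vec_time = timeit.timeit("np.random.randint(2, size=1_000_000)", setup="import numpy as np", number=1)
print(f"numpy vectorized:      {vec_time:.2f}s for 1M values")
```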
Numpy + preallocating
Benchmarking 50,000 simulations with `cProfile` leads to a 6-second run, less than half the time of pure python! Numpy + preallocating only looks startling if you don't consider what numpy is optimized for. The problem at hand is dominated by the random process, not by holding large arrays in memory or manipulating them. Accordingly, the only large array in our simulation is the preallocated result of `randint`.
```python
...
direction = np.random.randint(2, size=1000, dtype=int) * 2 - 1
while visited_count != size:
    try:
        last_pos = (pos + direction[i]) % size
    except IndexError:
        # Cache exhausted: refill it and retry from the start.
        i = 0
        direction = np.random.randint(2, size=1000, dtype=int) * 2 - 1
        last_pos = (pos + direction[i]) % size
    i += 1
    pos = last_pos
...
```
cProfile.run("sim_numpy_vec.main(20, 50000, 1000)")
1150006 function calls in 5.922 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
50000 0.019 0.000 0.269 0.000 <__array_function__ internals>:2(prod)
1 0.000 0.000 5.922 5.922 <string>:1(<module>)
50000 0.023 0.000 0.096 0.000 _dtype.py:319(_name_includes_bit_suffix)
50000 0.055 0.000 0.178 0.000 _dtype.py:333(_name_get)
50000 0.008 0.000 0.008 0.000 _dtype.py:36(_kind_name)
50000 0.004 0.000 0.004 0.000 fromnumeric.py:2838(_prod_dispatcher)
50000 0.028 0.000 0.231 0.000 fromnumeric.py:2843(prod)
50000 0.056 0.000 0.203 0.000 fromnumeric.py:73(_wrapreduction)
50000 0.016 0.000 0.016 0.000 fromnumeric.py:74(<dictcomp>)
100000 0.029 0.000 0.041 0.000 numerictypes.py:293(issubclass_)
50000 0.029 0.000 0.073 0.000 numerictypes.py:365(issubdtype)
1 0.045 0.045 5.922 5.922 sim_numpy_vec.py:30(main)
1 0.000 0.000 0.000 0.000 sim_numpy_vec.py:36(<listcomp>)
50000 5.041 0.000 5.877 0.000 sim_numpy_vec.py:7(simulation)
1 0.000 0.000 5.922 5.922 {built-in method builtins.exec}
50000 0.015 0.000 0.015 0.000 {built-in method builtins.getattr}
200000 0.019 0.000 0.019 0.000 {built-in method builtins.issubclass}
50000 0.015 0.000 0.245 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
50001 0.039 0.000 0.039 0.000 {built-in method numpy.zeros}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
50000 0.014 0.000 0.014 0.000 {method 'format' of 'str' objects}
50000 0.004 0.000 0.004 0.000 {method 'items' of 'dict' objects}
50000 0.351 0.000 0.797 0.000 {method 'randint' of 'numpy.random.mtrand.RandomState' objects}
50000 0.112 0.000 0.112 0.000 {method 'reduce' of 'numpy.ufunc' objects}
So numpy can potentially be 2x faster, but we'd need to figure out the optimal preallocation size, which is not great: it depends heavily on the hardware and specifics of the machine. Even worse, a poor preallocation results in speeds much worse than using plain numpy.
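If you do want to tune it for your own machine, a rough sweep over candidate sizes might look like this (a sketch; I'm assuming `sim_numpy_vec.main` takes the preallocation size as its third argument, as in the profiling calls above):

```python
import time
import sim_numpy_vec

# Machine-dependent: time the same workload at several preallocation sizes.
for prealloc_size in (10, 100, 1000, 10000):
    start = time.perf_counter()
    sim_numpy_vec.main(20, 50000, prealloc_size)
    elapsed = time.perf_counter() - start
    print(f"prealloc_size={prealloc_size}: {elapsed:.2f}s")
```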
Numba
Numba is a great python package that translates sections of python code into machine code to speed up the program. Numba can also set up simple concurrent pipelines.
For both of the approaches below, the code uses the decorator `@numba.njit(parallel=True)`, which attempts to translate the decorated python function into machine code. Numba then uses a JIT (Just-In-Time) compiler to compile and run the python + Numba code. Numba may require stronger typing than python, although for our code, Numba uses a similar level of typing as Numpy.
Single core Numba
With good modular coding practices, a programmer can indicate which parts of the code ought to be translated to a lower-level language. For our case, the new simulation function is almost identical to before.
```python
import numpy as np
from numba import njit, bool_

@njit(parallel=True)
def simulation(size: int) -> int:
    """Simulate one round of cranberry sauce passing."""
    visited = np.zeros(size, dtype=bool_)
    visited_count = 1
    visited[0] = True
    pos, last_pos = 0, 0
    while visited_count != size:
        direction = np.random.randint(2) * 2 - 1
        last_pos = (pos + direction) % size
        pos = last_pos
        if not visited[last_pos]:
            visited_count += 1
            visited[last_pos] = True
    return last_pos
```
The compiling run is already the fastest yet at 2 seconds, a 6.5x speedup! Once the program has run once, subsequent calls are even faster: the post-compilation run takes 0.3 seconds, an over 40x speedup. Note also that there are far fewer function calls once the program is compiled.
cProfile.run("sim_numba.main(20, 50000, 1)")
3025357 function calls (2825134 primitive calls) in 1.976 seconds
cProfile.run("sim_numba.main(20, 50000, 1)")
4 function calls in 0.315 seconds
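If you want to separate compile time from run time explicitly, you can time a throwaway warm-up call before the real benchmark; a minimal sketch using the `simulation` function above:

```python
import time

start = time.perf_counter()
simulation(20)  # first call triggers JIT compilation
print(f"compile + first run: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
simulation(20)  # already compiled, runs at machine-code speed
print(f"subsequent run: {time.perf_counter() - start:.6f}s")
```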
Multicore Numba
We can also leverage simple multicore tasks with Numba. In our case, each simulation is completely independent of the others: the result of any simulation does not influence the result of another. Using that fact, we can distribute the simulations across different workers to try to make the program run even faster. As with any parallelized work, you need to weigh the roughly fixed time cost of distributing the work against the variable benefit of having multiple workers.
```python
...
for i in numba.prange(threads):
    for _ in range(rounds):
        # Each worker tallies into its own row, so no write conflicts.
        count2d[i, simulation(size)] += 1
count = np.sum(count2d, axis=0)
...
```
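Filling in the elided parts, a complete driver might look like the following sketch (my reconstruction, not the exact source; the `threads` and `rounds` arguments mirror the profiling calls below):

```python
import numpy as np
import numba
from numba import njit

@njit(parallel=True)
def main(size: int, rounds: int, threads: int):
    # One row of counts per worker avoids write conflicts between threads.
    count2d = np.zeros((threads, size), dtype=np.int64)
    for i in numba.prange(threads):
        for _ in range(rounds):
            count2d[i, simulation(size)] += 1
    # Combine the per-worker tallies and normalize to frequencies.
    count = np.sum(count2d, axis=0)
    return count / (rounds * threads)
```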
Unfortunately for us, it seems there wasn't much of a difference between the single-core and parallelized simulations. Most likely the simulation is too small, so the fixed cost of parallelizing the work is of similar size to the benefit from it.
cProfile.run("sim_numba.main(20, 5000, 10)")
4 function calls in 0.344 seconds
cProfile.run("sim_numba.main(20, 500, 100)")
4 function calls in 0.345 seconds
cProfile.run("sim_numba.main(20, 50, 1000)")
4 function calls in 0.332 seconds
Conclusion
Numba is a great tool if you have nicely modular python code that needs to run faster. While Numpy was not too useful in this simulation, I still use Numpy extensively for programming in Python that needs linear algebra. Pure python is still reasonably fast, and its `random` package remains competitive with Numpy's own version in some situations. As always, choose the tool that best fits the job at hand.
Code | Time (s) | Speedup |
---|---|---|
Pure Python | 12.994 | 1x |
Pure Python + Preallocated | 14.516 | 0.9x |
Numpy | 60.416 | 0.22x |
Numpy + Preallocated | 5.995 | 2.17x |
Numba | 1.976 | 6.58x |
Numba once compiled | 0.319 | 40.73x |
Numba + Parallelized | 0.315 | 41.25x |
If you want to look at the code, check out my GitHub here.
Edit: Thank you to the Riddler Classic for choosing my submission as the winner. Congratulations to 👏 Daniel Silva-Inclan 👏 of Chicago, Illinois, winner of the most recent Riddler Classic.