For what it is worth, here is my take on the article. A really overwhelming list. Nice read-through, but for those who are interested, the most useful components discussed were probably:
- CPython. Of course, that is what we all use.
- PyPy. This is an interesting acceleration if you do not need things like NumPy and a lot of other common libraries. The acceleration is maybe 9X in my experience. However, C code or good use of Numba can often get 100X.
- MicroPython. Not tried but seems cool if you need a really small Python. Presumably not exactly compatible because of missing libraries.
- Pyston. Have not tried but seemed interesting from their discussion of the "pyston_lite_autoload" thing. Have no idea if it is useful.
- Cython. Lots of hoopla about this. Good software, but in my experience you do not get much speedup until you statically declare types. When I did that I got about 24X, and then playing with prange and the OpenMP features I got 75X. Not a bad speedup. However, it does not look so good compared with writing C code or using Numba, mainly because the speedups from those methods seem easier to get, and I got as much as 121X with them. Cython is just complex to use and then does not give you the full speed you should be able to get, or at least that was my experience.
- Numba. Numba and NumPy used in the right situations can give 121X speed improvements and performance similar to parallelized and vectorized C code. Actually, for some reason it was faster than my C code. This combo is super. Everyone should know about Numba.
- Nuitka. Very handy deployment tool. In my experience it runs at basically the same speed as CPython. Well, I got about a 9% improvement, which is almost nothing. So do not be fooled into thinking that it will give you big speed improvements. Still, a very nice tool to have as part of your packaging and deployment process.
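For anyone who has not seen Numba, here is a minimal sketch of the kind of plain numeric loop it compiles. The function `sum_of_squares` is a hypothetical example, not the code behind the numbers above, and the `ImportError` fallback is only an assumption added so the snippet still runs (without any speedup) when Numba is not installed.

```python
# Minimal sketch: a plain Python loop compiled with Numba's @njit.
try:
    from numba import njit
except ImportError:
    # Assumption for illustration: fall back to plain Python if Numba is
    # not installed, so the sketch still runs (just without the speedup).
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    # Hypothetical example function for illustration only.
    total = 0.0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))  # 285.0
```

The first call includes compilation time, so any timing comparison should be done on a second call. How much faster it is than CPython depends heavily on the loop and the hardware.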
Since I talked about C code, note that there are three ways to integrate C code into Python: ctypes, CFFI, and the standard C extension method. I found ctypes to be about 107X, CFFI 108X, and the standard method about 112X for my code on my hardware, with C code compiled using auto-parallelization, auto-vectorization, fast-math, and maybe other settings. My point is that the speeds of these are about the same, though the standard method is just a little faster. So you can pretty much do whichever is easier.
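To give a feel for the ctypes route, here is a minimal sketch that calls `strlen` from the system C library instead of a custom compiled library. It assumes a Unix-like system, and the wrapper name `c_strlen` is made up for illustration. With your own C code you would load your shared library by path in the same way.

```python
import ctypes
import ctypes.util

# Assumption: a Unix-like system. find_library may return None, in which
# case CDLL(None) opens the current process, which already links libc.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Return the length of a byte string using C's strlen."""
    return libc.strlen(data)


print(c_strlen(b"hello"))  # 5
```

Declaring `argtypes` and `restype` is worth the two extra lines, since ctypes otherwise assumes an `int` return value and does no argument checking.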
Anyway, those are my thoughts. Hope they make some sense.