A Comprehensive Evaluation of Widespread Python Implementations

This section describes the composition of the different measured program suites. An in depth description of the contents of each suite can be found here. The test battery is composed of four levels of programs, depending on its complexity. The majority of the programs that compose each of these levels have been taken from already existing benchmarking suites and programs. Programs that use third party non-standard Python libraries or GUI toolkits were not considered, to maximize the amount of python implementations that are able to execute the programs. For the same reason, only standard Python code is measured, without considering any implementation-specific language extension.

A very small fraction of the measured programs have been created to measure certain language characteristics. This was done to test those programming language characteristics that were not covered with the existing ones. These custom tests were created following the structure of existing benchmark suites. All programs have been stripped from its screen output code to obtain more accurate performance results. When user input is needed to operate, the program have been modified to simulate it, so the user have not to input any information to continue its processing. The four levels of programs are:

Microbenchmarks
Benchmarks
Programs
Large Scale Applications

Microbenchmarks

Small programs that deal with only one language characteristic (integer arithmetic, double arithmetic …). They are used to compare the performance of Python implementations considering exclusively that particular characteristic. Already existing benchmarks were used whenever possible, although as we said some of them have to be developed in order to test some characteristics, especially metaprogramming ones. Several microbenchmarks have been ported from other languages with the help of the Java2Python. In total, we measured 309 different microbenchmarks.

Microbenchmark suite	Description
Tommti	A Python translation of the Tommti benchmarking suite: http://www.tommti-systems.de/main-Dateien/reviews/languages/benchmarks.html . This suite is composed by 13 microbenchmarks that measure typical language characteristics such as arithmetic operations and data structure usage among others.
Pybench	Pybench (http://svn.python.org/projects/python/trunk/Tools/pybench/ ) is a collection of 15 files that comprises 57 microbenchmarks, providing a standardized way to measure the performance of Python implementations. It measures different aspects of the Python programming language, allowing us to obtain a different performance number for each characteristic instead of providing a single overall performance estimation.
Java Grande	A Python translation of the Java Grande benchmarks (http://www2.epcc.ed.ac.uk/computing/research_activities/java_grande/index_1.html ). This benchmark suite has the purpose of measuring and comparing alternative Java execution environments in ways which are important to Grande applications. A Grande application is one which uses large amounts of processing, I/O, network bandwidth, or memory. They include applications in science and engineering among others. For our work, the three-layered structure of increasingly complex programs that compose this suite adapts well to our testing layout. Therefore we opted to translate this suite to Python and use its different layers to the corresponding parts of our testing suite. For our measurement work, we used the sequential version of these benchmarks. In this part, only the section 1 of the benchmarks (Low-level operations such as arithmetic and math library operations, method calls, casting…) were used, comprising 9 different code files that executes 83 different microbenchmarks.
Pypy translator microbenchmarks	The PyPy python implementation sources includes 10 code files in its source code distribution that holds up to 82 microbenchmarks. These test individual programming language characteristics in the same way as the other microbenchmarks we used:https://github.com/pypy/pypy/blob/master/pypy/translator/microbench/test_bltn.py
Custom Characteristics	We developed some microbenchmarks to measure those programming language characteristics that we considered that were not appropriately covered by the other microbenchmarks we used. In total we developed 24 microbenchmarks: Process execution: 4 microbenchmarks that test typical process handling primitives. Thread execution: 3 microbenchmarks that test typical thread handling primitives. Bit operations: 5 microbenchmarks that cover bit arithmetic operations in the same way that integer or floating-point arithmetic is covered by other benchmarks. Object comparison: 2 benchmarks that test object comparison by state. Complex data type arithmetic: 3 microbenchmarks that cover arithmetic operations with the Complex data type in the same way that integer or floating point arithmetic is covered by other benchmarks. Rational data type arithmetic: 3 microbenchmarks that cover arithmetic operations with the Rational data type in the same way that integer or floating point arithmetic is covered by other benchmarks. Complementary data access scenario: 1 microbenchmark that cover a specific variable read and write operations that are not tested by other benchmarks. Its structure is similar to the Tommti benchmarks. Set management: 1 microbenchmark that covers set management. Complementary String management use cases: 2 microbenchmarks that test two string handling use cases not covered by other testing suites.
Functional Programming	These benchmarks cover typical functional programming usage scenarios. We developed 7 microbenchmarks following the same structure of existing benchmarking suites. These benchmarks cover: Lambda function creation. Higher order functions usage. Lambda function invocation. Currying. map function. filter function. reduce function.
Metaprogramming	Based on the work developed in: How (and why) developers use the dynamic features of programming languages: the case of Smalltalk (Oscar Callaú, Romain Robbes, Éric Tanter, David Röthlisberger. Empirical Software Engineering DOI 10.1007/s10664-012-9203-2, 2012), and in our past research ( A hybrid class- and prototype-based object model to support language-neutral structural intercession; Francisco Ortin, Miguel A. Labrador, Jose M. Redondo; Information and Software Technology, Volume 56, Issue 2, February 2014, Pages 199-219), we developed a benchmarking suite composed by 46 microbenchmarks that covers the metaprogramming characteristics in the Python language, divided in four layers: Introspection features (8 microbenchmarks) Dynamic Class lookup (1 microbenchmarks) Dynamic method invocation (3 microbenchmarks) Dynamic access to attributes and variables (4 microbenchmarks) Intercession features(24 microbenchmarks) Dynamic class creation (4 microbenchmarks) Add attributes to instances and classes (8 microbenchmarks) Delete attributes to instances and classes (4 microbenchmarks) Add methods to instances and classes (4 microbenchmarks) Delete methods to instances and classes (2 microbenchmarks) Dynamic inheritance usage scenarios (2 microbenchmarks) Computational reflection features (7 microbenchmarks) __getattr__ usage (2 microbenchmarks) __setattr__ usage (2 microbenchmarks) __hasattr__ usage (2 microbenchmarks) __call__ usage (1 microbenchmarks) Dynamic code evaluation features (eval, exec,…)(4 microbenchmarks)

Benchmarks

This group of programs contains a compendium of popular open source benchmarks that are either coded in Python or translated from other programming languages. Each benchmark deals with several language characteristics at once to achieve its final result. So, while not being "real" programs per se, it allows us to test several characteristics in a more realistic environment. We used 61 benchmarks in our measurements. These are:

Benchmark suite	Description
Unladen swallow benchmarks	Unladen Swallow was an optimization branch of CPython, intended to be fully compatible and significantly faster. The project was abandoned due to failing to reach its performance milestones, but its source code is still available and contains a suite of performance programs ( http://code.google.com/p/unladen-swallow/source/browse/#svn%2Ftests%2Fperformance ). This suite is composed by benchmarking programs and tests that use commercial products. We included 24 of these benchmarks in this group.
The Computer Language Benchmarks Game (CLSB)	This is a compendium of benchmarking programs of different nature that have a python version. These are freely available on http://benchmarksgame.alioth.debian.org/ . For our tests we used 10 of these benchmarks.
Parrotbench	Parrot http://www.python.org/ftp/python/parrotbench/is a virtual machine designed to efficiently compile and execute bytecode for dynamic languages. Parrot currently hosts a variety of language implementations in various stages of completion, including Tcl, Javascript, Ruby, Lua, Scheme, PHP, Python, Perl 6, APL, and a .NET bytecode translator. Its distribution comes with 7 benchmarks that cover several dynamic language features, including metaprogramming. We adapted these benchmarks to be used in our measurements.
Jolden - TSP	The popular Jolden benchmark suite has a Java version that can be found in http://www.software.imdea.org/~marron/benchmarks/bench.html. We translated the TSP benchmark from this suite to be used in our measurements.
Java Grande	As we described previously, the Java Grande benchmarks are composed by 3 sections of increasingly complex benchmarks. We translated 6 benchmarks from section 2 and 3 benchmarks of Java Grande section 3, to be included in this group.
Crypto - AES	As cryptography algorithms are commonly used as benchmarking tools, we used a Python implementation of the AES (Advanced Encryption Standard) by Josh Davis (http://www.josh-davis.org ) as a benchmark. The code was obtained from http://www.nullege.com/codes/show/src@p@y@pytoshare-HEAD@pytoshare@lib@encryption@aes.py
Crypto - MD5	For the previous reason, a Python implementation of the MD5 hashing algorithm by Dinu C. Gherman ( http://python.net/~gherman/programs/md5py/ ) was also used as a benchmark
Intercessive Pybench	These are 3 benchmarks that use intercession, developed as part of our previous research. http://www.sciencedirect.com/science/article/pii/S0950584913001778
Dynamic inheritance	These are 4 benchmarks that use dynamic inheritance, developed as part of our previous research. http://www.sciencedirect.com/science/article/pii/S0950584913001778
Pystone	This benchmark is the Python version of the Dhrystone benchmark and is commonly used to compare different implementations of the Python programming language. Pystone is included in the standard CPython distribution.

Programs

This group is composed by applications that are created for a specific and realistic purpose: perform scientific calculations, popular problem solving, data manipulation and parsing, etc. These programs serve much different purposes and have different sizes, due to the different complexity required by each solved problem. The programs in this group are not specifically designed as benchmarks, but as Python applications that, independently of its size, were designed with a realistic purpose in mind.

To obtain programs that fit in this category we used the program collection provided by the Shed Skin Python compiler distribution. Shed Skin (http://code.google.com/p/shedskin/) is an experimental compiler that can translate pure, but implicitly statically typed Python programs into optimized C++. Although this distribution is promising, it lacks support for certain Python programming language features at this moment, so we decided not to include it yet in our measurements. However, the example library it provides (http://code.google.com/p/shedskin/downloads/list ) is excellent, and suits to our purposes.

The Shed Skin program collection is composed by a wide variety of programs of different nature and authors. We adapted the majority of them to perform our measures. The following table describes each one of the 54 programs we used:

Program	Description
Adatron Support Vector Machine	An Adatron Support Vector Machine with polynomial kernel placed in the public domain by Stavros Korokithakis.
Ambient Occlusion Renderer	This program is an ambient occlusion renderer written by Syoyo Fujita http://syoyo.wordpress.com/2009/01/26/ao-bench-is-evolving/
Ant colony optimization	This program generates a random array of distances between cities, then uses Ant Colony Optimization to find a short path traversing all the cities the Travelling Salesman Problem. By Eric Rollins http://eric_rollins.home.mindspring.com/
Arithmetic compressor encoder and decoder	Arithmetic coding compressor and uncompressor for binary data by David MacKay http://www.inference.phy.cam.ac.uk/mackay/python/compress/
Barnes-Hut	A Python implementation of the BH* Olden benchmark ( http://www.irisa.fr/caps/people/truong/M2COct99/Benchmarks/Olden/ ). It implements the Barnes-Hut benchmark that is described in: A hierarchical o(N log N) force-calculation algorithm, ; Barnes, J., Hut, P., Nature, 324:446-449, Dec. 1986
Block compression	A Huffman block compression algorithm by David MacKay http://www.inference.phy.cam.ac.uk/mackay/python/compress/
Brainfuck interpreter	A Brainfuck programming language (http://www.muppetlabs.com/~breadbox/bf/ ) interpreter by Philippe Biondi http://www.secdev.org/
Chaosgame fractals	This program creates chaosgame-like fractals. By Carl Friedrich Bolz https://bitbucket.org/cfbolz/compost/src/a11f10cc7026b4bb138d6f995c91cf990e0a7e23/python/chaosgame.py?at=default
Chess	Simple chess like speed test program written in Python by Jyrki Alakuijala
Color Patterns Shells	This program Models for the simulations of the color pattern on the shells of mollusks
Connect four game	This program implements the connect four (also known as four-in-a-row) game http://users.softlab.ece.ntua.gr/~ttsiod/score4.html
Conway game of life	This program is an implementation of the Game of Life, a cellular automaton devised by the British mathematician John Horton Conway. Its author is Francesco Frassinelli https://github.com/frafra
Dijkstra	Program that uses the Dijkstra shortest-distance algorithm by Gustavo J.A.M. Carneiro http://gjcarneiro.blogspot.com.es/2007/01/dijkstra-in-python.html
Dijkstra Bidirectional	Bidirectional Dijkstra/search algorithm extracted from the NetworkX program (http://networkx.lanl.gov/)
Genetic algorithm	A genetic algorithm implementation
Genetic algorithm 2	Another genetic algorithm implementation by Stavros Korokithakis
Go	An implementation of the popular Go game by Mark Dufour https://sites.google.com/site/markdufour/home
Hq2x filter	Python adaptation of the Hq2x filter demo program by Maxim Stepin http://spacy51.sp.funpic.de/hq_filters/hq2x.html
Jpeg decoder	A JPEG decoding program by Tong Lin http://www.cis.pku.edu.cn/faculty/vision/lintong/default.htm
Kanoodle puzzle	A program that finds solutions to the popular kanoodle puzzle game by David Austin http://www.ams.org/samplings/feature-column/fcarc-kanoodle
Lempel-Ziv compression	A program that performs compression of data using the Lempel-Ziv code by David Mackay http://www.inference.phy.cam.ac.uk/mackay/python/compress/#LZ
Linear algebra	A program that uses various linear algebra operations by Mladen Bestvina
Loop nodes	A program that performs loop recognition in a node graph by Leonardo Maffi http://www.fantascienza.net/leonardo/js/index.html
Mandelbrot	Program that generates and serializes Mandelbrot fractals by Tony Veijalainen http://www.daniweb.com/software-development/python/code/371844/mandelbrot-with-tkinter-and-shedskin
Mastermind	An implementation of the Mastermind game, by Sean McCarthy
Mastermind2	Another implementation of the Mastermind game by Leonardo Maffi http://code.activestate.com/recipes/496907/
Maze solver	A random maze generator/solver by ActiveState: http://code.activestate.com/recipes/496884/
Minimal global illumination renderer	This program is a minimal global illumination renderer (minilight) written by Harrison Ainsworth HXA7241 and Juraj Sukop http://www.hxa.name/minilight/
Neural network	This program implements a Back-Propagation Neural Networks. Its author is Neil Schemenauer http://arctrix.com/nas/python/
Othello game	This program is an implementation of the Othello (also named Reversi or Yang) game by Mark Dufour https://sites.google.com/site/markdufour/home
Path tracer	This program is an implementation of Path tracing, a way of solving a rendering equation using Montecarlo integration. It consist on a Python port of the works of Jonas Wagner (http://29a.ch/ )
Primes by sieve of Atkin	This program computes a finite list of prime numbers that are smaller than a predefined limit by using the sieve of Atkin. Its author is Steve Krenzel https://github.com/stevekrenzel
Probabilistic linear context-free rewriting systems parser for natural language	This program is a natural language parser for PLCFRS (probabilistic linear context-free rewriting systems), an extension of context-free grammar which rewrites tuples of strings instead of strings. Its author is Andreas van Cranenburgh http://staff.science.uva.nl/~acranenb/
RGB Converter	Conversion functions between RGB and other color systems
Richards task management	A Python translation by Mario Wolczko that implements a task management algorithm originally written by Dr. Martin Richards
RSync	This is a Python implementation of the Rsync algorithm: The rsync algorithm, Tridgell A., Mackerras, P.; Technical Report TR-CS-96-05, Canberra 0200 ACT, Australia, 1996. http://rsync.samba.org/
Rubik solver	This program implements a Rubik cube solver algorithm http://pastebin.com/KwGMujyB . It was adapted by Mark Dufour https://sites.google.com/site/markdufour/home
Rubik solver 2	Another Rubik's cube solver using Thistlethwaite's algorithm, originally implemented on C++ by Stefan Pochmann but translated to Python by Mark Dufour https://sites.google.com/site/markdufour/home
Satisfiability solver	This program implements a satisfiability solver. Its author is Mark Dufour https://sites.google.com/site/markdufour/home
SHA	This program is a Python implementation of the SHA-1 algorithm by J. Hall´en and L. Creighton https://seattle.poly.edu/svn/seattle/trunk/seattlelib/sha.repy
Simple mandelbrot	This is an implementation of the Mandelbrot fractal by Daniel Rosengren http://www.timestretch.com/article/mandelbrot_fractal_benchmark
Sokoban	An implementation of the Sokoban game
Solitaire encryption	Python implementation of Bruce Schneier's Solitaire Encryption algorithm by John Dell'Aquila. https://www.schneier.com/
Sudoku solvers	We included 5 different Sudoku solvers by various authors: Jack Ha Peter Goodspeed Peter Cock http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/sudoku Peter Norvig http://norvig.com/sudoku.html Ali Assaf http://www.cs.mcgill.ca/~aassaf9/
Tic Tac Toe	An implementation of the Tic Tac Toe game by Peter Goodspeed
Voronoi	This program implements a Voronoi diagram, which is a way of dividing space into a number of regions. A set of points is specified beforehand and for each seed there will be a corresponding region consisting of all points closer to that seed than to any other. The regions are called Voronoi cells.
Yopyra ray tracer	This program is a complete Python raytracer by Carlos Gonzalez Morcillo https://github.com/tomspur/shedskin/blob/master/examples/yopyra.py It was adapted from the examples included in the Shed Skin experimental Python implementation (http://code.google.com/p/shedskin/ )

Another block of testing programs is composed by the BioPython suite ( http://biopython.org/wiki/Main_Page ). BioPython is a set of freely available tools for biological computation written in Python. Its aim is to develop Python libraries and applications which address the needs of works in bioinformatics. This program suite has a high number of test programs included, that execute different functionalities of the suite using unit testing. Therefore, it makes use of metaprogramming features during its execution. We adapted 85 of these programs and individualized it as part of our measurements.

Large Scale Applications

The final group comprises realistic commercial-grade Python applications. It also includes programs that use the services of a commercial framework, library or API. These programs use a large amount of the program language characteristics we measure, and are a good opportunity to check the performance of the Python implementations we measured with real workloads.

Looking for programs to include in this category is difficult because we couldn’t include those that use third party non-standard libraries or GUI toolkits, in order to try to execute them in the majority of Python implementations we measured. In the end, we included 14 applications in this category:

Program	Description
2to3 python code tanslation	The 2to3 application is included in the Python 3.X distribution to help developers to adapt Python 2 code to the Python 3 specification. This test uses the 2to3 tool to translate itself. It was adapted from the unladen swallow examples. http://code.google.com/p/unladen-swallow/wiki/Benchmarks
Bazaar version control	Bazaar ( http://bazaar.canonical.com/en/ ) is a version control system that helps you track project history over time and to collaborate easily with others. This application uses this application at command line issuing some commands. It was adapted from the unladen swallow examples. http://code.google.com/p/unladen-swallow/wiki/Benchmarks
BlindElephant web application fingerprinter	The BlindElephant Web Application Fingerprinter attempts to discover the version of a web application by comparing static files at known locations against precomputed hashes for versions of those files in all available releases. http://blindelephant.sourceforge.net/ . Testing was done with a local sample web page.
Cog code generation tool	Runs Cog http://www.nedbatchelder.com/code/cog, a simple code generation tool written in Python. http://www.python.org/about/success/cog/
Django template system	Django template system (https://www.djangoproject.com/ ) to build a 150x150-cell HTML table. It was adapted from the unladen swallow examples. http://code.google.com/p/unladen-swallow/wiki/Benchmarks
Html5 library	This application parses the HTML 5 spec (http://svn.whatwg.org/webapps/index ) using html5lib. It was adapted from the unladen swallow examples. http://code.google.com/p/unladen-swallow/wiki/Benchmarks
Mako template system	Mako ( http://www.makotemplates.org/) is a web template library written in Python. It is used by popular web sites such as www.python.org. This application parses a sample web page with this template system. It was adapted from the unladen swallow examples. http://code.google.com/p/unladen-swallow/wiki/Benchmarks
Pyramid web framework	Runs the Pyramid web framework, a small and fast Python web application development framework http://pyramid.readthedocs.org/en/latest/
Rietveld code review	This application uses the Rietveld code review app http://code.google.com/p/rietveld/ along with the Django template system (https://www.djangoproject.com/ ). It was adapted from the unladen swallow examples. http://code.google.com/p/unladen-swallow/wiki/Benchmarks
Spam Bayesian classifier	This application runs a “canned” mailbox through a SpamBayes (http://spambayes.sourceforge.net/ ) ham/spam classifier. It was adapted from the unladen swallow examples. http://code.google.com/p/unladen-swallow/wiki/Benchmarks
Spitfire template system	This application uses the Spitfire template system (http://code.google.com/p/spitfire/ ) to build a 1000x1000-cell HTML table. It was adapted from the unladen swallow examples. http://code.google.com/p/unladen-swallow/wiki/Benchmarks
SQLMap Injection testing tool	Runs SQL Map, an open source penetration testing tool that automates the process of detecting and exploiting SQL injection flaws and taking over of database servers. http://sqlmap.org/
Volatility forensics framework	The Volatility Framework is a completely open collection of tools, implemented in Python under the GNU General Public License, for the extraction of digital artifacts from volatile memory (RAM) samples. https://code.google.com/p/volatility/. A sample RAM dump was used for its execution.
Web2Py web framework	Runs the Web2Py web framework, a free open source full-stack framework for rapid development of fast, scalable, secure and portable database-driven web-based applications. http://www.web2py.com/