
7. Statistical & Machine Learning

This section is all about getting machines to learn to do something rather than explicitly programming them to do it. The field tends to deal with pattern matching a lot and is heavily based on math and statistics. Technically Connectionism falls under this category, but it is such a large sub-field that I'm keeping it in a separate section.

7.1 Libraries

Libraries or frameworks used for writing machine learning systems.


The Cognitive Foundry is a modular Java software library for the research and development of cognitive systems. It contains many reusable components for machine learning, statistics, and cognitive modeling. It is primarily designed to be easy to plug into applications to provide adaptive behaviors.


CompLearn is a software system built to support compression-based learning in a wide variety of applications. It provides this support in the form of a library written in highly portable ANSI C that runs in most modern computer environments with minimal confusion. It also supplies a small suite of simple, composable command-line utilities as simple applications that use this library. Together with other commonly used machine-learning tools such as LibSVM and GraphViz, CompLearn forms an attractive offering in machine-learning frameworks and toolkits.
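As a rough illustration of compression-based learning, the normalized compression distance (NCD) compares how well two inputs compress together versus apart. The sketch below is not CompLearn's API; it assumes Python's standard zlib as the compressor, and the function names and sample data are made up for the example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input, used as a stand-in for C(x)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample data: two similar English snippets and one unrelated
# stream of numbers.
english_a = b"the quick brown fox jumps over the lazy dog " * 20
english_b = b"the quick brown fox leaps over the lazy cat " * 20
numbers = " ".join(str((i * 7919) % 10007) for i in range(300)).encode()

print("similar texts:  ", ncd(english_a, english_b))
print("unrelated data: ", ncd(english_a, numbers))
```

Inputs that share structure should score closer to 0 and unrelated inputs closer to 1, although the exact values depend on the compressor chosen.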


This repository contains a Torch implementation for both the DeepMask and SharpMask object proposal algorithms.

DeepMask is trained with two objectives: given an image patch, one branch of the model outputs a class-agnostic segmentation mask, while the other branch outputs how likely the patch is to contain an object. At test time, DeepMask is applied densely to an image and generates a set of object masks, each with a corresponding objectness score. These masks densely cover the objects in an image and can be used as a first step for object detection and other tasks in computer vision.

SharpMask is an extension of DeepMask which generates higher-fidelity masks using an additional top-down refinement step. The idea is to first generate a coarse mask encoding in a feedforward pass, then refine this mask encoding in a top-down pass using features at successively lower layers. This results in masks that better adhere to object boundaries.


Dlib's machine learning library. A major design goal of this portion of the library is to provide a highly modular and simple architecture for dealing with kernel algorithms. Towards this end, dlib takes a generic programming approach using C++ templates. In particular, each algorithm is parameterized to allow a user to supply either one of the predefined dlib kernels (e.g. RBF operating on column vectors), or a new user defined kernel. Moreover, the implementations of the algorithms are totally separated from the data on which they operate. This makes the dlib implementation generic enough to operate on any kind of data, be it column vectors, images, or some other form of structured data. All that is necessary is an appropriate kernel.
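The idea of an algorithm that only touches its data through a kernel can be sketched in a few lines. This is a Python illustration of the design, not dlib's C++ template interface; the kernel perceptron was picked only because it is short, and all names here are assumptions for the example.

```python
import math


def rbf_kernel(gamma):
    """Return an RBF kernel on equal-length numeric vectors."""
    def kernel(a, b):
        return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))
    return kernel


def decision(samples, labels, alphas, kernel, x):
    """Signed score for x; the samples are only ever passed to the kernel."""
    return sum(a * y * kernel(s, x) for a, y, s in zip(alphas, labels, samples))


def train_kernel_perceptron(samples, labels, kernel, max_epochs=100):
    """Kernel perceptron: count a mistake on sample i by incrementing alphas[i]."""
    alphas = [0] * len(samples)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(samples, labels)):
            if y * decision(samples, labels, alphas, kernel, x) <= 0:
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alphas


# XOR is not linearly separable, but it is separable with an RBF kernel.
xor_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
xor_labels = [-1, 1, 1, -1]
kernel = rbf_kernel(gamma=1.0)
alphas = train_kernel_perceptron(xor_points, xor_labels, kernel)
for point, label in zip(xor_points, xor_labels):
    score = decision(xor_points, xor_labels, alphas, kernel, point)
    print(point, label, round(score, 3))
```

Because the training loop never inspects the samples directly, swapping in a kernel defined on strings or images would require no change to the algorithm itself.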


Elefant (Efficient Learning, Large-scale Inference, and Optimisation Toolkit) is an open source library for machine learning licensed under the Mozilla Public License (MPL). We develop an open source machine learning toolkit which provides


FastText is a library for fast text representation and classification. It supports both text classification and learning word vector representations through techniques like bag of words and subword information. Based on the skip-gram model, each word is represented as a bag of character n-grams, with a vector associated with each character n-gram.
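To make the subword idea concrete, here is a small sketch of extracting character n-grams and summing their vectors. It is not fastText's API: the function names are hypothetical, the "<" and ">" boundary markers follow the convention of the fastText paper, and the vectors are toy values rather than trained ones.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    token = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


def word_vector(word, ngram_vectors, dim, n_min=3, n_max=6):
    """Sum the vectors of the word's known n-grams (unknown n-grams are skipped)."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n_min, n_max):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            for j in range(dim):
                total[j] += vec[j]
    return total


# Toy two-dimensional n-gram vectors, invented for the example.
toy_vectors = {"<wh": [1.0, 0.0], "re>": [0.0, 2.0]}
print(char_ngrams("where", 3, 3))
print(word_vector("where", toy_vectors, dim=2, n_min=3, n_max=3))
```

One consequence of this representation is that a word never seen in training can still get a vector, as long as some of its n-grams were seen.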


goml (pronounced like the data format 'toml') is a batteries included machine learning library written entirely in Golang. It lets you create models of data stored as float64's, persist them to disk, and predict other values from them. The coolest part, among many cool parts, is that you can train most models in an on-line fashion, learning in a 'reactive' manner while waiting for further data on channels! Most models can also be trained in batch settings, using either stochastic or batch gradient descent.


Mahout's goal is to build machine learning libraries that scale to reasonably large data sets. Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

Currently Mahout supports mainly four use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
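The frequent itemset use case can be illustrated with a brute-force pair count on a single machine. This is only a sketch of the idea, not Mahout's distributed implementation; the function name, the support threshold and the shopping baskets are invented for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count how many baskets contain each item pair; keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical form per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
print(frequent_pairs(baskets, min_support=2))
```

Counting every pair grows quadratically with basket size, which is one reason large-scale systems use smarter algorithms and distribute the work.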

Maximum Entropy Toolkit

The Maximum Entropy Toolkit provides a set of tools and a library for constructing maximum entropy (maxent) models in either Python or C++.

The Maximum Entropy Model is a general purpose machine learning framework that has proved to be highly expressive and powerful in statistical natural language processing, statistical physics, computer vision and many other fields.
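For readers who have not met maxent models before, the conditional form is p(y|x) proportional to exp(sum of the weights of the features active for x and y). The sketch below only evaluates such a model with hand-set weights; it does not use the toolkit's API, does no training, and the feature and label names are hypothetical.

```python
import math


def maxent_probabilities(features, labels, weights):
    """Return p(label | features) for a maxent (log-linear) model.

    weights maps (feature, label) pairs to real numbers; missing pairs count as 0.
    """
    scores = {y: sum(weights.get((f, y), 0.0) for f in features) for y in labels}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())  # normalizing constant
    return {y: v / z for y, v in exp_scores.items()}


# Hand-picked weight for illustration: words ending in "ly" lean towards ADV.
weights = {("ends_in_ly", "ADV"): 2.0}
probs = maxent_probabilities(["ends_in_ly"], ["ADV", "NOUN"], weights)
print(probs)
```

With all weights at zero the model falls back to the uniform distribution, which is the maximum entropy choice when nothing is known.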


MBT is a memory-based tagger-generator and tagger in one. The tagger-generator part can generate a sequence tagger on the basis of a training set of tagged sequences; the tagger part can tag new sequences. MBT can, for instance, be used to generate part-of-speech taggers or chunkers for natural language processing. It has also been used for named-entity recognition, information extraction in domain-specific texts, and disfluency chunking in transcribed speech.


Milk is a machine learning toolkit in Python. Its focus is on supervised classification with several classifiers available: SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems. For unsupervised learning, milk supports k-means clustering and affinity propagation.

MLAP book samples

Not a library per se, but a whole slew of example machine learning algorithms from the book "Machine Learning: An Algorithmic Perspective" by Stephen Marsland. All code is written in Python.


NLTK, the Natural Language Toolkit, is a suite of Python libraries and programs for symbolic and statistical natural language processing. NLTK includes graphical demonstrations and sample data. It is accompanied by extensive documentation, including tutorials that explain the underlying concepts behind the language processing tasks supported by the toolkit.

NLTK is ideally suited to students who are learning NLP (natural language processing) or conducting research in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems.


MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.


Mlpack is a C++ machine learning library with emphasis on scalability, speed, and ease-of-use. Its aim is to make machine learning possible for novice users by means of a simple, consistent API, while simultaneously exploiting C++ language features to provide maximum performance and maximum flexibility for expert users. It is released free of charge, under the GNU Lesser General Public License (LGPL) version 3.

OpenAI Gym

A toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Go.

It provides the environment; you provide the algorithm. You can write your agent using your existing numerical computation library, such as TensorFlow or Theano.
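The division of labour between environment and agent can be sketched without installing anything. The CorridorEnv class below is a hypothetical stand-in, not part of Gym; it only imitates the reset/step shape of the classic interface, whose exact return values differ between Gym releases.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (1 = right, anything else = left)."""
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done, {}


def run_episode(env, policy, max_steps=100):
    """Run one episode; return (total reward, number of steps taken)."""
    observation = env.reset()
    total_reward = 0.0
    for step_count in range(1, max_steps + 1):
        observation, reward, done, _info = env.step(policy(observation))
        total_reward += reward
        if done:
            return total_reward, step_count
    return total_reward, max_steps


# The "algorithm" here is a fixed policy; a learning agent would go in its place.
print(run_episode(CorridorEnv(5), lambda observation: 1))
```

The loop in run_episode knows nothing about corridors, so the same code could drive any environment that offers the same two methods.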


OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products.

Example applications of the OpenCV library are Human-Computer Interaction (HCI); Object Identification, Segmentation and Recognition; Face Recognition; Gesture Recognition; Motion Tracking, Ego Motion, Motion Understanding; Structure From Motion (SFM); and Mobile Robotics.


Peach is a pure-Python module, based on SciPy and NumPy, that implements algorithms for computational intelligence and machine learning. Methods implemented include, but are not limited to, artificial neural networks, fuzzy logic, genetic algorithms, swarm intelligence and much more.

The aim of this library is primarily educational. Nonetheless, care was taken to make the methods implemented also very efficient.


Pebl is a Python library and command line application for learning the structure of a Bayesian network given prior knowledge and observations. Pebl includes the following features:


PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.

PyBrain contains algorithms for neural networks, for reinforcement learning (and the combination of the two), for unsupervised learning, and evolution. Since most of the current problems deal with continuous state and action spaces, function approximators (like neural networks) must be used to cope with the large dimensionality. Our library is built around neural networks in the kernel and all of the training methods accept a neural network as the to-be-trained instance. This makes PyBrain a powerful tool for real-life tasks.


Pyro is a universal probabilistic programming language (PPL) written in Python and supported by PyTorch on the backend. Pyro enables flexible and expressive deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling. It was designed with these key principles:


RL-Glue provides a standard interface that allows you to connect agents, environments, and experiment programs together, even if they are written in different languages. This has a number of benefits, such as:


scikits-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.


The machine learning toolbox's focus is on large scale kernel methods and especially on Support Vector Machines (SVM). It provides a generic SVM object interfacing to several different SVM implementations, among them the state of the art LibSVM and SVMLight. Each of the SVMs can be combined with a variety of kernels. The toolbox not only provides efficient implementations of the most common kernels, like the Linear, Polynomial, Gaussian and Sigmoid Kernel, but also comes with a number of recent string kernels such as the Locality Improved, Fisher, TOP, Spectrum, and Weighted Degree Kernel (with shifts). For the latter the efficient LINADD optimizations are implemented. SHOGUN also offers the freedom of working with custom pre-computed kernels. One of its key features is the combined kernel, which can be constructed by a weighted linear combination of a number of sub-kernels, each of which does not necessarily work on the same domain. An optimal sub-kernel weighting can be learned using Multiple Kernel Learning. Currently SVM 2-class classification and regression problems can be dealt with. However SHOGUN also implements a number of linear methods like Linear Discriminant Analysis (LDA), Linear Programming Machine (LPM), (Kernel) Perceptrons and features algorithms to train hidden Markov models. The input feature-objects can be dense, sparse or strings and of type int/short/double/char and can be converted into different feature types. Chains of preprocessors (e.g. subtracting the mean) can be attached to each feature object allowing for on-the-fly pre-processing.

SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave and Python.
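The combined kernel mentioned above can be pictured as follows. This is a Python sketch of the idea rather than SHOGUN's interface: the function names are assumptions, the weights are fixed by hand instead of being learned with Multiple Kernel Learning, and the spectrum kernel is a plain count of shared substrings.

```python
from collections import Counter


def linear_kernel(a, b):
    """Dot product of two numeric vectors."""
    return sum(p * q for p, q in zip(a, b))


def spectrum_kernel(k):
    """Return a string kernel: inner product of k-mer count vectors."""
    def kernel(s, t):
        counts_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
        counts_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
        return float(sum(counts_s[g] * counts_t[g] for g in counts_s))
    return kernel


def combined_kernel(weighted_kernels):
    """Weighted sum of sub-kernels, each applied to its own part of the example."""
    def kernel(x, y):
        return sum(w * k(xi, yi) for (w, k), xi, yi in zip(weighted_kernels, x, y))
    return kernel


# Each example has two parts living in different domains: a vector and a string.
example_x = ([1.0, 2.0], "GATTACA")
example_y = ([3.0, 0.5], "ATTAC")
kernel = combined_kernel([(0.5, linear_kernel), (0.25, spectrum_kernel(2))])
print(kernel(example_x, example_y))
```

A weighted sum of valid kernels with non-negative weights is itself a valid kernel, which is why the result can be handed to any kernel method.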


spaCy is a library for advanced natural language processing in Python and Cython. spaCy is built on the very latest research, but it isn't researchware. It was designed from day one to be used in real products. spaCy currently supports English, German and French, as well as tokenization for Spanish, Italian, Portuguese, Dutch, Swedish, Finnish, Norwegian, Hungarian, Bengali, Hebrew, Chinese and Japanese. It's commercial open-source software, released under the MIT license.


SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations, to distributed computations on Apache Hadoop and Apache Spark. SystemML's distinguishing characteristics are: (1) algorithm customizability, (2) multiple execution modes, including Standalone, Hadoop Batch, and Spark Batch, and (3) automatic optimization.


TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.
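A data flow graph of this kind can be mimicked in a few lines of plain Python. The sketch below is not TensorFlow code: Node, constant, add, multiply and evaluate are hypothetical names, and flat lists stand in for tensors.

```python
class Node:
    """A graph node: an operation plus the nodes whose outputs feed it."""

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = tuple(inputs)


def constant(values):
    """Source node that emits a fixed array."""
    data = list(values)
    return Node(lambda: data)


def add(a, b):
    """Element-wise addition node."""
    return Node(lambda x, y: [p + q for p, q in zip(x, y)], (a, b))


def multiply(a, b):
    """Element-wise multiplication node."""
    return Node(lambda x, y: [p * q for p, q in zip(x, y)], (a, b))


def evaluate(node, cache=None):
    """Evaluate a node by first evaluating its inputs; shared nodes run once."""
    if cache is None:
        cache = {}
    if id(node) not in cache:
        args = [evaluate(parent, cache) for parent in node.inputs]
        cache[id(node)] = node.op(*args)
    return cache[id(node)]


# Build the graph first, run it afterwards: result = a * b + a
a = constant([1, 2, 3])
b = constant([4, 5, 6])
result = add(multiply(a, b), a)
print(evaluate(result))
```

Separating graph construction from evaluation is what lets a real system decide later where and how each operation should run.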


TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, translation, and more.


The Tilburg Memory Based Learner, TiMBL, is a tool for NLP research, and for many other domains where classification tasks are learned from examples. It is an efficient implementation of the k-nearest neighbor classifier.
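A bare-bones k-nearest neighbor classifier over symbolic features might look like the sketch below. It is not TiMBL's implementation or file format; the overlap distance (number of mismatching feature values) is just one simple choice, and the tagging examples are invented.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions where the two examples disagree."""
    return sum(1 for p, q in zip(a, b) if p != q)


def knn_classify(memory, query, k=3):
    """Label the query by majority vote among its k nearest stored examples."""
    nearest = sorted(memory, key=lambda item: overlap_distance(item[0], query))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]


# Stored examples: (previous tag, word suffix, capitalized?) -> tag
memory = [
    (("DET", "og", "no"), "NOUN"),
    (("DET", "at", "no"), "NOUN"),
    (("NOUN", "ks", "no"), "VERB"),
    (("NOUN", "ns", "no"), "VERB"),
    (("DET", "ed", "no"), "ADJ"),
]
print(knn_classify(memory, ("DET", "og", "yes"), k=3))
```

There is no training step beyond storing the examples, which is why this family of methods is called memory-based learning.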

TiMBL's features are:

7.2 Applications

Full applications that implement various machine learning or statistical systems oriented toward general learning (i.e., no spam filters and the like).


The dbacl project consists of a set of lightweight UNIX/POSIX utilities which can be used, either directly or in shell scripts, to classify text documents automatically, according to Bayesian statistical principles.
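To give a feel for Bayesian text classification, here is a textbook multinomial naive Bayes classifier with add-one smoothing. It is an assumption-laden sketch and not a description of dbacl's own models or command-line interface; the function names and training documents are made up.

```python
import math
from collections import Counter


def train_naive_bayes(documents):
    """Collect per-category word counts from (text, category) pairs."""
    word_counts = {}
    doc_counts = Counter()
    vocabulary = set()
    for text, category in documents:
        doc_counts[category] += 1
        counts = word_counts.setdefault(category, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocabulary.add(word)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log posterior for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)  # log prior
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


model = train_naive_bayes([
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "work"),
    ("project meeting notes", "work"),
])
print(classify("cheap pills", model))
print(classify("monday meeting", model))
```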


Torch provides a Matlab-like environment for state-of-the-art machine learning algorithms. It is easy to use and provides a very efficient implementation, thanks to an easy and fast scripting language (Lua) and an underlying C implementation.

Vowpal Wabbit

Vowpal Wabbit is a fast online learning algorithm. It features:

The core algorithm is specialist gradient descent (GD) on a loss function (several are available). The code should be easily usable.
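Online gradient descent on a loss function can be sketched as one weight update per incoming example. The code below is plain gradient descent on squared loss for a linear model, written for illustration; it does not reproduce Vowpal Wabbit's specialised algorithm, input format or options, and the data stream is synthetic.

```python
def online_gradient_descent(stream, n_features, learning_rate=0.1):
    """Update a linear model once per example, never storing the data set."""
    weights = [0.0] * n_features
    for features, target in stream:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - target
        # Gradient of 0.5 * error**2 with respect to weight j is error * x_j.
        for j, x in enumerate(features):
            weights[j] -= learning_rate * error * x
    return weights


def example_stream(passes):
    """Synthetic noise-free stream for y = 2x + 1; the leading 1.0 is a bias input."""
    for _ in range(passes):
        for x in (0.0, 0.25, 0.5, 0.75, 1.0):
            yield [1.0, x], 2.0 * x + 1.0


weights = online_gradient_descent(example_stream(2000), n_features=2)
print(weights)
```

Since each example is seen and then discarded, memory use stays constant no matter how long the stream runs.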
