GNU/Linux AI & Alife HOWTO: Statistical & Machine Learning

7. Statistical & Machine Learning

This section is all about getting machines to learn to do something rather than explicitly programming them to do it. The field tends to deal with pattern matching a lot and is heavily mathematical and statistical. Technically, Connectionism falls under this category, but it is such a large sub-field that I'm keeping it in a separate section.

7.1 Libraries

Libraries or frameworks used for writing machine learning systems.

CognitiveFoundry

Web site: http://foundry.sandia.gov/

The Cognitive Foundry is a modular Java software library for the research and development of cognitive systems. It contains many reusable components for machine learning, statistics, and cognitive modeling. It is primarily designed to be easy to plug into applications to provide adaptive behaviors.

CompLearn

Web site: http://complearn.org/

CompLearn is a software system built to support compression-based learning in a wide variety of applications. It provides this support in the form of a library written in highly portable ANSI C that runs in most modern computer environments with minimal confusion. It also supplies a small suite of simple, composable command-line utilities as simple applications that use this library. Together with other commonly used machine-learning tools such as LibSVM and GraphViz, CompLearn forms an attractive offering in machine-learning frameworks and toolkits.
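
Compression-based learning is commonly built on the normalized compression distance (NCD), which treats two objects as similar when compressing them together costs little more than compressing one alone. The following is a minimal sketch of that idea using Python's standard zlib module. It is not CompLearn's own API, and the function names and sample data are made up for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative data: two similar texts and one unrelated byte pattern.
text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = b"the quick brown fox leaps over the lazy cat " * 20
unrelated = bytes(range(256)) * 4

print("similar texts:  ", ncd(text_a, text_b))
print("unrelated data: ", ncd(text_a, unrelated))
```

Smaller values mean "more alike". A matrix of such pairwise distances can then be fed to a clustering or tree-building tool.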

DeepMask-SharpMask

Web site: https://github.com/facebookresearch/deepmask

This repository contains a Torch implementation for both the DeepMask and SharpMask object proposal algorithms.

DeepMask is trained with two objectives: given an image patch, one branch of the model outputs a class-agnostic segmentation mask, while the other branch outputs how likely the patch is to contain an object. At test time, DeepMask is applied densely to an image and generates a set of object masks, each with a corresponding objectness score. These masks densely cover the objects in an image and can be used as a first step for object detection and other tasks in computer vision.

SharpMask is an extension of DeepMask which generates higher-fidelity masks using an additional top-down refinement step. The idea is to first generate a coarse mask encoding in a feedforward pass, then refine this mask encoding in a top-down pass using features at successively lower layers. This results in masks that better adhere to object boundaries.

dlib

Web site: http://dlib.net/ml.html

Dlib's machine learning library. A major design goal of this portion of the library is to provide a highly modular and simple architecture for dealing with kernel algorithms. Towards this end, dlib takes a generic programming approach using C++ templates. In particular, each algorithm is parameterized to allow a user to supply either one of the predefined dlib kernels (e.g. RBF operating on column vectors), or a new user defined kernel. Moreover, the implementations of the algorithms are totally separated from the data on which they operate. This makes the dlib implementation generic enough to operate on any kind of data, be it column vectors, images, or some other form of structured data. All that is necessary is an appropriate kernel.

Elefant

Web site: http://elefant.developer.nicta.com.au/

Elefant (Efficient Learning, Large-scale Inference, and Optimisation Toolkit) is an open source library for machine learning licensed under the Mozilla Public License (MPL). We develop an open source machine learning toolkit which provides:

algorithms for machine learning utilising the power of multi-core/multi-threaded processors/operating systems (Linux, Windows, Mac OS X),
a graphical user interface for users who want to quickly prototype machine learning experiments,
tutorials to support learning about Statistical Machine Learning (Statistical Machine Learning at The Australian National University), and
detailed and precise documentation for each of the above.

fastText

Web site: https://github.com/facebookresearch/fastText

FastText is a library for fast text representation and classification. It supports both text classification and learning word vector representations through techniques like bag of words and subword information. Based on the skip-gram model, words are represented as bags of character n-grams, with a vector representing each character n-gram.
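
To make the "bag of character n-grams" idea concrete, here is a small sketch that splits a word into its character n-grams after wrapping it in boundary markers. This is plain Python for illustration, not a call into the fastText library. The function name and the default 3 to 6 n-gram range are assumptions of this example.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"  # '<' and '>' mark the word boundaries
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

In a subword model each of these n-grams gets its own vector, so a word never seen in training can still be represented through the n-grams it shares with known words.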

goml

Web site: https://github.com/cdipaolo/goml

goml (pronounced like the data format 'toml') is a batteries included machine learning library written entirely in Golang. It lets you create models of data stored as float64's, persist them to disk, and predict other values from them. The coolest part, among many cool parts, is that you can train most models in an on-line fashion, learning in a 'reactive' manner while waiting for further data on channels! Most models can also be trained in batch settings, using either stochastic or batch gradient descent.

Mahout

Web site: https://mahout.apache.org/

Mahout's goal is to build machine learning libraries that scale to reasonably large data sets. Our core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, we do not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are also highly optimized to give good performance for non-distributed algorithms.

Currently Mahout supports mainly four use cases. Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes, for example, text documents and groups them into sets of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
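
As a rough illustration of the frequent itemset use case, the sketch below counts how often pairs of items appear together in a handful of shopping baskets and keeps the pairs that meet a minimum support. It is a single-machine toy in plain Python, not Mahout's distributed implementation. The function name, the baskets and the support threshold are all invented for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["beer", "bread"],
    ["butter", "milk"],
]
print(frequent_pairs(baskets, min_support=2))
# {('bread', 'milk'): 2, ('butter', 'milk'): 2}
```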

Maximum Entropy Toolkit

Web site: http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html

The Maximum Entropy Toolkit provides a set of tools and a library for constructing maximum entropy (maxent) models in either Python or C++.

The Maximum Entropy Model is a general purpose machine learning framework that has proved to be highly expressive and powerful in statistical natural language processing, statistical physics, computer vision and many other fields.

MBT

Web site: http://ilk.uvt.nl/mbt/

MBT is a memory-based tagger-generator and tagger in one. The tagger-generator part can generate a sequence tagger on the basis of a training set of tagged sequences; the tagger part can tag new sequences. MBT can, for instance, be used to generate part-of-speech taggers or chunkers for natural language processing. It has also been used for named-entity recognition, information extraction in domain-specific texts, and disfluency chunking in transcribed speech.

Milk

Web site: http://packages.python.org/milk/
Web site: https://github.com/luispedro/milk

Milk is a machine learning toolkit in Python. Its focus is on supervised classification with several classifiers available: SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems. For unsupervised learning, milk supports k-means clustering and affinity propagation.

MLAP book samples

Web site: http://seat.massey.ac.nz/personal/s.r.marsland/MLBook.html

Not a library per se, but a whole slew of example machine learning algorithms from the book "Machine Learning: An Algorithmic Perspective" by Stephen Marsland. All code is written in Python.

NLTK

Web site: http://nltk.org/

NLTK, the Natural Language Toolkit, is a suite of Python libraries and programs for symbolic and statistical natural language processing. NLTK includes graphical demonstrations and sample data. It is accompanied by extensive documentation, including tutorials that explain the underlying concepts behind the language processing tasks supported by the toolkit.

NLTK is ideally suited to students who are learning NLP (natural language processing) or conducting research in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems.

MLlib

Web site: https://spark.apache.org/docs/latest/mllib-guide.html

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.

mlpack

Web site: http://mlpack.org/

Mlpack is a C++ machine learning library with emphasis on scalability, speed, and ease-of-use. Its aim is to make machine learning possible for novice users by means of a simple, consistent API, while simultaneously exploiting C++ language features to provide maximum performance and maximum flexibility for expert users. It is released free of charge, under the GNU Lesser General Public License (LGPL) version 3.

OpenAI Gym

Web site: https://gym.openai.com/
Web site: https://github.com/openai/gym

A toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Go.

It provides the environment; you provide the algorithm. You can write your agent using your existing numerical computation library, such as TensorFlow or Theano.
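
The division of labor described above can be sketched as a loop in which the agent picks an action and the environment answers with an observation, a reward and a "done" flag. The toy environment below only imitates the reset/step pattern of Gym environments. The CorridorEnv class, its reward scheme and the run_episode helper are hypothetical and written just for this example.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to the goal."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        """Apply an action (0 = left, 1 = right); return (obs, reward, done, info)."""
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.position == self.length - 1
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done, {}


def run_episode(env, policy):
    """Run one episode with the given policy and return the total reward."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, done, _info = env.step(action)
        total_reward += reward
    return total_reward


# The "algorithm" here is a fixed policy; a learning agent would adapt it.
print(run_episode(CorridorEnv(), lambda observation: 1))
```

Because the loop only talks to reset and step, the same run_episode function would work with any environment that follows this pattern.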

OpenCV

Web site: http://opencv.org/

OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products.

Example applications of the OpenCV library are Human-Computer Interaction (HCI); Object Identification, Segmentation and Recognition; Face Recognition; Gesture Recognition; Motion Tracking, Ego Motion, Motion Understanding; Structure From Motion (SFM); and Mobile Robotics.

peach

Web site: http://code.google.com/p/peach/

Peach is a pure-Python module, based on SciPy and NumPy, that implements algorithms for computational intelligence and machine learning. Methods implemented include, but are not limited to, artificial neural networks, fuzzy logic, genetic algorithms, and swarm intelligence.

The aim of this library is primarily educational. Nonetheless, care was taken to make the methods implemented also very efficient.

pebl

Web site: http://code.google.com/p/pebl-project/

Pebl is a Python library and command line application for learning the structure of a Bayesian network given prior knowledge and observations. Pebl includes the following features:

Can learn with observational and interventional data
Handles missing values and hidden variables using exact and heuristic methods
Provides several learning algorithms; makes creating new ones simple
Has facilities for transparent parallel execution using several cluster/grid resources
Calculates edge marginals and consensus networks
Presents results in a variety of formats

PyBrain

Web site: http://pybrain.org/

PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning tasks and a variety of predefined environments to test and compare your algorithms.

PyBrain contains algorithms for neural networks, for reinforcement learning (and the combination of the two), for unsupervised learning, and evolution. Since most of the current problems deal with continuous state and action spaces, function approximators (like neural networks) must be used to cope with the large dimensionality. Our library is built around neural networks in the kernel and all of the training methods accept a neural network as the to-be-trained instance. This makes PyBrain a powerful tool for real-life tasks.

Pyro

Web site: http://pyro.ai/

Pyro is a universal probabilistic programming language (PPL) written in Python and supported by PyTorch on the backend. Pyro enables flexible and expressive deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling. It was designed with these key principles:

Universal: Pyro can represent any computable probability distribution.
Scalable: Pyro scales to large data sets with little overhead.
Minimal: Pyro is implemented with a small core of powerful, composable abstractions.
Flexible: Pyro aims for automation when you want it, control when you need it.

RL-Glue

Web site: http://glue.rl-community.org/wiki/Main_Page

RL-Glue provides a standard interface that allows you to connect agents, environments, and experiment programs together, even if they are written in different languages. This has a number of benefits, such as:

Program agents and environments in the language of your choice
Re-use agents and environments that have been created by other members of the reinforcement learning community
Easily share your own agents and environments, making it easier for people to compare and build on your work
Teach a class that involves reinforcement learning, leveraging all of the "boring" plumbing provided by RL-Glue
Extend your crazy simulator/game/problem to work with RL-Glue, making it accessible to the rest of the reinforcement learning community
Run reinforcement learning agents on your robot, in the language of your choice

scikits.learn

Web site: http://scikit-learn.org/stable/

scikit-learn (formerly scikits.learn) is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.

Shogun

Web site: http://www.shogun-toolbox.org/

The focus of this machine learning toolbox is on large scale kernel methods and especially on Support Vector Machines (SVM). It provides a generic SVM object interfacing to several different SVM implementations, among them the state of the art LibSVM and SVMLight. Each of the SVMs can be combined with a variety of kernels. The toolbox not only provides efficient implementations of the most common kernels, like the Linear, Polynomial, Gaussian and Sigmoid Kernel, but also comes with a number of recent string kernels such as the Locality Improved, Fisher, TOP, Spectrum, and Weighted Degree Kernel (with shifts). For the latter the efficient LINADD optimizations are implemented.

SHOGUN also offers the freedom of working with custom pre-computed kernels. One of its key features is the combined kernel, which can be constructed by a weighted linear combination of a number of sub-kernels, each of which need not work on the same domain. An optimal sub-kernel weighting can be learned using Multiple Kernel Learning.

Currently SVM 2-class classification and regression problems can be dealt with. However, SHOGUN also implements a number of linear methods like Linear Discriminant Analysis (LDA), Linear Programming Machine (LPM), and (Kernel) Perceptrons, and features algorithms to train hidden Markov models. The input feature-objects can be dense, sparse or strings and of type int/short/double/char and can be converted into different feature types. Chains of preprocessors (e.g. subtracting the mean) can be attached to each feature object, allowing for on-the-fly pre-processing.

SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave and Python.
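
The combined kernel idea can be sketched in a few lines: each example has several parts (here a numeric vector and a string), each sub-kernel looks at one part, and the results are added up with weights. This is plain Python for illustration, not SHOGUN's interface. The function names, the Gaussian parameterization and the hand-picked weights are assumptions of this example, whereas Multiple Kernel Learning would learn the weights from data.

```python
import math
from collections import Counter


def gaussian_kernel(x, y, width=1.0):
    """Gaussian kernel on numeric vectors (one common parameterization)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / width)


def spectrum_kernel(s, t, k=2):
    """Spectrum kernel on strings: inner product of k-mer count vectors."""
    kmers_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    kmers_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return float(sum(kmers_s[kmer] * kmers_t[kmer] for kmer in kmers_s))


def combined_kernel(sub_kernels, weights):
    """Build a kernel that applies sub_kernels[i] to the i-th part of each example."""
    def kernel(x, y):
        return sum(
            weight * sub_kernel(x_part, y_part)
            for weight, sub_kernel, x_part, y_part in zip(weights, sub_kernels, x, y)
        )
    return kernel


kernel = combined_kernel([gaussian_kernel, spectrum_kernel], weights=[0.5, 0.5])
example_a = ([0.0, 1.0], "ACGT")
example_b = ([0.0, 1.0], "ACGA")
print(kernel(example_a, example_b))
# 0.5 * 1.0 (identical vectors) + 0.5 * 2.0 (shared 2-mers AC and CG) = 1.5
```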

spaCy

Web site: https://spacy.io/
Web site: https://github.com/explosion/spaCy

spaCy is a library for advanced natural language processing in Python and Cython. spaCy is built on the very latest research, but it isn't researchware. It was designed from day one to be used in real products. spaCy currently supports English, German and French, as well as tokenization for Spanish, Italian, Portuguese, Dutch, Swedish, Finnish, Norwegian, Hungarian, Bengali, Hebrew, Chinese and Japanese. It's commercial open-source software, released under the MIT license.

SystemML

Web site: http://systemml.apache.org/
Web site: https://github.com/apache/incubator-systemml

SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations, to distributed computations on Apache Hadoop and Apache Spark. SystemML's distinguishing characteristics are: (1) algorithm customizability, (2) multiple execution modes, including Standalone, Hadoop Batch, and Spark Batch, and (3) automatic optimization.

TensorFlow

Web site: http://tensorflow.org/

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

TextBlob

Web site: https://textblob.readthedocs.org/en/latest/
Web site: https://github.com/sloria/textblob

TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, translation, and more.

timbl

Web site: http://ilk.uvt.nl/timbl/

The Tilburg Memory Based Learner, TiMBL, is a tool for NLP research, and for many other domains where classification tasks are learned from examples. It is an efficient implementation of the k-nearest neighbor classifier.

TiMBL's features are:

Fast, decision-tree-based implementation of k-nearest neighbor classification;
Implementations of IB1 and IB2, IGTree, TRIBL, and TRIBL2 algorithms;
Similarity metrics: Overlap, MVDM, Jeffrey Divergence, Dot product, Cosine;
Feature weighting metrics: information gain, gain ratio, chi squared, shared variance;
Distance weighting metrics: inverse, inverse linear, exponential decay;
Extensive verbosity options to inspect nearest neighbor sets;
Server functionality and extensive API;
Fast leave-one-out testing and internal cross-validation;
Handles user-defined example weighting.
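
For readers new to memory-based learning, here is a bare-bones k-nearest neighbor classifier using the Overlap metric, which simply counts the feature positions where two examples differ. It is a brute-force sketch in plain Python with no feature weighting and none of TiMBL's tree-based speedups. The function names and the tiny training set are invented for illustration.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions at which two examples differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(training, query, k=3):
    """Label the query by majority vote among its k nearest training examples.

    training is a list of (features, label) pairs.
    """
    neighbors = sorted(training, key=lambda ex: overlap_distance(ex[0], query))[:k]
    votes = Counter(label for _features, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy word-window examples: (previous word, focus word, next word) -> tag.
training = [
    (("the", "dog", "runs"), "NOUN"),
    (("the", "cat", "sleeps"), "NOUN"),
    (("a", "dog", "barks"), "NOUN"),
    (("dog", "runs", "fast"), "VERB"),
    (("cat", "sleeps", "well"), "VERB"),
]
print(knn_classify(training, ("the", "dog", "sleeps"), k=3))  # NOUN
print(knn_classify(training, ("bird", "runs", "fast"), k=1))  # VERB
```

Nothing is "trained" here: the examples are just stored, and all the work happens at classification time.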

7.2 Applications

Full applications that implement various machine learning or statistical systems oriented toward general learning (i.e., no spam filters and the like).

dbacl

Web site: http://dbacl.sourceforge.net/

The dbacl project consists of a set of lightweight UNIX/POSIX utilities which can be used, either directly or in shell scripts, to classify text documents automatically, according to Bayesian statistical principles.
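
To give a feel for Bayesian text classification in general, the sketch below implements a small multinomial naive Bayes classifier with add-one smoothing. This is a textbook method chosen for illustration and is not a description of dbacl's actual model or command-line interface. The class name, categories and sample sentences are made up.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Minimal multinomial naive Bayes classifier over whitespace tokens."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of documents
        self.vocabulary = set()

    def learn(self, category, text):
        """Add one labelled document to the model."""
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the category with the highest posterior log-probability."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, counts in self.word_counts.items():
            # Start from the log prior of the category.
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(counts.values())
            for word in words:
                # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
                probability = (counts[word] + 1) / (total_words + len(self.vocabulary))
                score += math.log(probability)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


model = NaiveBayesText()
model.learn("sports", "the team won the match")
model.learn("sports", "a great goal in the final match")
model.learn("cooking", "boil the pasta and add salt")
model.learn("cooking", "bake the bread in a hot oven")
print(model.classify("the match was great"))    # sports
print(model.classify("add salt to the pasta"))  # cooking
```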

Torch

Web site: http://www.torch.ch/
Old versions: Torch5, Torch3

Torch provides a Matlab-like environment for state-of-the-art machine learning algorithms. It is easy to use and provides a very efficient implementation, thanks to an easy and fast scripting language (Lua) and an underlying C implementation.

Vowpal Wabbit

Web site: http://hunch.net/~vw/

Vowpal Wabbit is a fast online learning algorithm. It features:

flexible input data specification
speedy learning
scalability (bounded memory footprint, suitable for distributed computation)
feature pairing

The core algorithm is specialist gradient descent (GD) on a loss function (several are available). The code should be easily usable.
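
For the curious, online gradient descent on a loss function boils down to: predict, measure the error, nudge the weights, move on to the next example. The sketch below does exactly that for squared loss with sparse, named features. It is a plain-Python illustration of the general technique, not Vowpal Wabbit's implementation or input format. The function name, learning rate and sample stream are assumptions of this example.

```python
def online_gradient_descent(examples, learning_rate=0.1):
    """One pass of gradient descent on squared loss over a stream of examples.

    Each example is (features, label), where features maps names to values.
    Only the weights of features present in an example are touched.
    """
    weights = {}
    for features, label in examples:
        prediction = sum(
            weights.get(name, 0.0) * value for name, value in features.items()
        )
        error = prediction - label  # derivative of 0.5 * (prediction - label) ** 2
        for name, value in features.items():
            weights[name] = weights.get(name, 0.0) - learning_rate * error * value
    return weights


# Illustrative stream generated by the rule: label = 2 * a - 1 * b.
stream = [({"a": 1.0}, 2.0), ({"b": 1.0}, -1.0)] * 200
print(online_gradient_descent(stream))
```

Each example is seen once and then discarded, which is why memory use stays bounded no matter how long the stream is.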
