Connecting DuckDB to the mlpack C++ machine learning library
Installing and Loading
INSTALL mlpack FROM community;
LOAD mlpack;
Example
-- Perform adaBoost (using weak learner 'Perceptron' by default)
-- Read 'features' into 'X', 'labels' into 'Y', use optional parameters
-- from 'Z', and prepare model storage in 'M'
CREATE TABLE X AS SELECT * FROM read_csv("https://eddelbuettel.github.io/duckdb-mlpack/data/iris.csv");
CREATE TABLE Y AS SELECT * FROM read_csv("https://eddelbuettel.github.io/duckdb-mlpack/data/iris_labels.csv");
CREATE TABLE Z (name VARCHAR, value VARCHAR);
INSERT INTO Z VALUES ('iterations', '50'), ('tolerance', '1e-7');
CREATE TABLE M (key VARCHAR, json VARCHAR);
-- Train model for 'Y' on 'X' using parameters 'Z', store in 'M'
CREATE TEMP TABLE A AS SELECT * FROM mlpack_adaboost_train("X", "Y", "Z", "M");
-- Count by predicted group
SELECT COUNT(*) as n, predicted FROM A GROUP BY predicted;
-- Model 'M' can be used to predict
CREATE TABLE N (x1 DOUBLE, x2 DOUBLE, x3 DOUBLE, x4 DOUBLE);
-- inserting approximate column mean values
INSERT INTO N VALUES (5.843, 3.054, 3.759, 1.199);
-- inserting approximate column mean values, min values, max values
INSERT INTO N VALUES (5.843, 3.054, 3.759, 1.199), (4.3, 2.0, 1.0, 0.1), (7.9, 4.4, 6.9, 2.5);
-- and this predicts one class for each row of 'N'
SELECT * FROM mlpack_adaboost_pred("N", "M");
About mlpack
The mlpack extension lets you fit (or train) models and predict (or classify) with them. It currently implements four methods: adaBoost, random forests, and (regularized) linear and logistic regression. The interface is the same for all four: four tables, say "X", "Y", "Z" and "M", hold, respectively, the features, the labels, the optional parameters (which vary by model), and the JSON-serialized model written as output. After a model fit (or training), a prediction (or classification) can be made using "M" and a table of new predictor values "N", as shown in the example. All "fit" (or "train") functions take four table arguments; all "predict" functions take two. The setting "mlpack_verbose" can also be enabled.
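Because the four-table pattern is shared, the same steps should carry over to the other methods. The following is a minimal sketch for linear regression, not a tested recipe: the table names "X2", "Y2", "Z2", "M2" and "N2" and the toy data are hypothetical, the parameter name 'lambda' comes from the function listing, and its value format and the use of a DOUBLE response column are assumptions.

```sql
-- Hypothetical toy data: the response is roughly 2*x1 + 3*x2
CREATE TABLE X2 (x1 DOUBLE, x2 DOUBLE);
INSERT INTO X2 VALUES (1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0);
CREATE TABLE Y2 (y DOUBLE);
INSERT INTO Y2 VALUES (8.1), (6.9), (18.2), (16.8), (25.0);
-- Optional parameters, same (name, value) layout as 'Z' in the example;
-- the value '0.1' for 'lambda' is an assumed format
CREATE TABLE Z2 (name VARCHAR, value VARCHAR);
INSERT INTO Z2 VALUES ('lambda', '0.1');
-- Model storage, same layout as 'M' in the example
CREATE TABLE M2 (key VARCHAR, json VARCHAR);
-- Fit: four table arguments (features, response, parameters, model)
SELECT * FROM mlpack_linear_regression_fit("X2", "Y2", "Z2", "M2");
-- Predict: two table arguments (new predictor values, stored model)
CREATE TABLE N2 (x1 DOUBLE, x2 DOUBLE);
INSERT INTO N2 VALUES (2.5, 2.5);
SELECT * FROM mlpack_linear_regression_pred("N2", "M2");
```

Using separate table names keeps this sketch from overwriting the adaBoost model stored in "M".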
The implementation continues to stress the 'minimal' part of 'an MVP demo'. It wraps four machine learning methods and provides Linux and macOS builds. More methods, options or parameters can be added quite easily. Interfaces may change while we work out how to automate interface generation from the mlpack side, so the extension should be considered experimental.
For more, please see the repo.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| mlpack_adaboost_train | table | use adaboost to train and store model | parameters 'iterations', 'tolerance', 'perceptronIter' and 'silent' | NULL |
| mlpack_adaboost_pred | table | predict classification using stored adaboost model | NULL | NULL |
| mlpack_linear_regression_fit | table | fit and store linear regression model | parameters 'lambda', 'intercept' and 'silent' | NULL |
| mlpack_linear_regression_pred | table | predict using stored linear regression model | NULL | NULL |
| mlpack_logistic_regression_fit | table | fit and store logistic regression model | parameters 'lambda', 'intercept' and 'silent' | NULL |
| mlpack_logistic_regression_pred | table | predict classification using stored logistic regression model | NULL | NULL |
| mlpack_random_forest_train | table | use random forest to train and store model | parameters 'nclasses', 'ntrees', 'seed', 'threads' and 'silent' | NULL |
| mlpack_random_forest_pred | table | predict classification using stored random forest model | NULL | NULL |
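To check which of these functions a given build actually provides, the listing can be queried from a session with the extension loaded. This sketch uses DuckDB's built-in `duckdb_functions()` table function; filtering on the 'mlpack_' prefix is an assumption based on the names in the table above.

```sql
-- List the table functions registered by the mlpack extension
SELECT function_name, function_type, description
FROM duckdb_functions()
WHERE starts_with(function_name, 'mlpack_')
ORDER BY function_name;
```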
Added Settings
| name | description | input_type | scope | aliases |
|---|---|---|---|---|
| mlpack_silent | Toggle whether to operate in silent mode, default is false | BOOLEAN | GLOBAL | [] |
| mlpack_verbose | Toggle whether to operate in verbose mode, default is false | BOOLEAN | GLOBAL | [] |
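Both settings are ordinary DuckDB options, so they can presumably be changed with the standard `SET` statement and inspected with `current_setting()` or `duckdb_settings()`. A short sketch, assuming the extension has been loaded:

```sql
-- Turn on verbose output for subsequent mlpack calls
SET mlpack_verbose = true;
-- Inspect the current value of one setting
SELECT current_setting('mlpack_verbose');
-- Or list both mlpack settings together
SELECT name, value, description
FROM duckdb_settings()
WHERE starts_with(name, 'mlpack_');
-- Return to the documented default
SET mlpack_verbose = false;
```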