Welcome to Cooka

Cooka is a lightweight, visual system for managing datasets and designing machine learning experiments through a web UI.

It uses DeepTables and HyperGBM as experiment engines to perform feature engineering, neural architecture search and hyperparameter tuning automatically.

../../static/datacanvas_automl_toolkit.png

Features overview

Through the web UI provided by Cooka you can:

  1. Add and analyze datasets

  2. Design experiments

  3. View experiment progress and results

  4. Use models

  5. Export experiments to Jupyter Notebook

Screenshots

_images/cooka_home_page.png _images/cooka_train.gif

The machine learning algorithms supported are:

  • XGBoost

  • LightGBM

  • CatBoost

The neural networks supported are:

  • WideDeep

  • DeepFM

  • xDeepFM

  • AutoInt

  • DCN

  • FGCNN

  • FiBiNet

  • PNN

  • AFM

The search algorithms supported are:

  • Evolution

  • MCTS (Monte Carlo Tree Search)

The supported feature engineering methods, provided by scikit-learn and featuretools, are:

  • Scaler
    • StandardScaler

    • MinMaxScaler

    • RobustScaler

    • MaxAbsScaler

    • Normalizer

  • Encoder
    • LabelEncoder

    • OneHotEncoder

    • OrdinalEncoder

  • Discretizer
    • KBinsDiscretizer

    • Binarizer

  • Dimension Reduction
    • PCA

  • Feature derivation
    • featuretools

  • Missing value filling
    • SimpleImputer

The search space can also be extended to support more feature engineering methods and modeling algorithms.
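
For reference, these transformers come from scikit-learn; below is a minimal standalone sketch of a few of them on toy data (illustrative only, not Cooka's internal pipeline):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, KBinsDiscretizer

X_num = np.array([[1.0], [2.0], [np.nan], [4.0]])         # toy numeric column with a missing value
X_cat = np.array([["red"], ["green"], ["red"], ["blue"]])  # toy categorical column

X_filled = SimpleImputer(strategy="mean").fit_transform(X_num)                   # missing value filling
X_scaled = StandardScaler().fit_transform(X_filled)                              # zero mean, unit variance
X_binned = KBinsDiscretizer(n_bins=2, encode="ordinal").fit_transform(X_filled)  # discretize into bins
X_onehot = OneHotEncoder().fit_transform(X_cat).toarray()                        # one-hot encode categories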

Read more:

Installation

You can install Cooka with Docker, pip, or from source code.

Using pip

Cooka requires Python 3.6 or above. Use pip to install it:

pip install --upgrade pip setuptools # (optional)
pip install cooka

Then start the Cooka web server:

cooka server

Open a browser and visit http://<your-ip>:8000 to use Cooka. If you want to integrate with Jupyter Notebook, please refer to this guide.
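
To quickly confirm the server is reachable, you can run a minimal check like the one below (assuming Cooka runs locally on the default port 8000):

from urllib.request import urlopen

# An HTTP 200 status means the Cooka web UI is up and serving pages
resp = urlopen("http://localhost:8000")
print(resp.status)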

Using Docker

You can also run Cooka through Docker with the following command:

docker run -ti -p 8000:8000 -p 9001:9001 datacanvas/cooka:latest
# Port 9001 is the supervisor web UI (used to manage processes); the account/password is user/123
# Port 8000 is the Cooka web UI

Open a browser and visit http://<your-ip>:8000 to use Cooka.

If you want to integrate with Jupyter Notebook, please specify the URL of the Jupyter server running in the container:

docker run -ti -p 8000:8000 -p 9001:9001 -p 8888:8888 -e COOKA_NOTEBOOK_PORTAL=http://<your_ip>:8888 datacanvas/cooka:latest
# Port 8888 is the Jupyter notebook server

You can persist data on the host:

docker run -v /path/to/cooka-config-dir:/root/.config/cooka -v /path/to/cooka-data:/root/cooka -ti -p 8000:8000 -p 9001:9001 datacanvas/cooka:latest
# Config file is at: /root/.config/cooka/cooka.py
# User data is at: /root/cooka

Using source code

The frontend is developed with React, so we need to install Node.js >= 8.0.0 (get it at https://nodejs.org) and then install yarn:

npm install yarn -g

Finally, build the frontend and install everything:

pip install --upgrade pip setuptools
git clone git@github.com:DataCanvasIO/Cooka.git

cd Cooka
python setup.py buildjs  # build frontend
python setup.py install

User Guide

The purpose of this document is to help you use Cooka in a web browser, including:

  • Manage dataset

  • Preview dataset

  • Insight dataset

  • Design experiment

  • Experiment list

We recommend Chrome v59 or above to visit Cooka.

Manage dataset

You can upload data or import it from the server for training. The data file should meet the following conditions (see the check sketch after this list):

  1. At least 2 columns

  2. In CSV format

  3. With or without headers
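
For example, a quick way to check a file against these conditions before uploading it (a sketch using pandas; train.csv is a placeholder path):

import pandas as pd

# Use header=None if the file has no header row
df = pd.read_csv("train.csv")
assert df.shape[1] >= 2, "Cooka needs a dataset with at least 2 columns"
print(df.shape)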

You can also choose a sampling strategy to analyze the dataset. Supported sampling strategies:

  • by rows

  • by percentage

  • whole data

Datasets can be added on the dataset list page:

_images/cooka_dataset_home.png

Upload

Users can upload files to Cooka through the browser to create datasets:

_images/cooka_dataset_upload.png

Import

Users can import files from the server into Cooka. This approach is friendly to large files. Import process:

  1. Enter the file path on the server and wait for the check to pass

  2. Click the “Analyze” button; Cooka will analyze the file and display the progress in the bar on the right

  3. Check the dataset name and click the “Create” button to confirm

_images/cooka_dataset_import.png

Preview dataset

You can view CSV file data as a table on the “Preview” page:

_images/cooka_dataset_preview.png

Insight dataset

Cooka will analyze the dataset and report information including:

  • distribution of different feature types

  • data type, feature type, uniques, missing percentage, and linear correlation of each feature

_images/cooka_dataset_insight.png

For categorical features, it shows the mode and the distribution of values:

_images/cooka_dataset_categorical.png

For continuous features, you can view the min/max/mean/median/standard deviation and the value distribution or interval distribution:

_images/cooka_dataset_continuous.png

The value distribution:

_images/cooka_dataset_continuous_2.png

For datetime features, you can see statistics by year, month, day, hour, and week:

_images/cooka_dataset_datetime.png

Cooka can flag poor-quality features and the reason, for example:

_images/cooka_dataset_missing.png

The reason may be:

  1. Correlation is too low

  2. Missing percentage is too high

  3. Constant feature

  4. Id-ness feature (every value is unique)

Design experiment

Users can design modeling experiments to define real-life problems as modeling tasks. On the data exploration page, users can select a column as the target column:

_images/cooka_experiment_design.png

On the experiment design page, you can choose quick training mode or performance training mode. Quick mode uses a general search space and fewer search trials, while performance mode uses a more comprehensive search space and more trials. Quick mode balances training time against model quality, while performance mode sacrifices time to improve model quality:

_images/cooka_experiment_design_1.png

The task type is inferred from the target column. There are two experiment engines, HyperDT and HyperGBM, which use neural network and GBM algorithms respectively. If your data is in datetime order, you can select a datetime series column; Cooka will then use the older data to train and the newer data to test the model (a sketch of this split follows the screenshot below):

_images/cooka_experiment_design_2.png
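
The idea of splitting in datetime order can be sketched with pandas (illustrative only; Cooka performs this split for you when a datetime series column is selected, and order_time is a hypothetical column name):

import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["order_time"])  # "order_time" is a hypothetical datetime column
df = df.sort_values("order_time")
cut = int(len(df) * 0.8)
train, test = df.iloc[:cut], df.iloc[cut:]  # older rows train the model, newer rows test it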

Experiment list

You can see the running status of the training task on the experiment list page, such as training progress and model score:

_images/cooka_experiment_list.png

Cooka supports early stopping: when the model's performance can no longer be improved, training is terminated in advance to save computing resources.
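
Conceptually, early stopping works like the following sketch (an illustration of the general technique, not Cooka's actual implementation):

import random

def run_trial(i):
    # Stand-in for one search trial; in practice this would train and score a model
    return random.random()

best_score, patience, no_improvement = float("-inf"), 5, 0
for trial in range(50):  # 50 mirrors the "performance" mode trial budget
    score = run_trial(trial)
    if score > best_score:
        best_score, no_improvement = score, 0
    else:
        no_improvement += 1
    if no_improvement >= patience:  # no improvement for `patience` trials in a row
        break  # stop searching early to save computing resources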

Model evaluation

For completed experiments, you can see the model evaluation, including:

  1. Confusion matrix and ROC curve for binary-classification

  2. Evaluation metrics (illustrated with a scikit-learn sketch below)

_images/cooka_experiment_evaluation.png

The ROC curve:

_images/cooka_experiment_roc.png
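
As an illustration of these evaluations, the same quantities can be computed with scikit-learn for a binary task (toy labels, not taken from Cooka):

from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]               # toy ground-truth labels
y_proba = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2]  # toy predicted probabilities
y_pred = [1 if p >= 0.5 else 0 for p in y_proba]

print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print(roc_auc_score(y_true, y_proba))     # AUC, the default binary classification optimize metric
print(accuracy_score(y_true, y_pred))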

During the search process, the Y-axis values represent the hyperparameters used in training, and the line color represents the model score; the darker the color, the higher the score:

_images/cooka_experiment_optimaize.png

Model predict

The model is saved when training ends. Users can upload test data to make predictions with the model, and the prediction progress is displayed on the page:

_images/cooka_experiment_predict.png

The results can be downloaded when the prediction finishes.

Export to notebook

You can export an experiment to a notebook for custom modeling:

_images/cooka_experiment_notebook_1.png

An example notebook:

_images/cooka_experiment_notebook_2.png

The notebook also contains an explanation of the predictions:

_images/cooka_experiment_notebook_3.png

And feature importance:

_images/cooka_experiment_notebook_4.png

If it is a binary classification task, the ROC curve and confusion matrix of the model will also be included.

Configuration

Configuration file

Cooka provides a command to generate a config template:

 cooka generate-config

# Configuration file for cooka

# HTTP Server port
# c.CookaApp.server_port = 8000

# Language, zh_CN, en_US or auto; if auto, localization is read from the user's browser
# c.CookaApp.language = "auto"

# Data to storage
# c.CookaApp.data_directory = "~/cooka"

# Integrate with Jupyter; the Jupyter notebook working directory should be `c.CookaApp.data_directory`
# c.CookaApp.notebook_portal = "http://localhost:8888"

# Default optimize metric
# c.CookaApp.optimize_metric = {
#     "multi_classification_optimize": "accuracy",
#     "binary_classification": "auc",
#     "regression": "rmse"
# }

# Default trial nums
# c.CookaApp.max_trials = {
#     "performance": 50,
#     "quick": 10,
#     "minimal": 1
# }

Write it to the configuration file at ~/.config/cooka/cooka.py:

mkdir -p ~/.config/cooka/
cooka generate-config > ~/.config/cooka/cooka.py
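
You can then uncomment and edit the entries you need. Below is a minimal sketch of a customized ~/.config/cooka/cooka.py (the values are illustrative; the keys come from the template above):

# ~/.config/cooka/cooka.py -- illustrative overrides of the generated template
c.CookaApp.server_port = 8000
c.CookaApp.language = "en_US"
c.CookaApp.data_directory = "~/cooka"
c.CookaApp.notebook_portal = "http://localhost:8888"
c.CookaApp.max_trials = {
    "performance": 50,
    "quick": 10,
    "minimal": 1
}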

Integrate with Jupyter Notebook

1. Install dependencies

In the experiment list, you can export an experiment to a notebook: _images/export_to_notebook.png

This requires the following Python modules:

  • shap: model explanation

  • jupyterlab: notebook server

  • matplotlib: plot in notebook

You may refer to this guide to install shap.

Install jupyterlab using pip:

pip install jupyterlab

matplotlib depends on the system package graphviz. Taking CentOS 7 as an example, install it with:

yum install graphviz

and then install matplotlib using pip:

pip install matplotlib
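
A quick check that the notebook dependencies are importable from the environment Cooka uses (a minimal sanity check):

import shap        # model explanation
import matplotlib  # plotting inside the notebook
import jupyterlab  # notebook server

print(shap.__version__, matplotlib.__version__, jupyterlab.__version__)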

2. Start jupyter

Start a JupyterLab server in the Cooka working directory, which defaults to ~/cooka:

cd ~/cooka
jupyter-lab --ip=0.0.0.0 --no-browser --allow-root --NotebookApp.token= 

3. Configure cooka

Then configure the notebook portal in the Cooka config file ~/.config/cooka/cooka.py:

c.CookaApp.notebook_portal = "http://<change_to_you_jupyter_ip>:8888"

Finally, start the web server and try to export an experiment to a notebook:

cooka server

Release Note

Version 0.1.2

Experiment design

  • Support inputting a random state to split the dataset

Experiment list

  • Use the reward metric in the visualmap of the hyperparameters line chart (bug fix)

Other

  • Update HyperGBM to 0.2.2

Version 0.1.1

Dataset management

  • Search

  • Delete

  • Upload or import CSV
    • Sampling analysis

    • Support no column headers

    • Inferring feature types

Dataset preview

  • View the original file online

  • Scrolling

Dataset insight

  • Distribution of feature type

  • Data type, feature type, missing percentage, uniques, linear correlation

  • Recognize Id-ness, constant, and high-missing-percentage features

  • Feature search

  • Datetime features
    • Display by year, month, day, hour, week

  • Categorical features
    • Distribution of values

    • Mode

  • Continuous features
    • Distribution of interval

    • Distribution of values

    • max, min, median, mean, standard deviation

Experiment design

  • Recommend experiment options

  • HyperGBM and HyperDT as experiment engines

  • Quick, performance training mode

  • Train-Validation-Holdout data partition

  • Split data in datetime order

  • Support binary classification, multi-classification, regression

Experiment list

  • Training progress

  • Remaining time estimation

  • Confusion matrix and ROC curve for binary-classification

  • Evaluation metrics
    • Binary classification: Accuracy, F1, Fbeta, Precision, Recall, AUC, Log Loss

    • Multi-classification: Accuracy, F1, Fbeta, Precision, Recall, Log Loss

    • Regression: EVS, MAE, MSE, RMSE, MSLE, R2, MedianAE

  • View train log and source code

  • Export to notebook

  • Hyper-params

  • Batch predict

DataCanvas

_images/dc_logo_1.png

Cooka is an open source project created by DataCanvas.