Overview

Scope

High Level Protocol - Training

  1. The BI tool has data on which predictive analytics would be valuable. 
  2. The BI tool asks the AI platform, through OBAIC, to train/prepare a model that accepts features of certain types (numeric, categorical, text, etc.), providing a token that grants access to the training data together with a SQL statement to run against the datastore.
  3. The BI tool polls for the status/result of the training. When training is complete, results and performance metrics are returned.
  4. The AI vendor provides predictions on data shared by the BI vendor, again using an access token. (An end-to-end sketch of this flow follows the list.)
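
The end-to-end flow can be pictured as a short client-side loop. The sketch below is purely illustrative: the function names mirror the functional API described later in this document (TrainModel, GetModelStatus, PredictWithModel), and the obaic client object, data source details, and polling interval are hypothetical placeholders rather than part of the protocol.

import time

# Illustrative BI-side flow against a hypothetical OBAIC client object.
def train_and_predict(obaic, data_token):
    # (2) Ask the AI platform to train a model; access to the training data is
    #     granted through a token plus a SQL statement to run against the datastore.
    model_id = obaic.TrainModel(
        inputs=[{"name": "customerAge", "type": "numeric"}],
        outputs=[{"name": "canceledMembership", "type": "binary"}],
        modelOptions={},
        dataConfig={
            "sourceType": "snowflake",
            "endpoint": "some/endpoint",
            "bearerToken": data_token,
            "query": "SELECT foo FROM bar WHERE baz",
        },
    )

    # (3) Poll until training reaches a terminal state.
    while obaic.GetModelStatus(model_id)["status"] not in ("deployed", "errored"):
        time.sleep(30)

    # (4) Request predictions on new data, again shared via an access token.
    return obaic.PredictWithModel(model_id, {
        "sourceType": "snowflake",
        "endpoint": "some/endpoint",
        "bearerToken": data_token,
        "query": "SELECT foo FROM bar WHERE baz",
    })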

High Level Protocol - Inference

REST APIs

All of the REST API calls presented below use bearer tokens for authorization. The {prefix} of each API is configurable on the hosting server. This protocol is inspired by Delta Sharing.

List Models - Step (1)


HTTP Request

  Method: GET
  Header: Authorization: Bearer {token}
  URL: {prefix}/models
  Query Parameters:

    maxResults (type: Int32, optional): The maximum number of results per page that should be returned. If the number of available results is larger than maxResults, the response will provide a nextPageToken that can be used to get the next page of results in subsequent list requests. The server may return fewer than maxResults items even if more are available; the client should check nextPageToken in the response to determine whether more are available. Must be non-negative. 0 returns no results, but nextPageToken may still be populated.

    pageToken (type: String, optional): Specifies a page token to use. Set pageToken to the nextPageToken returned by a previous list request to get the next page of results. nextPageToken will not be returned in a response if there are no more results available.




HTTP Response

  Header: Content-Type: application/json; charset=utf-8
  Body:

{
  "items": [
    {
      "name": "string",
      "id": "string"
    }
  ],
  "nextPageToken": "string"
}

  • items will be an empty array when no results are found. 
  • id is the key used to retrieve the model in subsequent calls. Its value must be unique across the AI server and immutable throughout the model's lifecycle.
  • nextPageToken will be omitted when there are no additional results.


Example:

{
   "models": [
      {
         "name": "Model 1",
         "id": "6d4b571a-80ca-41ef-bc67-b158f4352ad8"   
      },
      {
         "name": "Model 2",
         "id": "70d9ab9d-9a64-49a8-be4d-d3a678b4ab16"
      },
      {
         "name": "Model 3",
         "id": "99914a97-5d2e-4b9f-b81a-1d43c9409162"
      },
      {
         "name": "Model 4",
         "id": "8295bfda-7901-43e8-9d31-81fd1c3210ee"
      },
      {
         "name": "Model 5",
         "id": "0693c224-3a3f-4fe7-bbbe-c70f93d15f12"
      }
   ],
   "nextPageToken": "3xXc4ZAsqZQwgejt"
}
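
As a concrete illustration of the pagination parameters above, the loop below walks through every page of List Models. It is a minimal sketch assuming Python with the requests library; the base URL and token are placeholders for the configurable {prefix} and a real bearer token.

import requests

BASE = "https://ai.example.com/obaic"   # placeholder for the configurable {prefix}
HEADERS = {"Authorization": "Bearer <token>"}

def list_all_models():
    items, page_token = [], None
    while True:
        params = {"maxResults": 100}
        if page_token:
            params["pageToken"] = page_token
        resp = requests.get(f"{BASE}/models", headers=HEADERS, params=params)
        resp.raise_for_status()
        body = resp.json()
        items.extend(body.get("items", []))
        page_token = body.get("nextPageToken")
        if not page_token:              # missing when there are no more results
            return items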

Get Model - Step (2)


HTTP Request

  Method: GET
  Header: Authorization: Bearer {token}
  URL: {prefix}/model/{modelID}
  URL Parameters:

    {modelID}: The case-insensitive ID of the model returned in List Models in Step (1)




HTTP Response

  Header: Content-Type: application/json; charset=utf-8
  Body:

{
  "id": "string",
  "name": "string",
  "revision": "int",
  "format": {
    "name": "string", 
    "version": "string"
  },
  "algorithm": "string", // Artificial neural network | Decision trees | Support-vector machines | Regression analysis | Bayesian networks | Genetic algorithms | Proprietary
  "tags": ["string"],
  "dependency": "string",
  "creator": "string",
  "description": "string",
  "input": {
    "fields": [
      {
        "name": "string",
        "opType": "string",
        "dataType": "string",
        "taxonomy": "string",
        "example": "string",
        "allowMissing": "boolean",
        "description": "string"
      }, ...
    ],
    "$ref": "string"
  },
  "output": {
    "fields": [
      {
        "name": "string",
        "opType": "string",
        "dataType": "string",
        "taxonomy": "string",
        "example": "string",
        "allowMissing": "boolean",
        "description": "string"
      }, ...
    ],
    "$ref": "string"
  },
  "performance": {
    "metric": "string",
    "value": "float"
  },
  "rating": "int",
  "url": "string"
}

  • format.name: PMML, ONNX, or other formats to be confirmed
  • algorithm: Artificial neural network | Decision trees | Support-vector machines | Regression analysis | Bayesian networks | Genetic algorithms | Proprietary
  • tags: describe what this model is used for e.g. Agriculture | Banking | Computer vision | Credit-card fraud detection | Handwriting recognition | Insurance | Machine translation | Marketing | Natural language processing | Online advertising | Recommender systems | Sentiment analysis | Telecommunication | Time-series forecasting | etc.
  • opType: categorical | ordinal | continuous
  • dataType: string | integer | float | double | boolean | date | time | dateTime
  • $ref: reference to external schema for the format used
  • metric: based on model used, metric can be accuracy, precision, recall, ROC, AUC, Gini coefficient, Log loss, F1 score, MAE, MSE, etc.
  • url: link to the real model for download


Example:

{
    "id": "6d4b571a-80ca-41ef-bc67-b158f4352ad8",
    "name": "Model 1",
    "revision": 3,
    "format": { 
      "name": "PMML",
      "version": "4.3"
    },
    "algorithm": "Neural Network", 
    "tags": [
      "Anomaly detection",         
      "Banking"                    
    ],                              
    "dependency", "",
    "creator": "John Doe",
    "description": "This is a predictive model, refer to {input} and {output} for detailed format of each field, such as value range of a field, as well as possible predictions the model will gave. You may also refer to the example data here.",
    "input": {
      "fields": [
        {
          "name": "Account ID",
          "opType": "categorical",
          "dataType": "string",
          "taxonomy": "ID",
          "example": "account abc-001",
          "allowMissing": false,
          "description": "unique value"
        },
        {
          "name": "Account Balance",
          "opType": "continuous",
          "dataType": "double",
          "taxonomy": "currency",
          "example": "1,378,560.00",
          "allowMissing": true,
          "description": "Minimum: 0, Maximum: 999,999,999.00"
        }
      ],
      "$ref": "http://dmg.org/pmml/v4-3/pmml-4-3.xsd"
    },
    "output": {
      "fields": [
        {
          "name": "Churn",
          "opType": "continuous",
          "dataType": "string",
          "taxonomy": "ID",
          "example": "0.67",
          "allowMissing": false,
          "description": "the possibility of the account stop doing business with a company over 6 months"
        }
      ],
      "ref": "http://dmg.org/pmml/v4-3/pmml-4-3.xsd"                                                       
    }
    "performance": {            
      "metric": "accuracy",     
      "value": 0.85
    },
    "rating": 5,
    "url": "uri://link_to_the_model"  
}
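
Before calling a model, a client typically needs its input schema. The snippet below, a minimal sketch assuming Python with the requests library and placeholder {prefix}/token values, fetches a model's detail and prints the fields a caller must supply.

import requests

BASE = "https://ai.example.com/obaic"   # placeholder for the configurable {prefix}
HEADERS = {"Authorization": "Bearer <token>"}

def describe_model(model_id):
    resp = requests.get(f"{BASE}/model/{model_id}", headers=HEADERS)
    resp.raise_for_status()
    model = resp.json()
    # List every input field with its type and whether it may be omitted.
    for field in model["input"]["fields"]:
        required = "optional" if field["allowMissing"] else "required"
        print(f'{field["name"]}: {field["dataType"]} ({field["opType"]}, {required})')
    return model

describe_model("6d4b571a-80ca-41ef-bc67-b158f4352ad8")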

Error - Applies to all API calls above


HTTP Response

  Header: Content-Type: application/json
  Body:

  {
    "errorCode": "string",
    "message": "string"
  }






Next Steps

Potential Future Enhancement

FAQ

  1. Why should the AI vendor share models with the BI vendor?
  2. Who owns the model and the data?
  3. How do you deal with security?

References

Authors

Name               Affiliation
Cupid Chan         Pistevo Decision
Xiangxiang Meng    Redfin
Deepak Karuppiah   MicroStrategy
Nancy Rausch       SAS
Dalton Ruer        Qlik
Sachin Sinha       Microsoft
Yi Shao            IBM
Jeffrey Tang       Predibase
Lingyan Yin        Salesforce



Train a New Model

function TrainModel(inputs, outputs, modelOptions, dataConfig) -> UUID


Example params:

{
  "inputs":[
      {
        "name":"customerAge",
        "type":"numeric"
      },
      {
        "name":"activeInLastMonth",
        "type":"binary"
      }
  ],
  "outputs":[
      {
        "name":"canceledMembership",
        "type":"binary"
      }
  ],
  "modelOptions": {

      “providerSpecificOption”: “value”

   },
  "data":{
      "sourceType":"snowflake",
      "endpoint":"some/endpoint",
      "bearerToken":"...",
      "query":"SELECT foo FROM bar WHERE baz"
  }
}


Model configuration is based on configs from the open-source Ludwig project. At a minimum, we should be able to define inputs and outputs in a fairly standard way. Other model configuration parameters are subsumed by the modelOptions field.

The data stanza provides a bearer token allowing the ML provider to access the required data table(s) for training. The provided SQL query indicates how the training data should be extracted from the source.
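
On the provider side, the data stanza carries everything needed to pull the training set. The sketch below is only an illustration of that idea: connect_to_source is a hypothetical helper standing in for whatever warehouse connector the ML provider actually uses, and no particular driver API is implied.

# Hypothetical provider-side handling of the "data" stanza.
def fetch_training_data(data_config):
    # connect_to_source is a stand-in for the provider's own connector
    # (e.g. a Snowflake driver); it is not defined by this protocol.
    conn = connect_to_source(
        source_type=data_config["sourceType"],    # e.g. "snowflake"
        endpoint=data_config["endpoint"],
        bearer_token=data_config["bearerToken"],  # scoped access granted by the BI tool
    )
    # The SQL supplied by the BI tool defines exactly which rows and columns
    # the provider may read for training.
    return conn.execute(data_config["query"]).fetchall()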

Example response:

{
  "modelUUID":"abcdef0123"
}


Consider also a fully SQL-like interface taking BigQuery ML model creation as an example and generalizing:

CREATE MODEL (
  customerAge WITH ENCODING (
    type=numeric
  ),
  activeInLastMonth WITH ENCODING (
    type=binary
  ),
  canceledMembership WITH DECODING (
    type=binary
  )
)
FROM myData (
  sourceType=snowflake,
  endpoint="some/endpoint",
  bearerToken=<...>
)
AS (SELECT foo FROM BAR)
WITH OPTIONS ();

List Models

function ListModels() -> List[UUID, Status]


Example response:

{
  "models":[
    { "modelUUID": "abcdef0123", "status": "deployed" },
    { "modelUUID": "1234567890", "status": "training" }
  ]
}

Show Model Config

function GetModelConfig(UUID) -> Config


Example response:

{
  "inputs":[
      {
        "name":"customerAge",
        "type":"numeric"
      },
      {
        "name":"activeInLastMonth",
        "type":"binary"
      }
  ],
  "outputs":[
      {
        "name":"canceledMembership",
        "type":"binary"
      }
  ],
  "modelOptions": {},
  "data":{
      "sourceType":"snowflake",
      "query":"SELECT foo FROM bar WHERE baz"
  }
}


The response here is essentially a pared-down version of the original training configuration.

Get Model Status

function GetModelStatus(UUID) -> Status


Example response:

{
  "status": "errored",
  "message": "Failed to train"
}


Get Model Metrics

Get core evaluation metrics for a trained model.

function GetModelMetrics(UUID) -> Metrics


Example response:

{
  "accuracy":0.781,
  "lossType":"cross-entropy",
  "loss":0.0238
}


Predict Using Trained Model

function PredictWithModel(UUID, dataConfig) -> Predictions


Example params

{
  "uuid": "abcdef12345",
  "data":{
      "sourceType":"snowflake",
      "endpoint":"some/endpoint",
      "bearerToken":"...",
      "query":"SELECT foo FROM bar WHERE baz"
  }
}

    

The data stanza is very similar to the one in the train request; it designates the feature data on which to predict.

Example response (shown as JSON here for convenience; large responses would not necessarily be returned this way):

{
  "data":[
      {
        "customerAge":2,
        "activeInLastMonth":"false",
        "predicted__canceledSubscription":"true"
      },
      {
        "customerAge":9,
        "activeInLastMonth":"true",
        "predicted__canceledSubscription":"false"
      }
  ]
}


Note that directly returning a large result set in a single response is not practical. In practice, the results could be streamed through something like a persistent socket connection; one way this might look is sketched below.
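
One possible approach is to stream newline-delimited JSON records over HTTP. The endpoint shape, record framing, and use of the requests library in the sketch below are illustrative assumptions; the protocol does not currently define a streaming transport.

import json
import requests

# Hypothetical streaming variant of the predict call; the URL shape and
# newline-delimited JSON framing are assumptions, not part of the protocol.
def stream_predictions(base_url, token, model_uuid, data_config):
    resp = requests.post(
        f"{base_url}/model/{model_uuid}/predict",
        headers={"Authorization": f"Bearer {token}"},
        json={"data": data_config},
        stream=True,                    # avoid buffering the whole result set
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:                        # skip keep-alive blank lines
            yield json.loads(line)      # one prediction record per line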