Overview

Open Business and Artificial Intelligence Connectivity (OBAIC) borrows its concept from Open Database Connectivity (ODBC), an interface that makes it possible for applications to access data from a variety of database management systems (DBMSs). The aim of OBAIC is to be the interface that makes it possible for BI tools to access machine learning models from a variety of AI platforms: "AI ODBC for BI".
Through OBAIC, BI vendors can connect to any AI platform freely, without concern for the underlying implementation or for how the AI platform trains the model or infers the result. This mirrors ODBC, where how the database stores the data and executes the query is left to the database.
The committee has decided that this standard will only define the REST API protocol for how AI and BI communicate, with calls initiated from BI to AI. The design and actual implementation of OBAIC, such as whether it should be server-based, serverless, or Docker-based, is left to the vendor, or, if this protocol grows, to a separate open-source project that provides such an implementation.
There are 3 key aspects to consider when designing this standard:
- BI - what specific call do I need this standard to provide so that I can better leverage any underlying AI/ML framework?
- AI - what should be the common denominator an AI framework should provide to support this standard?
- Data - Should data be moved between AI and BI during communication (passed by value), or kept in the same location (passed by reference)?

Scope

We understand that there are 2 key steps in machine learning: model training and result inference. This first release of the protocol focuses only on inference. Training is covered here, but it is subject to further discussion.

Overall Flow

OBAIC overall flow

REST APIs

All of the REST API calls presented below use bearer tokens for authorization. The {prefix} of each API is configurable on the hosting server. This protocol is inspired by Delta Sharing.
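
As a rough illustration of these two conventions, the Python sketch below builds (but does not send) an authorized request. The helper name `build_request`, the example host, and the token value are assumptions made for illustration; the protocol itself only fixes the `Authorization: Bearer {token}` header and the configurable `{prefix}`.

```python
from urllib.request import Request


def build_request(prefix: str, path: str, token: str) -> Request:
    """Build a GET request for an OBAIC endpoint under the configured prefix.

    `prefix` and `token` are placeholders: the server operator decides the
    real prefix and issues the real token.
    """
    url = prefix.rstrip("/") + "/" + path.lstrip("/")
    return Request(url, method="GET", headers={"Authorization": f"Bearer {token}"})


# Hypothetical host and token, for illustration only.
request = build_request(
    "https://ai.example.com/obaic",
    "model/6d4b571a-80ca-41ef-bc67-b158f4352ad8",
    "my-token",
)
print(request.full_url)
print(request.get_header("Authorization"))
```

Nothing is sent over the network here; passing the object to `urllib.request.urlopen` would perform the call.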

List Models - Step (1)

HTTP Request	Value
Method	`GET`
Header	`Authorization: Bearer {token}`
URL	`{prefix}/models`
Query Parameters	`maxResults` (type: Int32, optional): The maximum number of results per page that should be returned. If the number of available results is larger than `maxResults`, the response will provide a `nextPageToken` that can be used to get the next page of results in subsequent list requests. The server may return fewer than `maxResults` items even if there are more available. The client should check `nextPageToken` in the response to determine if there are more available. Must be non-negative. 0 will return no results, but `nextPageToken` may be populated.
	`pageToken` (type: String, optional): Specifies a page token to use. Set `pageToken` to the `nextPageToken` returned by a previous list request to get the next page of results. `nextPageToken` will not be returned in a response if there are no more results available.

HTTP Response Value

Header Content-Type: application/json; charset=utf-8

Body

{
  "items": [
    {
      "name": "string",
      "id": "string"
    }
  ],
  "nextPageToken": "string"
}

`items` will be an empty array when no results are found.
The `id` field is the key used to retrieve the model in subsequent calls. Its value must be unique across the AI server and immutable throughout the model's lifecycle.
`nextPageToken` will be missing when there are no additional results.
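
The paging rules above can be sketched as a small client-side loop. This is a minimal sketch, not part of the protocol: `fetch_page` is a hypothetical callable that performs the List Models request with a given `pageToken` (or `None` for the first page) and returns the decoded JSON body. The loop stops on a missing `nextPageToken` rather than on an empty `items` array, because a page may legitimately contain fewer items than requested.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_models(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield every model, following nextPageToken until it is absent."""
    page_token = None
    while True:
        body = fetch_page(page_token)
        yield from body.get("items", [])
        page_token = body.get("nextPageToken")
        if not page_token:
            break


# A stand-in server with two pages, used only to exercise the loop.
_PAGES = {
    None: {"items": [{"name": "Model 1", "id": "a"}], "nextPageToken": "t1"},
    "t1": {"items": [{"name": "Model 2", "id": "b"}]},
}

model_ids = [m["id"] for m in iter_models(lambda token: _PAGES[token])]
print(model_ids)
```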

Example:

{
   "items": [
      {
         "name": "Model 1",
         "id": "6d4b571a-80ca-41ef-bc67-b158f4352ad8"   
      },
      {
         "name": "Model 2",
         "id": "70d9ab9d-9a64-49a8-be4d-d3a678b4ab16"
      },
      {
         "name": "Model 3",
         "id": "99914a97-5d2e-4b9f-b81a-1d43c9409162"
      },
      {
         "name": "Model 4",
         "id": "8295bfda-7901-43e8-9d31-81fd1c3210ee"
      },
      {
         "name": "Model 5",
         "id": "0693c224-3a3f-4fe7-bbbe-c70f93d15f12"
      }
   ],
   "nextPageToken": "3xXc4ZAsqZQwgejt"
}

Get Model - Step (2)

HTTP Request	Value
Method	`GET`
Header	`Authorization: Bearer {token}`
URL	`{prefix}/model/{modelID}`
URL Parameters	{modelID}: The case-insensitive ID of the model returned by List Models in Step (1)

HTTP Response Value

Header

Content-Type: application/json; charset=utf-8

Body

{
  "id": "string",
  "name": "string",
  "revision": "int",
  "format": {
    "name": "string",
    "version": "string"
  },
  "algorithm": "string", // Artificial neural network | Decision trees | Support-vector machines | Regression analysis | Bayesian networks | Genetic algorithms | Proprietary
  "tags": ["string"],
  "dependency": "string",
  "creator": "string",
  "description": "string",
  "input": {
    "fields": [
      {
        "name": "string",
        "opType": "string",
        "dataType": "string",
        "taxonomy": "string",
        "example": "string",
        "allowMissing": "boolean",
        "description": "string"
      }, ...
    ],
    "$ref": "string"
  },
  "output": {
    "fields": [
      {
        "name": "string",
        "opType": "string",
        "dataType": "string",
        "taxonomy": "string",
        "example": "string",
        "allowMissing": "boolean",
        "description": "string"
      }, ...
    ],
    "$ref": "string"
  },
  "performance": {
    "metric": "string",
    "value": "float"
  },
  "rating": "int",
  "url": "string"
}

format.name: PMML, ONNX, or other formats to be confirmed
algorithm: Artificial neural network | Decision trees | Support-vector machines | Regression analysis | Bayesian networks | Genetic algorithms | Proprietary
tags: describe what this model is used for e.g. Agriculture | Banking | Computer vision | Credit-card fraud detection | Handwriting recognition | Insurance | Machine translation | Marketing | Natural language processing | Online advertising | Recommender systems | Sentiment analysis | Telecommunication | Time-series forecasting | etc.
opType: categorical | ordinal | continuous
dataType: string | integer | float | double | boolean | date | time | dateTime
$ref: reference to external schema for the format used
metric: based on model used, metric can be accuracy, precision, recall, ROC, AUC, Gini coefficient, Log loss, F1 score, MAE, MSE, etc.
url: link to the real model for download
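
One way a BI tool might use this metadata is to check a record against the model's input fields before asking for a prediction. The sketch below is only an illustration: the function name `missing_required_fields` is hypothetical, and treating an absent `allowMissing` as false is an assumption, since the protocol does not state a default.

```python
from typing import Dict, List


def missing_required_fields(model: Dict, record: Dict) -> List[str]:
    """Return names of input fields the record lacks but the model requires."""
    problems = []
    for field in model["input"]["fields"]:
        value = record.get(field["name"])
        # Assumption: a field without allowMissing is treated as required.
        if value is None and not field.get("allowMissing", False):
            problems.append(field["name"])
    return problems


# Trimmed-down metadata in the shape of the Get Model response body.
model = {
    "input": {
        "fields": [
            {"name": "Account ID", "dataType": "string", "allowMissing": False},
            {"name": "Account Balance", "dataType": "double", "allowMissing": True},
        ]
    }
}

print(missing_required_fields(model, {"Account Balance": 10.0}))
```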

Example:

{
    "id": "6d4b571a-80ca-41ef-bc67-b158f4352ad8",
    "name": "Model 1",
    "revision": 3,
    "format": { 
      "name": "PMML",
      "version": "4.3"
    },
    "algorithm": "Artificial neural network",
    "tags": [
      "Anomaly detection",         
      "Banking"                    
    ],
    "dependency": "",
    "creator": "John Doe",
    "description": "This is a predictive model, refer to {input} and {output} for detailed format of each field, such as value range of a field, as well as possible predictions the model will give. You may also refer to the example data here.",
    "input": {
      "fields": [
        {
          "name": "Account ID",
          "opType": "categorical",
          "dataType": "string",
          "taxonomy": "ID",
          "example": "account abc-001",
          "allowMissing": false,
          "description": "unique value"
        },
        {
          "name": "Account Balance",
          "opType": "continuous",
          "dataType": "double",
          "taxonomy": "currency",
          "example": "1,378,560.00",
          "allowMissing": true,
          "description": "Minimum: 0, Maximum: 999,999,999.00"
        }
      ],
      "$ref": "http://dmg.org/pmml/v4-3/pmml-4-3.xsd"
    },
    "output": {
      "fields": [
        {
          "name": "Churn",
          "opType": "continuous",
          "dataType": "string",
          "taxonomy": "ID",
          "example": "0.67",
          "allowMissing": false,
          "description": "the probability that the account stops doing business with a company over 6 months"
        }
      ],
      "$ref": "http://dmg.org/pmml/v4-3/pmml-4-3.xsd"
    },
    "performance": {            
      "metric": "accuracy",     
      "value": 0.85
    },
    "rating": 5,
    "url": "uri://link_to_the_model"  
}

Error - Applies to all API calls above

HTTP Response	Value
Header	`Content-Type: application/json`
Body	`{ "errorCode": "string", "message": "string" }`
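
A client could turn this error body into an exception along the following lines. This is a sketch under stated assumptions: the class name `ObaicError` is hypothetical, and the status code and `errorCode` value in the usage lines are made up, since the protocol does not yet define specific codes.

```python
import json


class ObaicError(Exception):
    """Hypothetical exception wrapping an OBAIC error body."""

    def __init__(self, status: int, error_code: str, message: str):
        super().__init__(f"HTTP {status} [{error_code}]: {message}")
        self.status = status
        self.error_code = error_code
        self.message = message


def parse_error(status: int, body: bytes) -> ObaicError:
    """Decode the JSON error body, tolerating missing fields."""
    payload = json.loads(body.decode("utf-8"))
    return ObaicError(status, payload.get("errorCode", ""), payload.get("message", ""))


# Illustrative values only; real codes are left to a future revision.
error = parse_error(403, b'{"errorCode": "FORBIDDEN", "message": "token rejected"}')
print(error)
```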

Authors

Potential Future Enhancement

Define data pipeline to transform data before running
Define containerized model so that prediction can run in BI instead of in AI
Define format of nextPageToken
Define different types of errorCode and message for each API call

References

Tableau version of OBAIC: https://tableau.github.io/analytics-extensions-api/docs/ae_example_tabpy.html
Qlik version of OBAIC: https://github.com/qlik-oss/server-side-extension
Delta Sharing: https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md#delta-sharing-protocol

Decision to be made

Data file type: what types of data do we support? For example, Delta needs Parquet; what about RDBMS? Jeffrey's initial cut below can be modified to support multiple data types, depending on the use case.
- Inference: pass by value should be good enough if it is only for predicting
- Train: not immediate, maybe later in Phase 2
Metadata structure: what kind of JSON schema do we need?
Do we support only a specific model type (ONNX) or an arbitrary number of frameworks?
Decouple model (asking the model to predict and train) and data (listing, upload, download)
Finalize Logo

FAQ

Why should we share our model with you?

Ownership? Model and Data?

Security?

How can the data be accessed mechanically, for training?

This is a short doc illustrating a sample skeleton OBAIC protocol. This proposal envisions a data-centric workflow:

BI vendor has some data on which predictive analytics would be valuable.
BI vendor requests AI vendor (through OBAIC) to train/prepare a model that accepts features of a certain type (numeric, categorical, text, etc.)
BI vendor gives AI vendor a token to allow access to the training data with the above features. A SQL statement is a natural way to specify how to retrieve data from the datastore.
When model is trained, BI vendor can see the results of training (e.g., accuracy).
AI vendor provides predictions on data shared by BI vendor, again using an access token.

API

Train a New Model

function TrainModel(inputs, outputs, modelOptions, dataConfig) -> UUID

Example params:

{
"inputs":[
{
"name":"customerAge",
"type":"numeric"
},
{
"name":"activeInLastMonth",
"type":"binary"
}
],
"outputs":[
{
"name":"canceledMembership",
"type":"binary"
}
],
"modelOptions": {
"providerSpecificOption": "value"
},
"data":{
"sourceType":"snowflake",
"endpoint":"some/endpoint",
"bearerToken":"...",
"query":"SELECT foo FROM bar WHERE baz"
}
}

Model configuration is based on configs from the open-source Ludwig project. At a minimum, we should be able to define inputs and outputs in a fairly standard way. Other model configuration parameters are subsumed by the modelOptions field.

The data stanza provides a bearer token allowing the ML provider to access the required data table(s) for training. The provided SQL query indicates how the training data should be extracted from the source.
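
The request parameters described above could be assembled as in the following sketch. The helper names `feature` and `build_train_params` are hypothetical; the field names and values are taken from the example params, and the bearer token stays a placeholder.

```python
import json
from typing import Dict, List, Optional


def feature(name: str, type_: str) -> Dict[str, str]:
    """One input or output entry, e.g. feature("customerAge", "numeric")."""
    return {"name": name, "type": type_}


def build_train_params(
    inputs: List[Dict[str, str]],
    outputs: List[Dict[str, str]],
    data: Dict[str, str],
    model_options: Optional[Dict] = None,
) -> Dict:
    """Assemble the TrainModel params; provider options default to empty."""
    return {
        "inputs": inputs,
        "outputs": outputs,
        "modelOptions": model_options or {},
        "data": data,
    }


params = build_train_params(
    inputs=[feature("customerAge", "numeric"), feature("activeInLastMonth", "binary")],
    outputs=[feature("canceledMembership", "binary")],
    data={
        "sourceType": "snowflake",
        "endpoint": "some/endpoint",
        "bearerToken": "...",  # placeholder, as in the example params
        "query": "SELECT foo FROM bar WHERE baz",
    },
)
print(json.dumps(params, indent=2))
```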

Example response:

{
"modelUUID":"abcdef0123"
}

Consider also a fully SQL-like interface taking BigQuery ML model creation as an example and generalizing:

CREATE MODEL (
    customerAge WITH ENCODING (
        type=numeric
    ),
    activeInLastMonth WITH ENCODING (
        type=binary
    ),
    canceledMembership WITH DECODING (
        type=binary
    )
)
FROM myData (
    sourceType=snowflake,
    endpoint="some/endpoint",
    bearerToken=<...>
)
AS (SELECT foo FROM bar)
WITH OPTIONS ();

List Models

function ListModels() -> List[UUID, Status]

Example response:

{
"models":[
{ "modelUUID": "abcdef0123", "status": "deployed" },
{ "modelUUID": "1234567890", "status": "training" }
]
}

Show Model Config

function GetModelConfig(UUID) -> Config

Example response:

{
"inputs":[
{
"name":"customerAge",
"type":"numeric"
},
{
"name":"activeInLastMonth",
"type":"binary"
}
],
"outputs":[
{
"name":"canceledMembership",
"type":"binary"
}
],
"modelOptions": {},
"data":{
"sourceType":"snowflake",
"query":"SELECT foo FROM bar WHERE baz"
}
}

The response here is essentially a pared-down version of the original training configuration.
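
A server might derive this pared-down view from the stored training params roughly as follows. This is a sketch only: the function name is hypothetical, and the choice of `endpoint` and `bearerToken` as the keys to drop is inferred from comparing the two examples in this document, not stated by the protocol.

```python
from typing import Dict

# Assumption: these are the keys omitted from the config response,
# judging by the difference between the train params and this example.
SENSITIVE_DATA_KEYS = ("endpoint", "bearerToken")


def pared_down_config(train_params: Dict) -> Dict:
    """Return a copy of the training params without access details."""
    config = dict(train_params)
    config["data"] = {
        key: value
        for key, value in train_params.get("data", {}).items()
        if key not in SENSITIVE_DATA_KEYS
    }
    return config


train_params = {
    "inputs": [{"name": "customerAge", "type": "numeric"}],
    "outputs": [{"name": "canceledMembership", "type": "binary"}],
    "modelOptions": {},
    "data": {
        "sourceType": "snowflake",
        "endpoint": "some/endpoint",
        "bearerToken": "...",
        "query": "SELECT foo FROM bar WHERE baz",
    },
}
config = pared_down_config(train_params)
print(config["data"])
```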

Get Model Status

function GetModelStatus(UUID) -> Status

Example response:

{
"status": "errored",
"message": "Failed to train"
}
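
Because training is asynchronous, a BI client would likely poll this call until the model leaves the "training" state. The sketch below assumes only the three status values that appear in this document ("training", "deployed", "errored"); `get_status` is a hypothetical callable standing in for GetModelStatus, and the interval and poll limit are arbitrary defaults.

```python
import time
from typing import Callable, Dict


def wait_for_model(
    get_status: Callable[[], Dict],
    interval_s: float = 5.0,
    max_polls: int = 120,
) -> Dict:
    """Poll until the model leaves the "training" state or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status.get("status") != "training":
            return status
        time.sleep(interval_s)
    raise TimeoutError("model still training after max_polls checks")


# Stand-in for GetModelStatus: reports "training" twice, then "deployed".
_responses = iter(
    [{"status": "training"}, {"status": "training"}, {"status": "deployed"}]
)
final = wait_for_model(lambda: next(_responses), interval_s=0.0)
print(final["status"])
```

The caller still has to inspect the returned status, since "errored" ends the wait just as "deployed" does.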

Get Model Metrics

Get core evaluation metrics for a trained model.

function GetModelMetrics(UUID) -> Metrics

Example response:

{
"accuracy":0.781,
"lossType":"cross-entropy",
"loss":0.0238
}

Predict Using Trained Model

function PredictWithModel(UUID, dataConfig) -> Predictions

Example params:

{
"uuid": "abcdef12345",
"data":{
"sourceType":"snowflake",
"endpoint":"some/endpoint",
"bearerToken":"...",
"query":"SELECT foo FROM bar WHERE baz"
}
}

The data stanza is very similar to the one in the train request and designates the feature data on which to predict.

Example response (as JSON here for convenience, not necessarily for large responses):

{
"data":[
{
"customerAge":2,
"activeInLastMonth":"false",
"predicted__canceledMembership":"true"
},
{
"customerAge":9,
"activeInLastMonth":"true",
"predicted__canceledMembership":"false"
}
]
}

Note that directly returning a large response set is not a good idea. In practice, the results could be streamed through something like a persistent socket connection.
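
As one possible shape for such streaming, the sketch below reads predictions lazily from a line-oriented stream with one JSON object per line. The newline-delimited format is an assumption made for illustration, not something this proposal specifies, and the in-memory stream stands in for a real connection; prediction columns are omitted from the sample rows for brevity.

```python
import io
import json
from typing import IO, Dict, Iterator


def iter_predictions(stream: IO[str]) -> Iterator[Dict]:
    """Yield one prediction row per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for a streamed response body (assumed format).
body = io.StringIO(
    '{"customerAge": 2, "activeInLastMonth": "false"}\n'
    "\n"
    '{"customerAge": 9, "activeInLastMonth": "true"}\n'
)
rows = list(iter_predictions(body))
print(len(rows))
```

Because rows are yielded one at a time, the client never needs to hold the full result set in memory.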