Production: Using Artifacts

Any item registered in Corridor - be it a Data Element, Feature, Model, or Policy - can be extracted to be run in a separate environment. The artifact can run independently of Corridor, in an isolated runtime or production environment.

The artifact aims to:

  • Extract the logic/items registered in Corridor so they can be used outside Corridor
  • Be self-sufficient, with all information encapsulated in the artifact
  • Have minimal dependencies on the runtime environment where the artifact is later run

An artifact is created when an item is approved in Corridor, i.e. when a Data Element, Feature, Model, or Policy is finalized (Approved) and can no longer be edited.

Contents of an artifact

The artifact contains all the metadata as well as the dependencies needed to compute the registered item. It exposes a Python library/function which can be used to run the end-to-end logic that produces the exported item, starting from the initial tables in the Table Registry.

The artifact structure is similar for all types of items - be it a Data Element, Feature, Model, or Policy. Here, we take a Model artifact as an example.

The artifact consists of the following files:

model_a.b.c
├── metadata.json
├── input_info.json
├── ... (additional information about features etc. used)
├── python_dict
|     ├── __init__.py
|     ├── versions.json
|     └── Additional information
└── pyspark_dataframe
      ├── __init__.py
      ├── versions.json
      └── Additional information

The metadata.json contains metadata about the folder it is in. It has information about the model, its inputs, its dependent variable, etc. It also contains any other metadata registered in the platform, such as Groups, Permissible Purpose, etc.

The versions.json contains the versions of the libraries that were used during artifact creation - the Python version, any ML libraries, etc.
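
Before running an artifact, it can be helpful to compare the library versions recorded at artifact creation against the current runtime environment. Below is a minimal sketch; the file path follows the folder layout above, but the exact keys inside versions.json are assumptions and will depend on the item that was exported.

import json
import sys

# Load the versions recorded when the artifact was created.
# The keys (e.g. a Python entry, ML library names) are illustrative assumptions.
with open('model_a.b.c/python_dict/versions.json') as f:
    build_versions = json.load(f)

# Compare against the Python interpreter used in this runtime environment.
runtime_python = '.'.join(str(v) for v in sys.version_info[:3])
print('Versions at artifact creation:', build_versions)
print('Current Python runtime:', runtime_python)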

The input_info.json describes the input data tables that need to be sent to the artifact's main() function. The input always has to be sent in the form of:

{"table-ref1": <DATA-TABLE-1>,
 "table-ref2": <DATA-TABLE-2>}

The __init__.py files inside the pyspark_dataframe and python_dict folders contain the end-to-end Python function which can be used to run the entire artifact. They support different execution engines:

  • Batch execution with PySpark (pyspark_dataframe)
  • API execution in a Python environment (python_dict)

To run the artifact, simply call its main() function with the needed data.
The python_dict/__init__.py contains a main() function into which data can be sent in the form of a
Python dict, for low-latency execution. A data table in the dictionary format is described as a dict
with type/values. For example:

from datetime import datetime

{
    "col-alias1": {"type": "float", "values": [1.0]},
    "col-alias2": {"type": "str", "values": ["abc"]},
    "col-alias3": {"type": "datetime", "values": [datetime(year=2019, month=1, day=1)]},
}
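
As an illustration, a single-row pandas DataFrame can be converted into this type/values format with a small helper. This is a sketch under the assumption that only float, str, and datetime columns are present; the helper is not part of the artifact itself.

from datetime import datetime

import pandas as pd


def to_dict_table(df: pd.DataFrame) -> dict:
    """Convert a single-row pandas DataFrame into the type/values dictionary format."""
    type_names = {'float64': 'float', 'object': 'str', 'datetime64[ns]': 'datetime'}
    table = {}
    for col in df.columns:
        dtype = str(df[col].dtype)
        table[col] = {
            'type': type_names.get(dtype, 'str'),  # fall back to 'str' for other dtypes
            'values': list(df[col]),
        }
    return table


row = pd.DataFrame({
    'annual_income': [50000.0],
    'state': ['TX'],
    'create_time': [datetime(2019, 6, 29, 4, 34)],
})
print(to_dict_table(row))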

The pyspark_dataframe/__init__.py contains a main() function into which data can be sent in the form
of a PySpark DataFrame, for execution on large data. A data table in Spark is represented by a
Spark DataFrame.

Example: Calling python_dict artifact

This is an example of how to call the python_dict artifact:

import sys
sys.path.insert(0, 'policy_1.0.1/')

from python_dict import main

in_data = {
    'data_table_1': {
        'id': {'type': 'str', 'values': ['1750036']},
        'annual_income': {'type': 'float', 'values': [50000.0]},
        'fico': {'type': 'float', 'values': [676.0]},
        'requested_amount': {'type': 'float', 'values': [30000.0]},
        'create_time': {'type': 'str', 'values': ['2019-06-29 04:34:00']},
        'state': {'type': 'str', 'values': ['TX']},
        # ...
    }
}

out_data, out_errs = main(in_data)
errors = out_errs()  # out_errs is a callable that returns any runtime errors
if len(dict(errors).get('errors', [])) > 0:
    print("Runtime error occurred while running policy:")
    print(str(errors))
else:
    print("Policy Output:")
    print(out_data['final_data'])
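
Since the python_dict engine is meant for API-style, low-latency execution, the same call can be wrapped in a web endpoint. The sketch below uses Flask purely as an illustration; the framework, route, and request/response shapes are assumptions, not part of the artifact.

import sys

sys.path.insert(0, 'policy_1.0.1/')

from flask import Flask, jsonify, request
from python_dict import main

app = Flask(__name__)


@app.route('/score', methods=['POST'])
def score():
    # The request body is expected to already be in the table-ref -> type/values format.
    in_data = request.get_json()
    out_data, out_errs = main(in_data)
    errors = dict(out_errs()).get('errors', [])
    if errors:
        return jsonify({'errors': [str(e) for e in errors]}), 500
    # final_data is returned as-is here; adapt the serialization to the artifact's output shape.
    return jsonify({'output': out_data['final_data']})


if __name__ == '__main__':
    app.run(port=8080)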

Example: Calling pyspark_dataframe artifact

This is an example of how to call the pyspark_dataframe artifact:

import sys
sys.path.insert(0, 'policy_1.0.1/')

from pyspark_dataframe import main

in_data = {
    # 'spark' is an active SparkSession; the input table is read as a Spark DataFrame
    'data_table_1': spark.read.parquet("/data/application_jan2020.parquet")
}

out_data, out_errs = main(in_data)
errors = out_errs()
if len(dict(errors).get('errors', [])) > 0:
    print("Runtime error occurred while running policy:")
    print(str(errors))
else:
    print("Policy Output:")
    print(out_data['final_data'])
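
In batch runs, the output is typically persisted for downstream use. A minimal follow-up sketch, assuming out_data['final_data'] is a Spark DataFrame (as produced by the pyspark_dataframe engine) and that the output path is arbitrary:

# Write the scored output back to storage; the path is illustrative.
out_data['final_data'].write.mode('overwrite').parquet('/data/policy_output_jan2020.parquet')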