Model

WekaModel Configuration

ALGORITHM 1. JRip

  1. -F <number of folds>: Set the number of folds for reduced-error pruning (REP); one fold is used as the pruning set. (Default: 3)
  2. -N <min. weights>: Set the minimal weights of instances within a split. (Default: 2.0)
  3. -O <number of runs>: Set the number of optimization runs. (Default: 2)
  4. -D: Turn on debug mode. (Default: false)
  5. -S <seed>: The seed for randomization. (Default: 1)
  6. -E: Do NOT check whether the error rate >= 0.5 in the stopping criterion. (Default: check)
  7. -P: Do NOT use pruning. (Default: use pruning)
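
For reference, the sketch below shows one way these options could be supplied to a JRip classifier through Weka's Java API. The ARFF file name and the class-index choice are placeholders for the example, not part of the platform configuration.

import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class JRipOptionsExample {
    public static void main(String[] args) throws Exception {
        // Load a training set (the file path is a placeholder for this sketch)
        Instances data = DataSource.read("training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Apply the options listed above: 3 folds for REP, minimal weight 2.0,
        // 2 optimization runs, randomization seed 1
        JRip jrip = new JRip();
        jrip.setOptions(Utils.splitOptions("-F 3 -N 2.0 -O 2 -S 1"));
        jrip.buildClassifier(data);

        // Print the learned rule set
        System.out.println(jrip);
    }
}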

ALGORITHM 2. Decision Table

  1. -S <search method specification>: Full class name of the search method, followed by its options, e.g. "weka.attributeSelection.BestFirst -D 1". (Default: weka.attributeSelection.BestFirst)
     1.1. -P <start set>: Specify a starting set of attributes, e.g. 1,3,5-7.
     1.2. -D <0 = backward | 1 = forward | 2 = bi-directional>: Direction of search. (Default: 1)
     1.3. -N <num>: Number of non-improving nodes to consider before terminating the search.
     1.4. -S <num>: Size of the lookup cache for evaluated subsets, expressed as a multiple of the number of attributes in the data set. (Default: 1)
  2. -X <number of folds>: Use cross-validation to evaluate features; use 1 fold for leave-one-out CV. (Default: leave-one-out CV)
  3. -E <acc | rmse | mae | auc>: Performance evaluation measure to use for selecting attributes. (Default: accuracy for a discrete class, RMSE for a numeric class)
  4. -I: Use nearest neighbor instead of the global table majority.
  5. -R: Display decision table rules.
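
Because the -S value is itself a nested option string, quoting matters. The hedged sketch below illustrates how such a specification could be passed through Weka's Java API; the concrete option values are only examples.

import weka.classifiers.rules.DecisionTable;
import weka.core.Utils;

public class DecisionTableOptionsExample {
    public static void main(String[] args) throws Exception {
        DecisionTable table = new DecisionTable();
        // The -S value is a quoted option string: the class name of the search
        // method followed by its own options (forward BestFirst search here),
        // combined with leave-one-out CV, accuracy as the measure, and rule display.
        table.setOptions(Utils.splitOptions(
            "-S \"weka.attributeSelection.BestFirst -D 1 -N 5\" -X 1 -E acc -R"));
        // table.buildClassifier(data) would follow once an Instances object is loaded
    }
}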

ALGORITHM 3. PART

  1. -C <pruning confidence>: Set the confidence threshold for pruning. (Default: 0.25)
  2. -M <minimum number of objects>: Set the minimum number of objects per leaf. (Default: 2)
  3. -R: Use reduced-error pruning.
  4. -N <number of folds>: Set the number of folds for reduced-error pruning; one fold is used as the pruning set. (Default: 3)
  5. -B: Use binary splits only.
  6. -U: Generate an unpruned decision list.
  7. -J: Do not use MDL correction for info gain on numeric attributes.
  8. -Q <seed>: Seed for random data shuffling. (Default: 1)
  9. -doNotMakeSplitPointActualValue: Do not make the split point an actual value.
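
Since PART is later referenced as the default Explained Model, the hedged sketch below shows a PART decision list being built and used to score a single instance with Weka's Java API. The data source and the choice of instance are placeholders, and this is not the platform's own evaluation code.

import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class PartOptionsExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("training.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Reduced-error pruning with 3 folds and at least 2 objects per leaf
        PART part = new PART();
        part.setOptions(Utils.splitOptions("-R -N 3 -M 2 -Q 1"));
        part.buildClassifier(data);

        // The decision list is human-readable, which is what makes rule matching possible
        System.out.println(part);

        // Class distribution for the first training instance (illustration only)
        double[] dist = part.distributionForInstance(data.instance(0));
        System.out.println(java.util.Arrays.toString(dist));
    }
}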

ALGORITHM 4. J48

  1. -U: Use an unpruned tree.
  2. -O: Do not collapse the tree.
  3. -C <pruning confidence>: Set the confidence threshold for pruning. (Default: 0.25)
  4. -M <minimum number of instances>: Set the minimum number of instances per leaf. (Default: 2)
  5. -R: Use reduced-error pruning.
  6. -N <number of folds>: Set the number of folds for reduced-error pruning; one fold is used as the pruning set. (Default: 3)
  7. -B: Use binary splits only.
  8. -S: Do not perform subtree raising.
  9. -L: Do not clean up after the tree has been built.
  10. -A: Laplace smoothing for predicted probabilities.
  11. -J: Do not use MDL correction for info gain on numeric attributes.
  12. -Q <seed>: Seed for random data shuffling. (Default: 1)
  13. -doNotMakeSplitPointActualValue: Do not make the split point an actual value.
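
As an illustration of how these options interact with k-fold cross-validation (see the number-of-folds table later in this section), the sketch below builds a J48 tree and evaluates it with Weka's Evaluation class. The ARFF path, fold count, and seed are assumptions made only for the example.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class J48CrossValidationExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("training.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Confidence-based pruning at 0.25, at least 2 instances per leaf
        J48 j48 = new J48();
        j48.setOptions(Utils.splitOptions("-C 0.25 -M 2"));

        // 5-fold cross-validation, matching the default number of folds
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(j48, data, 5, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}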

ALGORITHM 5. RandomForest

  1. -P: Size of each bag, as a percentage of the training set size. (Default: 100)
  2. -O: Calculate the out-of-bag error.
  3. -store-out-of-bag-predictions: Whether to store out-of-bag predictions in the internal evaluation object.
  4. -output-out-of-bag-complexity-statistics: Whether to output complexity-based statistics when out-of-bag evaluation is performed.
  5. -print: Print the individual classifiers in the output.
  6. -attribute-importance: Compute and output attribute importance (mean impurity decrease method).
  7. -I <num>: Number of iterations, i.e. the number of trees in the random forest. (Current value: 100)
  8. -num-slots <num>: Number of execution slots. (Default: 1, i.e. no parallelism; use 0 to auto-detect the number of cores)
  9. -K <number of attributes>: Number of attributes to randomly investigate. (Default: 0; values < 1 mean int(log_2(#predictors) + 1))
  10. -M <minimum number of instances>: Set the minimum number of instances per leaf. (Default: 1)
  11. -V <minimum variance for split>: Set the minimum numeric-class variance proportion of train variance for a split. (Default: 1e-3)
  12. -S <num>: Seed for the random number generator. (Default: 1)
  13. -depth <num>: The maximum depth of the tree; 0 for unlimited. (Default: 0)
  14. -N <num>: Number of folds for backfitting. (Default: 0, no backfitting)
  15. -U: Allow unclassified instances.
  16. -B: Break ties randomly when several attributes look equally good.
  17. -output-debug-info: If set, the classifier is run in debug mode and may output additional info to the console.
  18. -do-not-check-capabilities: If set, classifier capabilities are not checked before the classifier is built. (Use with caution)
  19. -num-decimal-places: The number of decimal places for the output of numbers in the model. (Default: 2)
  20. -batch-size: The desired batch size for batch prediction. (Default: 100)
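
The dash-style options above compose the same way as the single-letter ones. The sketch below is a hedged example of configuring a RandomForest with 100 trees, auto-detected execution slots, and attribute importance; the data-loading details mirror the earlier sketches and are placeholders.

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestOptionsExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("training.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // 100 trees, auto-detect cores, default attribute subset size, fixed seed,
        // and mean-impurity-decrease attribute importance
        RandomForest forest = new RandomForest();
        forest.setOptions(Utils.splitOptions(
            "-I 100 -num-slots 0 -K 0 -S 1 -attribute-importance"));
        forest.buildClassifier(data);

        // The model's string output includes attribute importance when it was computed
        System.out.println(forest);
    }
}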

ALGORITHM 6. RandomTree

  1. -K <number of attributes>: Number of attributes to randomly investigate. (Default: 0; values < 1 mean int(log_2(#predictors) + 1))
  2. -M <minimum number of instances>: Set the minimum number of instances per leaf. (Default: 1)
  3. -V <minimum variance for split>: Set the minimum numeric-class variance proportion of train variance for a split. (Default: 1e-3)
  4. -S <num>: Seed for the random number generator. (Default: 1)
  5. -depth <num>: The maximum depth of the tree; 0 for unlimited. (Default: 0)
  6. -N <num>: Number of folds for backfitting. (Default: 0, no backfitting)
  7. -U: Allow unclassified instances.
  8. -B: Break ties randomly when several attributes look equally good.
  9. -output-debug-info: If set, the classifier is run in debug mode and may output additional info to the console.
  10. -do-not-check-capabilities: If set, classifier capabilities are not checked before the classifier is built. (Use with caution)
  11. -num-decimal-places: The number of decimal places for the output of numbers in the model. (Default: 2)

Custom Model Configuration

ALGORITHM 7. Custom Model

The following table details the essential fields and their descriptions for configuring a custom model. This configuration applies to Python-based custom models that need to communicate with a Java environment using JSON format for data exchange.

  1. Module Name: The file name of the main module, with its extension, in the custom model that you upload. This module must include the specified Method Name, which acts as the entry point for data evaluation. For example, if your custom model has a file named model.py containing a function evaluate_data, then model.py should be specified here.

  2. Method Name: The name of the method in the main module that will process or evaluate the data. This method should be implemented to receive input data in JSON format, perform the necessary computations or evaluations, and return the results in JSON format. Ensure the method name matches exactly as defined in the code (case-sensitive). For example, if the method that handles the data is called evaluate_data, it should be specified here.

  3. Custom Files:

  • Upload all the files associated with your custom model to the HOST_SERVER. These may include scripts, configuration files, and a requirements.txt listing the dependencies required for the model to function correctly.

  • Important note: when re-uploading files, the new upload overwrites any existing files in the custom model. To avoid functionality issues, include all necessary files in the new upload, such as Python packages or external dependencies.

  • The Python model should accept input and return output in JSON format, enabling it to interact seamlessly with Java applications. The input JSON will typically contain the data needed by the model, and the output JSON will contain the results of the evaluation.

JSON Input/Output Standard

To ensure smooth communication between Python and Java, both environments must agree on the format of the input and output data. Below are the specifications for the JSON format:

  1. Input JSON Structure: The input to the Python model should be a well-structured JSON object containing the data to be processed. The structure will vary depending on the specific requirements of the model but must follow a standard pattern so that the Java environment can easily generate and pass it to the Python model. For example:

    {
            "param1": "value1",
            "param2": "value2"
    }
    
  2. Output JSON Structure: After the Python model processes the data, it will return the results in a JSON object format. The output will need to follow a consistent structure so the Java application can handle the results properly. The output fields must exactly match the field names used in the evaluation datasource in the database, ensuring that the results can be correctly mapped to the database.

    • CLASS: The output class field is required and must be included. The field name must match exactly (case-sensitive) the CLASS field used in the evaluation datasource.

    • probability and matchingRule: These fields are optional and should only be included if the model’s logic supports them. If included, their field names must also match exactly (case-sensitive) with the names used in the evaluation datasource.

    For example:

    {
            "CLASS": "result_class_value",
            "probability": 0.85,
            "matchingRule": "some_rule"
    }
    

Example Python Code to Handle JSON Input/Output:

Here is an example of how the Python code can be structured to handle JSON input and output:

import json

def evaluate_data(input_json):
    # Parse the incoming JSON data
    input_data = json.loads(input_json)

    # Perform model computations (example); a real model would derive these
    # values from input_data
    result = {
        "CLASS": "result_class_value",  # Mandatory field
        "probability": 0.85,            # Optional field, only if logic supports it
        "matchingRule": "some_rule"     # Optional field, only if logic supports it
    }

    # Convert result to JSON and return it
    return json.dumps(result, indent=4)

This code ensures that the output JSON includes the CLASS field, as well as the optional probability and matchingRule fields, if they are part of the model’s logic. The output field names must match exactly (case-sensitive) the names used in the evaluation table in the database.
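
For completeness, the following hedged sketch shows how a Java consumer might read such an output string while respecting the case-sensitive field names described above. It uses the org.json library purely for illustration; the platform's actual JSON handling is not specified here.

import org.json.JSONObject;

public class CustomModelOutputReader {
    public static void main(String[] args) {
        // Example output string returned by the Python method (illustrative only)
        String output =
            "{ \"CLASS\": \"result_class_value\", \"probability\": 0.85, \"matchingRule\": \"some_rule\" }";

        JSONObject json = new JSONObject(output);

        // CLASS is mandatory and case-sensitive
        String outputClass = json.getString("CLASS");

        // probability and matchingRule are optional; fall back to neutral defaults
        double probability = json.optDouble("probability", Double.NaN);
        String matchingRule = json.optString("matchingRule", null);

        System.out.println(outputClass + " / " + probability + " / " + matchingRule);
    }
}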

Model Testing Size and Number of Folds

  • Custom Model: Testing Size and Training Size are not applicable (N/A).

  • WekaModel: Number of folds for k-fold cross-validation. (Default: 5)

Conversion formula (convert testingSize to the number of folds for k-fold cross-validation):

      numberOfFolds = Math.round(1 / testingSize);

For example, a testing size of 0.2 gives Math.round(1 / 0.2) = 5 folds, which is the default.

Model Important Parameters

  • isBestModel (Boolean): Denotes the model with the highest accuracy; the best model is the one whose accuracy value is closest to 1.

  • isSelectedModel (Boolean): Refers to the model selected for calculating the probability or distance of an instance to the model. By default, the Best Model is also the Selected Model.

  • isExplainedModel (Boolean): Represents the model used to identify the matching or nearest rule for a given instance. By default, the Explained Model is the PART algorithm.

Model Train and Test Configurations

  • isTrained (Boolean): Indicates whether the model will be trained. By default, all Weka models are set to be trained (isTrained = true).

  • isTested (Boolean): Specifies whether the model will undergo testing. Testing evaluates the model's performance (e.g., accuracy, correct/incorrect predictions, and totals). By default, all Weka models are set to be tested (isTested = true). (Note: a model can only be tested if it has been trained.)

Note

By default, any uploaded custom model has its isTrained and isTested parameters set; however, the Ditto Hybrid Platform does not perform training or testing on these models. All custom models must be explicitly trained and tested before being uploaded to the server.