Model

WekaModel Configuration

ALGORITHM 1. JRip

  1. -F <number of folds>: Set the number of folds for reduced-error pruning (REP); one fold is used as the pruning set. (Default: 3)
  2. -N <min. weights>: Set the minimal weights of instances within a split. (Default: 2.0)
  3. -O <number of runs>: Set the number of optimization runs. (Default: 2)
  4. -D: Turn on debug mode. (Default: false)
  5. -S <seed>: The seed for randomization. (Default: 1)
  6. -E: Do NOT check whether the error rate >= 0.5 in the stopping criterion. (Default: check)
  7. -P: Do NOT use pruning. (Default: use pruning)
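
For reference, the sketch below shows one way these options could be supplied to a JRip classifier through Weka's Java API. The ARFF file name and the class-index choice are placeholders for the example, not part of the platform configuration.

import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class JRipOptionsExample {
    public static void main(String[] args) throws Exception {
        // Load a training set (the file path is a placeholder for this sketch)
        Instances data = DataSource.read("training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Apply the options listed above: 3 folds for REP, minimal weight 2.0,
        // 2 optimization runs, randomization seed 1
        JRip jrip = new JRip();
        jrip.setOptions(Utils.splitOptions("-F 3 -N 2.0 -O 2 -S 1"));
        jrip.buildClassifier(data);

        // Print the learned rule set
        System.out.println(jrip);
    }
}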

ALGORITHM 2. Decision Table

  1. -S <search method specification>: Full class name of the search method, followed by its options, e.g. "weka.attributeSelection.BestFirst -D 1". (Default: weka.attributeSelection.BestFirst)
     1.1. -P <start set>: Specify a starting set of attributes, e.g. 1,3,5-7.
     1.2. -D <0 = backward | 1 = forward | 2 = bi-directional>: Direction of search. (Default: 1)
     1.3. -N <num>: Number of non-improving nodes to consider before terminating the search.
     1.4. -S <num>: Size of the lookup cache for evaluated subsets, expressed as a multiple of the number of attributes in the data set. (Default: 1)
  2. -X <number of folds>: Use cross-validation to evaluate features; use 1 fold for leave-one-out CV. (Default: leave-one-out CV)
  3. -E <acc | rmse | mae | auc>: Performance evaluation measure to use for selecting attributes. (Default: accuracy for a discrete class, RMSE for a numeric class)
  4. -I: Use nearest neighbor instead of the global table majority.
  5. -R: Display decision table rules.
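
Because the -S value is itself a nested option string, quoting matters. The hedged sketch below illustrates how such a specification could be passed through Weka's Java API; the concrete option values are only examples.

import weka.classifiers.rules.DecisionTable;
import weka.core.Utils;

public class DecisionTableOptionsExample {
    public static void main(String[] args) throws Exception {
        DecisionTable table = new DecisionTable();
        // The -S value is a quoted option string: the class name of the search
        // method followed by its own options (forward BestFirst search here),
        // combined with leave-one-out CV, accuracy as the measure, and rule display.
        table.setOptions(Utils.splitOptions(
            "-S \"weka.attributeSelection.BestFirst -D 1 -N 5\" -X 1 -E acc -R"));
        // table.buildClassifier(data) would follow once an Instances object is loaded
    }
}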

ALGORITHM 3. PART

  1. -C <pruning confidence>: Set the confidence threshold for pruning. (Default: 0.25)
  2. -M <minimum number of objects>: Set the minimum number of objects per leaf. (Default: 2)
  3. -R: Use reduced-error pruning.
  4. -N <number of folds>: Set the number of folds for reduced-error pruning; one fold is used as the pruning set. (Default: 3)
  5. -B: Use binary splits only.
  6. -U: Generate an unpruned decision list.
  7. -J: Do not use MDL correction for info gain on numeric attributes.
  8. -Q <seed>: Seed for random data shuffling. (Default: 1)
  9. -doNotMakeSplitPointActualValue: Do not make the split point an actual value.
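
Since PART is later referenced as the default Explained Model, the hedged sketch below shows a PART decision list being built and used to score a single instance with Weka's Java API. The data source and the choice of instance are placeholders, and this is not the platform's own evaluation code.

import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class PartOptionsExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("training.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Reduced-error pruning with 3 folds and at least 2 objects per leaf
        PART part = new PART();
        part.setOptions(Utils.splitOptions("-R -N 3 -M 2 -Q 1"));
        part.buildClassifier(data);

        // The decision list is human-readable, which is what makes rule matching possible
        System.out.println(part);

        // Class distribution for the first training instance (illustration only)
        double[] dist = part.distributionForInstance(data.instance(0));
        System.out.println(java.util.Arrays.toString(dist));
    }
}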

ALGORITHM 4. J48

  1. -U: Use an unpruned tree.
  2. -O: Do not collapse the tree.
  3. -C <pruning confidence>: Set the confidence threshold for pruning. (Default: 0.25)
  4. -M <minimum number of instances>: Set the minimum number of instances per leaf. (Default: 2)
  5. -R: Use reduced-error pruning.
  6. -N <number of folds>: Set the number of folds for reduced-error pruning; one fold is used as the pruning set. (Default: 3)
  7. -B: Use binary splits only.
  8. -S: Do not perform subtree raising.
  9. -L: Do not clean up after the tree has been built.
  10. -A: Laplace smoothing for predicted probabilities.
  11. -J: Do not use MDL correction for info gain on numeric attributes.
  12. -Q <seed>: Seed for random data shuffling. (Default: 1)
  13. -doNotMakeSplitPointActualValue: Do not make the split point an actual value.
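
As an illustration of how these options interact with k-fold cross-validation (see the number-of-folds table later in this section), the sketch below builds a J48 tree and evaluates it with Weka's Evaluation class. The ARFF path, fold count, and seed are assumptions made only for the example.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class J48CrossValidationExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("training.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Confidence-based pruning at 0.25, at least 2 instances per leaf
        J48 j48 = new J48();
        j48.setOptions(Utils.splitOptions("-C 0.25 -M 2"));

        // 5-fold cross-validation, matching the default number of folds
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(j48, data, 5, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}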

ALGORITHM 5. RandomForest

  1. -P: Size of each bag, as a percentage of the training set size. (Default: 100)
  2. -O: Calculate the out-of-bag error.
  3. -store-out-of-bag-predictions: Whether to store out-of-bag predictions in the internal evaluation object.
  4. -output-out-of-bag-complexity-statistics: Whether to output complexity-based statistics when out-of-bag evaluation is performed.
  5. -print: Print the individual classifiers in the output.
  6. -attribute-importance: Compute and output attribute importance (mean impurity decrease method).
  7. -I <num>: Number of iterations, i.e. the number of trees in the random forest. (Current value: 100)
  8. -num-slots <num>: Number of execution slots. (Default: 1, i.e. no parallelism; use 0 to auto-detect the number of cores)
  9. -K <number of attributes>: Number of attributes to randomly investigate. (Default: 0; values < 1 mean int(log_2(#predictors) + 1))
  10. -M <minimum number of instances>: Set the minimum number of instances per leaf. (Default: 1)
  11. -V <minimum variance for split>: Set the minimum numeric-class variance proportion of train variance for a split. (Default: 1e-3)
  12. -S <num>: Seed for the random number generator. (Default: 1)
  13. -depth <num>: The maximum depth of the tree; 0 for unlimited. (Default: 0)
  14. -N <num>: Number of folds for backfitting. (Default: 0, no backfitting)
  15. -U: Allow unclassified instances.
  16. -B: Break ties randomly when several attributes look equally good.
  17. -output-debug-info: If set, the classifier is run in debug mode and may output additional info to the console.
  18. -do-not-check-capabilities: If set, classifier capabilities are not checked before the classifier is built. (Use with caution)
  19. -num-decimal-places: The number of decimal places for the output of numbers in the model. (Default: 2)
  20. -batch-size: The desired batch size for batch prediction. (Default: 100)
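
The dash-style options above compose the same way as the single-letter ones. The sketch below is a hedged example of configuring a RandomForest with 100 trees, auto-detected execution slots, and attribute importance; the data-loading details mirror the earlier sketches and are placeholders.

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestOptionsExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("training.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // 100 trees, auto-detect cores, default attribute subset size, fixed seed,
        // and mean-impurity-decrease attribute importance
        RandomForest forest = new RandomForest();
        forest.setOptions(Utils.splitOptions(
            "-I 100 -num-slots 0 -K 0 -S 1 -attribute-importance"));
        forest.buildClassifier(data);

        // The model's string output includes attribute importance when it was computed
        System.out.println(forest);
    }
}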

ALGORITHM 6. RandomTree

  1. -K <number of attributes>: Number of attributes to randomly investigate. (Default: 0; values < 1 mean int(log_2(#predictors) + 1))
  2. -M <minimum number of instances>: Set the minimum number of instances per leaf. (Default: 1)
  3. -V <minimum variance for split>: Set the minimum numeric-class variance proportion of train variance for a split. (Default: 1e-3)
  4. -S <num>: Seed for the random number generator. (Default: 1)
  5. -depth <num>: The maximum depth of the tree; 0 for unlimited. (Default: 0)
  6. -N <num>: Number of folds for backfitting. (Default: 0, no backfitting)
  7. -U: Allow unclassified instances.
  8. -B: Break ties randomly when several attributes look equally good.
  9. -output-debug-info: If set, the classifier is run in debug mode and may output additional info to the console.
  10. -do-not-check-capabilities: If set, classifier capabilities are not checked before the classifier is built. (Use with caution)
  11. -num-decimal-places: The number of decimal places for the output of numbers in the model. (Default: 2)

Custom Model Configuration

ALGORITHM 7. Custom Model

The following table details the essential fields and their descriptions for configuring a custom model. This configuration applies to Python-based custom models that need to communicate with a Java environment using JSON format for data exchange.

  1. Module Name: The file name of the main module, with its extension, in the custom model that you upload. This module must include the specified Method Name, which acts as the entry point for data evaluation. For example, if your custom model has a file named model.py containing a function evaluate_data, then model.py should be specified here.

  2. Method Name: The name of the method in the main module that will process or evaluate the data. This method should be implemented to receive input data in JSON format, perform the necessary computations or evaluations, and return the results in JSON format. Ensure the method name matches exactly as defined in the code (case-sensitive). For example, if the method that handles the data is called evaluate_data, it should be specified here.

  3. Custom Files:

  • Upload all the files associated with your custom model to the HOST_SERVER. These may include scripts, configuration files, and a requirements.txt listing the dependencies required for the model to function correctly.

  • Important note: when re-uploading files, the new upload overwrites any existing files in the custom model. To avoid functionality issues, include all necessary files in the new upload, such as Python packages or external dependencies.

  • The Python model should accept input and return output in JSON format, enabling it to interact seamlessly with Java applications. The input JSON will typically contain the data needed by the model, and the output JSON will contain the results of the evaluation.

JSON Input/Output Standard

To ensure smooth communication between Python and Java, both environments must agree on the format of the input and output data. Below are the specifications for the JSON format:

  1. Input JSON Structure: The input to the Python model should be a well-structured JSON object containing the data to be processed. The structure will vary depending on the specific requirements of the model but must follow a standard pattern so that the Java environment can easily generate and pass it to the Python model. For example:

    {
            "param1": "value1",
            "param2": "value2"
    }
    
  2. Output JSON Structure: After the Python model processes the data, it will return the results in a JSON object format. The output will need to follow a consistent structure so the Java application can handle the results properly. The output fields must exactly match the field names used in the evaluation datasource in the database, ensuring that the results can be correctly mapped to the database.

    • CLASS: The output class field is required and must be included. The field name must match exactly (case-sensitive) the CLASS field used in the evaluation datasource.

    • probability and matchingRule: These fields are optional and should only be included if the model’s logic supports them. If included, their field names must also match exactly (case-sensitive) with the names used in the evaluation datasource.

    For example:

    {
            "CLASS": "result_class_value",
            "probability": 0.85,
            "matchingRule": "some_rule"
    }
    

Example Python Code to Handle JSON Input/Output:

Here is an example of how the Python code can be structured to handle JSON input and output:

import json

def evaluate_data(input_json):
    # Parse the incoming JSON data
    input_data = json.loads(input_json)

    # Perform model computations (example); a real model would derive these
    # values from input_data
    result = {
        "CLASS": "result_class_value",  # Mandatory field
        "probability": 0.85,            # Optional field, only if logic supports it
        "matchingRule": "some_rule"     # Optional field, only if logic supports it
    }

    # Convert result to JSON and return it
    return json.dumps(result, indent=4)

This code ensures that the output JSON includes the CLASS field, as well as the optional probability and matchingRule fields, if they are part of the model’s logic. The output field names must match exactly (case-sensitive) the names used in the evaluation table in the database.
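
For completeness, the following hedged sketch shows how a Java consumer might read such an output string while respecting the case-sensitive field names described above. It uses the org.json library purely for illustration; the platform's actual JSON handling is not specified here.

import org.json.JSONObject;

public class CustomModelOutputReader {
    public static void main(String[] args) {
        // Example output string returned by the Python method (illustrative only)
        String output =
            "{ \"CLASS\": \"result_class_value\", \"probability\": 0.85, \"matchingRule\": \"some_rule\" }";

        JSONObject json = new JSONObject(output);

        // CLASS is mandatory and case-sensitive
        String outputClass = json.getString("CLASS");

        // probability and matchingRule are optional; fall back to neutral defaults
        double probability = json.optDouble("probability", Double.NaN);
        String matchingRule = json.optString("matchingRule", null);

        System.out.println(outputClass + " / " + probability + " / " + matchingRule);
    }
}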

Model Testing Size and Number of Folds

  • Custom Model: Testing Size and Training Size are not applicable (N/A).

  • WekaModel: Number of folds for k-fold cross-validation. (Default: 5)

Conversion formula (convert testingSize to the number of folds for k-fold cross-validation):

      numberOfFolds = Math.round(1 / testingSize);

For example, a testing size of 0.2 gives Math.round(1 / 0.2) = 5 folds, which is the default.

Model Important Parameters

  • isBestModel (Boolean): Denotes the model with the highest accuracy; the best model is the one whose accuracy value is closest to 1.

  • isSelectedModel (Boolean): Refers to the model selected for calculating the probability or distance of an instance to the model. By default, the Best Model is also the Selected Model.

  • isExplainedModel (Boolean): Represents the model used to identify the matching or nearest rule for a given instance. By default, the Explained Model is the PART algorithm.

Model Train and Test Configurations

  • isTrained (Boolean): Indicates whether the model will be trained. By default, all Weka models are set to be trained (isTrained = true).

  • isTested (Boolean): Specifies whether the model will undergo testing. Testing evaluates the model's performance (e.g., accuracy, correct/incorrect predictions, and totals). By default, all Weka models are set to be tested (isTested = true). (Note: a model can only be tested if it has been trained.)

Note

By default, any uploaded custom model has its isTrained and isTested parameters set; however, the Ditto Hybrid Platform does not perform training or testing on these models. All custom models must be explicitly trained and tested before being uploaded to the server.