Basic Workflow Walkthrough

RadarQC is a package that utilizes a methodology developed by Dr. Alex DesRosiers and Dr. Michael Bell to remove non-meteorological gates from Doppler radar scans by leveraging machine learning techniques. In its current form, it contains functionality to derive a set of features from input radar data, use these features to train a Random Forest classification model, and apply this model to the raw fields contained within the radar scans. It also has some model evaluation ability. The beginning of this guide will walk through a basic workflow to train a model starting from scratch.

API / Function References

Model Configuration

Ronin.ModelConfig (Type)

Struct used to store configuration information for a given model

Required arguments

num_models::Int64

Number of ML models in the model chain. Can be one or more.

model_output_paths::Vector{String}

Vector containing paths to each model in the model chain. Should be the same length as the number of models.

met_probs::Vector{Tuple{Float32,Float32}}

Vector containing the decision range for a gate to be considered meteorological in each model in the chain, in the form (low_threshold, high_threshold). For example, if set to (.9, 1), more than 90% of trees in the random forest must assign a gate a label of meteorological for it to be considered meteorological. The range is exclusive on both ends: for a gate to be classified as non-meteorological, it must have a probability LESS THAN the low threshold, and for a gate to be classified as meteorological, it must have a probability GREATER THAN the high threshold. For multi-pass models, gates between these thresholds (inclusive) are sent on to the next pass.

feature_output_paths::Vector{String}

Vector containing paths representing the locations to output calculated features to for each model in the chain.

input_path::String

Directory containing input radar data

task_mode::String

Whether to obtain feature tasks from a set of input files or user specified vector of strings. Planned to be implemented in a future release. For now, codebase behavior is agnostic to its value.

file_preprocessed::Vector{Bool}

For each model in the chain, contains a boolean value signifying whether the corresponding feature output path has already been processed. If true, will open the file at this path instead of re-calculating input features.
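The three-way decision described for met_probs above can be sketched in a few lines. This is a minimal illustration of the documented thresholds, not Ronin's internal code: the helper `classify_gate` and the example thresholds are hypothetical and exist only for this sketch.

```julia
# Hypothetical helper, not part of Ronin: three-way decision for one gate.
# `p` is the fraction of trees voting "meteorological";
# `bounds` is one (low_threshold, high_threshold) entry of met_probs.
function classify_gate(p::Float32, bounds::Tuple{Float32, Float32})
    low, high = bounds
    if p < low
        return :non_meteorological   # strictly below the low threshold
    elseif p > high
        return :meteorological       # strictly above the high threshold
    else
        return :next_pass            # low <= p <= high: deferred to the next model
    end
end

thresholds = (0.1f0, 0.9f0)          # example values, chosen for illustration
println(classify_gate(0.05f0, thresholds))
println(classify_gate(0.95f0, thresholds))
println(classify_gate(0.9f0, thresholds))
```

A probability exactly equal to either threshold falls into the deferred group, which is what "exclusive on both ends" implies. How deferred gates are resolved in the final pass is not covered by this sketch.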

Optional arguments

Input tasks and weights

The following arguments are only quasi-optional: one of them must be set.

    task_list::Vector{String} = [""]
    task_weights::Vector{Vector} = [[Matrix{Union{Float32, Missing}}(undef, 0,0)]]

Currently only `task_paths` are supported. Contains a vector of the same length as the number of
models, with each entry being the path to a file containing the tasks for the pass. Future plans involve
allowing a user to specify vectors of tasks in `task_list`.

`task_weights` must be a vector of vectors, with the first dimension the same length as the number of models in the
chain. The second dimension must either be 1, containing the default weight matrix `Matrix{Union{Float32, Missing}}(undef, 0,0)`,
or a secondary vector of matrixes, with one matrix for each task in the pass. Sample weight matrixes are defined in RoninConstants.jl

verbose::Bool = true

Whether to print out timing information, etc.

REMOVE_LOW_SIG_QUALITY::Bool = true

Whether to automatically remove gates that do not meet a basic Signal Quality threshold. The variable used to determine this is specified in SIG_QUALITY_VAR.

REMOVE_HIGH_PGG::Bool = true

Whether to automatically remove gates that do not meet a basic PGG threshold

HAS_INTERACTIVE_QC::Bool = false

Whether the radar data has already had interactive QC applied to it

QC_var::String = "VG"

If radar data has interactive QC already applied, the name of a variable that the QC has been applied to

remove_var::String = "VV"

Name of a raw variable in the radar data that can be used to determine the location of missing gates

FILL_VAL::Float32 = RoninConstants.FILL_VAL

Fill value for output cfradials

replace_missing::Bool = false

For spatial feature (AVG, STD, etc.) calculation, whether or not to replace MISSING gates in the mask area with FILL_VAL

write_out::Bool = true

Whether or not to write the calculated input features to disk, at the paths specified in feature_output_paths

QC_mask::Bool = false

For the first model in the chain, whether or not to mask gates considered for feature calculation using a mask specified by mask_names. More details elsewhere in the documentation.

mask_names::Vector{String} = [""]

List of names for masks in the model. Must be of the same length as the number of models in the chain. In the case of a model with QC_mask set to true, the first mask name in this vector should contain a string denoting the name of a field in all cfradial files that is dimensioned the same as the radar sweeps and contains values of missing where data is not to be considered, and values of float otherwise.

VARS_TO_QC::Vector{String} = ["VV", "ZZ"]

List of variables to apply QC to in order to obtain the mask for the next model in the chain

QC_SUFFIX::String

Postfix to apply to variable name once QC has been applied.

class_weights::String = ""

Class weighting scheme to apply in the training of RF model. Currently only "balanced" is implemented.

n_trees::Int = 21

Number of trees in the random forest

max_depth::Int = 14

Maximum depth of any one tree in the random forest

overwrite_output::Bool = false

If true, will remove/overwrite existing files when internal functionality attempts to write new data to them

SIG_QUALITY_THRESHOLD::Float32 = .2

If REMOVE_LOW_SIG_QUALITY is set to true, threshold at or below which to remove data.

PGG_THRESHOLD::Float32 = 1.

If REMOVE_HIGH_PGG is set to true, threshold at or above which to remove data.

SIGNAL_QUALITY_VAR::String = "NCP"

Name of variable in cfradial file representing signal quality. Most commonly "NCP" or "SQI"
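Several of the fields above are documented as holding one entry per model in the chain. The following sketch shows one way to check that before building a configuration. It uses a plain NamedTuple as a stand-in and does not construct a real ModelConfig. The file names, the mask name, and the helper `mismatched_fields` are hypothetical.

```julia
# Hypothetical stand-in for a configuration; field names mirror the
# documentation above, but this NamedTuple is NOT Ronin.ModelConfig.
placeholder = Matrix{Union{Float32, Missing}}(undef, 0, 0)  # default weight matrix

config = (
    num_models = 2,
    model_output_paths = ["pass_1.jld2", "pass_2.jld2"],
    met_probs = [(0.1f0, 0.9f0), (0.1f0, 0.9f0)],
    feature_output_paths = ["features_1.h5", "features_2.h5"],
    file_preprocessed = [false, false],
    mask_names = ["", "PASS_1_MASK"],
    task_weights = [[placeholder], [placeholder]],
)

# Fields documented as having one entry per model in the chain.
per_model_fields = (:model_output_paths, :met_probs, :feature_output_paths,
                    :file_preprocessed, :mask_names, :task_weights)

# Return the names of fields whose length differs from num_models.
function mismatched_fields(cfg, fields)
    return [f for f in fields if length(getproperty(cfg, f)) != cfg.num_models]
end

bad = mismatched_fields(config, per_model_fields)
println(isempty(bad) ? "all per-model fields match num_models" : "mismatched: $bad")
```

Running a check like this first can surface a length mismatch before any feature calculation or training starts.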


Calculating Model Input Features

Ronin.calculate_features (Method)

Function to process a set of cfradial files and produce input features for training/evaluating a model

Required arguments

input_loc::String

Path to input cfradial or directory of input cfradials

argument_file::String

Path to configuration file containing which features to calculate

output_file::String

Path to output calculated features to (generally ends in .h5)

HAS_INTERACTIVE_QC::Bool

Specifies whether or not the file(s) have already undergone an interactive QC procedure. If true, the function will also output a Y array used to verify where interactive QC removed gates. This array is formed by considering where gates with non-missing data in raw scans (specified by remove_variable) are set to missing after QC is performed.

Optional keyword arguments

verbose::Bool=false

If true, will print out timing information as each file is processed

REMOVE_LOW_SIG_QUALITY::Bool=false

If true, will ignore gates with Normalized Coherent Power/Signal Quality Index below a threshold specified in RQCFeatures.jl

SIG_QUALITY_THRESHOLD::Float32 = .2

Threshold at or below which to remove data

SIG_QUALITY_VAR::String = "NCP"

Name of variable containing signal quality parameter

REMOVE_HIGH_PGG::Bool=false

If true, will ignore gates with Probability of Ground Gate (PGG) values at or above a threshold specified in RQCFeatures.jl

PGG_THRESHOLD

Threshold at or above which to remove data

QC_variable::String="VG"

Name of variable in input NetCDF files that has been quality-controlled.

remove_variable::String="VV"

Name of a raw variable in input NetCDF files. Used to determine where missing data exists in the input sweeps. Data at these locations will be removed from the outputted features.

replace_missing::Bool=false

Whether or not to replace MISSING values with FILL_VAL in spatial parameter calculations. Default value: false

write_out::Bool=true

Whether or not to write features out to file

return_idxer::Bool = false

If true, will return IDXER, where IDXER is a

weight_matrixes::Vector{Matrix{Union{Missing, Float32}}} = [(undef, 0,0)]

Vector containing a weight matrix for every task in the argument file. For non-spatial parameters, the weights are discarded, and so dummy/placeholder matrixes may be used.
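The Y array described under HAS_INTERACTIVE_QC can be pictured with a small example. This is a sketch of the documented idea only, not Ronin's implementation: `raw` stands in for the field named by remove_variable, `qced` for the field named by QC_variable, and both are flattened to one dimension for brevity.

```julia
# Illustrative sketch, not Ronin's implementation. `raw` stands in for the
# remove_variable field and `qced` for the QC_variable field of one sweep.
raw  = Union{Missing, Float32}[1.5f0, missing, -3.0f0, 7.2f0]
qced = Union{Missing, Float32}[1.5f0, missing, missing, 7.2f0]

# Only gates with data in the raw field are candidates for features.
indexer = .!ismissing.(raw)

# Among those gates, Y is true where interactive QC retained the gate and
# false where QC set it to missing.
Y = .!ismissing.(qced[indexer])

println(indexer)
println(Y)
```

Here the second gate is missing in the raw field, so it is dropped from the output altogether. The third gate has raw data that interactive QC removed, so it appears in Y as false.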

Ronin.split_training_testing! (Function)

Function to split a given directory or set of directories into training and testing files using the configuration described in DesRosiers and Bell 2023. This function assumes that input directories only contain cfradial files that follow standard naming conventions, and are thus implicitly chronologically ordered. The function operates by first dividing file names into training and testing sets following an 80/20 training/testing split, and subsequently softlinking each file to the training and testing directories. Attempts to avoid temporal autocorrelation while maximizing variance by dividing each case into several different training/testing sections.

An important note: always use absolute paths. Relative paths will cause issues with the symlinks.

Required Arguments:

DIR_PATHS::Vector{String}

List of directories containing cfradials to be used for model training/testing. Useful if input data is split into several different cases.

TRAINING_PATH::String

Directory to softlink files designated for training into.

TESTING_PATH::String

Directory to softlink files designated for testing into.
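The idea of splitting each case into several sections, each with its own 80/20 division, can be sketched as follows. This is a simplified illustration and not the exact procedure from DesRosiers and Bell 2023: the helper `sectioned_split`, the number of sections, the rounding rule, and the file names are all assumptions made for the example.

```julia
# Illustrative only: a simplified sectioned 80/20 split over an ordered
# list of file names. The section count and rounding are assumptions.
function sectioned_split(files::Vector{String}; n_sections::Int = 2, train_frac::Float64 = 0.8)
    ordered = sort(files)                              # standard names sort chronologically
    section_len = max(1, cld(length(ordered), n_sections))
    training, testing = String[], String[]
    for section in Iterators.partition(ordered, section_len)
        n_train = round(Int, train_frac * length(section))
        append!(training, section[1:n_train])          # earlier files in the section
        append!(testing, section[n_train+1:end])       # later files in the section
    end
    return training, testing
end

files = ["cfrad.2020_$(lpad(i, 2, '0')).nc" for i in 1:10]   # hypothetical names
training, testing = sectioned_split(files)
println(length(training), " training / ", length(testing), " testing")
```

Each file would then be linked into the matching directory, for example with `symlink(abspath(file), joinpath(TRAINING_PATH, basename(file)))`, which is why absolute paths matter.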

Ronin.train_model (Method)

Function to train a random forest model using a precalculated set of input and output features (usually output from calculate_features). Returns nothing.

Required arguments

input_h5::String

Location of input features/targets. Input features are expected to have the name "X", and targets the name "Y". This should be taken care of automatically if they are outputs from calculate_features

model_location::String

Path to save the trained model out to. Typically should end in .jld2

Optional keyword arguments

verify::Bool = false

Whether or not to output a separate .h5 file containing the trained model's predictions on the training set (Y_PREDICTED) as well as the targets for the training set (Y_ACTUAL)

verify_out::String="model_verification.h5"

If verify, the location to output this verification to.

col_subset=:

Set of columns from input_h5 to train model on. Useful if one wishes to train a model while excluding some features from a training set.

row_subset=:

Set of rows from input_h5 to train on.

n_trees::Int = 21

Number of trees in the Random Forest ensemble

max_depth::Int = 14

Maximum node depth in each tree in RF ensemble

class_weights::Vector{Float32} = Vector{Float32}([1.,2.])

Vector of class weights to apply to each observation. Should contain one entry per sample in the input data files.
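The `:` defaults for col_subset and row_subset behave like ordinary Julia indexing. The sketch below shows that on a small in-memory matrix. The helper `subset_features` and the data are hypothetical, and the sketch does not read an .h5 file or call train_model.

```julia
# Illustrative sketch: a Colon default selects everything, while an
# explicit index vector or range narrows the training matrix.
X = Float32[1 2 3; 4 5 6; 7 8 9; 10 11 12]   # 4 samples x 3 features

# Colon() is the object that the literal `:` refers to when indexing.
subset_features(X; row_subset = Colon(), col_subset = Colon()) = X[row_subset, col_subset]

println(size(subset_features(X)))                                        # all rows, all columns
println(size(subset_features(X; col_subset = [1, 3])))                   # drop the second feature
println(size(subset_features(X; row_subset = 1:2, col_subset = [1, 3]))) # first two samples only
```

Excluding a feature this way avoids having to regenerate the feature file just to train on fewer columns.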

Ronin.remove_validation (Function)

Function used to remove a given subset of the rows from a feature set so that they may be used for model validation/tuning.

Currently configured to utilize the 90/10 split described in DesRosiers and Bell 2023.

Required arguments

input_dataset::String

Path to h5 files containing model features

Optional keyword arguments

training_output::String = "train_no_validation_set.h5"

Path to output training features with validation removed to

validation_output::String = "validation.h5"

Path to output validation features to

remove_original::Bool = true 

Whether or not to remove the original file described by the input_dataset path.

Ronin.get_feature_importance (Method)

Uses L1 regression with a variety of λ penalty values to determine the most useful features for input to the random forest model.


Required Input


input_file_path::String

Path to .h5 file containing model training features under ["X"] parameter, and model targets under ["Y"] parameter. Also expects the h5 file to contain an attribute known as Parameters containing abbreviations for the feature types

λs::Vector{Float32}

Vector of values used to vary the strength of the penalty term in the regularization.

Optional Keyword Arguments


pred_threshold::Float32

Minimum confidence level for binary classifier when predicting

Returns

Returns a DataFrame with each row containing info about a regression for a specific λ, the values of the regression coefficients for each input feature, and the Root Mean Square Error of the resultant regression.
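To see why sweeping λ helps rank features, it can be useful to look at what an L1 penalty does to a single coefficient. The sketch below uses soft-thresholding, the closed-form effect of an L1 penalty in the idealized case of orthonormal features. It illustrates the general technique and is not the solver used by get_feature_importance. The coefficient values are made up.

```julia
# Soft-thresholding: the closed-form effect of an L1 penalty λ on a single
# coefficient in the idealized case of orthonormal features.
soft_threshold(β::Float32, λ::Float32) = sign(β) * max(abs(β) - λ, 0f0)

β = Float32[0.9, -0.4, 0.05]           # hypothetical unpenalized coefficients
for λ in Float32[0.0, 0.1, 0.5]
    shrunk = soft_threshold.(β, λ)
    println("λ = $λ: ", shrunk, "  nonzero features: ", count(!iszero, shrunk))
end
```

As λ grows, weaker coefficients reach zero first, so the features that stay nonzero at large λ are the ones the regression relies on most.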


Applying a trained model to data and evaluating it

Ronin.QC_scan (Function)

QC_scan(input_cfrad::String, features::Matrix{Float32}, indexer::Vector{Bool}, config::ModelConfig, iter::Int64)

QC_scan(config::ModelConfig)

Applies trained composite model to data within scan or set of scans. Will set gates the model deems to be non-meteorological to MISSING, including gates that do not meet initial basic quality control thresholds. Wrapper around composite_prediction.

Returns: None
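The effect of this step on a single field can be sketched as below. This is an illustration of the documented behavior (non-meteorological gates become missing), not Ronin's implementation. The helper `apply_qc`, the flattened field, and the `keep` mask are hypothetical.

```julia
# Illustrative sketch, not Ronin's implementation: blank out every gate
# that is not flagged as meteorological.
function apply_qc(field::Vector{Union{Missing, Float32}}, keep::AbstractVector{Bool})
    length(field) == length(keep) || throw(DimensionMismatch("field and keep must match"))
    out = copy(field)            # leave the input field untouched
    out[.!keep] .= missing       # gates not kept are set to missing
    return out
end

velocity = Union{Missing, Float32}[12.0f0, -4.5f0, 30.1f0, 2.2f0]
keep = [true, false, false, true]   # true where a gate is deemed meteorological
result = apply_qc(velocity, keep)
println(result)
```

In this sketch `keep` is assumed to already combine the model's classifications with the basic quality control thresholds, so a gate failing either check ends up missing.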

QC_scan(config::ModelConfig, filepath::String, predictions::Vector{Bool}, init_idxer::Vector{Bool})

Internal function to apply QC to a scan specified by `filepath` using the predictions/indexer specified
by `predictions` and `init_idxer`. Generally used in the context of a multi-pass model.

`config::ModelConfig`

Ronin.characterize_misclassified_gates (Function)

characterize_misclassified_gates(config::ModelConfig; model_pretrained::Bool = true, features_precalculated::Bool = true)

Function used to apply composite model to a set of gates, returning information about gate classifications and their associated input features

Required inputs

    config::ModelConfig

Model configuration object containing setup information.

Optional Inputs

model_pretrained::Bool = true

Model training in this function not currently implemented, setting to false with untrained models will result in errors.

features_precalculated::Bool = true

Whether or not the input features for the model have already been written to disk.

Not currently implemented.

Returns

Vector of DataFrames (one DataFrame for each model "pass"). DataFrames will only contain information about gates receiving their final classification during that pass of the model. That is, if a gate exceeds the met_probs thresholds and is not passed on to the next pass, it will be represented in the DataFrame corresponding to that present pass of the model.


Predicting using a composite model

Ronin.train_multi_model (Function)
train_multi_model(config::ModelConfig)

All-in-one function to take in a set of radar data, calculate input features, and train a chain of random forest models for meteorological/non-meteorological gate identification.

Required arguments

config::ModelConfig

Struct containing configuration info for model training

Returns: None

Ronin.composite_prediction (Function)
composite_prediction(config::ModelConfig; write_features_out::Bool=false, feature_outfile::String="placeholder.h5", return_probs::Bool=false)

Passes feature data through a model or series of models and returns model classifications. Applies configuration such as masking and basic QC (high PGG/low NCP) specified by config

Optional keyword arguments

write_predictions_out::Bool = false

If true, will write the predictions to disk

prediction_outfile::String = "model_predictions.h5"

Location to write predictions to on disk

return_probs::Bool = false

If set to true, will return probability of meteorological gate for all gates. More detail below.


QC_mode::Bool = false

If set to true, the function will instead be used to apply quality control to a (set of) scan(s)

Returns

  • predictions::Vector{Bool} Model classifications for gates that passed basic quality control thresholds

  • values::BitVector Verification gates corresponding to predictions

  • init_idxers::Vector{Vector{Float32}} Information about where original radar data did/did not meet basic quality control thresholds. Each vector contains a flattened vector describing whether or not a given gate was predicted on.

  • total_met_probs::Vector{Float32} If keyword argument return_probs is set to true, then `total_met_probs` will be returned. Each entry in this vector corresponds to the gate represented by predictions and values, and denotes the fraction of trees in the random forest that classified the gate as meteorological.

    All values returned will be only those that passed the quality control checks in the first pass of the model (minimum NCP / PGG thresholds). To reconstruct a scan, the user would need to use the values in the returned indexers.
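Reconstructing a scan from the returned values amounts to scattering the predictions back to the positions the indexer marks as predicted on. The sketch below assumes a boolean indexer for a single flattened scan. The helper `reconstruct` is hypothetical, and treating gates that were never predicted on as false is a choice made for this example.

```julia
# Illustrative sketch: scatter per-gate predictions back onto the full,
# flattened scan using a boolean indexer.
function reconstruct(predictions::AbstractVector{Bool}, indexer::AbstractVector{Bool})
    count(indexer) == length(predictions) ||
        throw(DimensionMismatch("need one prediction per true entry of indexer"))
    full = falses(length(indexer))   # gates never predicted on default to false
    full[indexer] = predictions      # fill only the gates that were predicted on
    return full
end

indexer = [true, false, true, true, false]   # which gates passed basic QC
predictions = [true, false, true]            # one entry per true value in indexer
println(reconstruct(predictions, indexer))
```

The resulting vector has one entry per gate in the original scan and can be reshaped to the sweep dimensions if needed.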


Non-user facing

Ronin.get_task_params (Function)

Function to parse a given task list. Also performs checks to ensure that the specified tasks can be performed on the specified CFRad file.


Parses input parameter file for use in outputting feature names to HDF5 file as attributes. NOTE: Cfradial-unaware. If one of the variables is specified incorrectly in the parameter file, will cause errors


Passthrough when tasks are already provided as a vector of strings

Ronin.process_single_file (Function)

Wrapper version of process_single_file that allows the user to specify a vector of weight matrixes. In this case, the tasks to complete are also passed as a vector. The weight_matrixes also implicitly define the window size.

Returns:

-X::Matrix{Float32}: Matrix that is dimensioned (num_gates x num_features) where num_gates is the number of valid (non-missing, meeting NCP/PGG thresholds, non-masked) gates the function finds, and num_features is the number of features specified in the argument file to calculate.

-Y::Matrix{Bool} : IF HAS_INTERACTIVE_QC == true, will return Y, array containing 1 if a datapoint was retained during interactive QC, and 0 otherwise. Dimensioned as (num_gates x 1)

-INDEXER::Vector{Bool} : Based on remove_variable as described above, contains boolean array specifying where the scan contains valid data and where it does not. Will also contain false where values in `feature_mask` are false.


Driver function that calculates a set of features from a single CFRadial file. Features are specified in file located at argfile_path.

Will return a tuple of (X, Y, indexer) where X is the features matrix, Y is a verification matrix recording whether human QC determined each gate to be meteorological (value of 1) or non-meteorological (value of 0), and indexer contains a vector of booleans describing which gates met basic quality control thresholds and thus are represented in the X and Y matrixes.

Weight matrixes are specified in file header, or passed as explicit argument.

Required arguments

cfrad::NCDataset 

Input NCDataset containing radar scan variables

tasks::Vector{String} 

Vector of input features to calculate

Optional keyword arguments

HAS_INTERACTIVE_QC::Bool = false

If the scan has already had a human apply quality control to it, set to true. Otherwise, false

REMOVE_LOW_SIG_QUALITY::Bool = false

Whether or not to ignore gates that do not meet a minimum NCP/SQI threshold. If true, these gates will be set to false in indexer, and features/verification will not be calculated for them.

SIG_QUALITY_THRESHOLD::Float32 = .2

Threshold at or below which to remove data

SIG_QUALITY_VAR

Name of variable in cfradials containing information about signal quality

REMOVE_HIGH_PGG::Bool = false

Whether or not to ignore gates that exceed a given Probability of Ground Gate (PGG) threshold. If true, these gates will be set to false in indexer, and features/verification will not be calculated for them.

PGG_THRESHOLD

Threshold at or above which to remove data

QC_variable::String = "VG"

Name of a variable in input CFRadial file that has had QC applied to it already. Used to calculate verification Y matrix.

remove_variable::String = "VV" 

Name of raw variable in input CFRadial file that will be used to determine where missing gates exist in the sweep.

replace_missing::Bool = false

For spatial parameters, whether or not to replace missing values with FILL_VAL


Returns:

-X::Matrix{Float32}: Matrix that is dimensioned (num_gates x num_features) where num_gates is the number of valid 
    (non-missing, meeting NCP/PGG thresholds, non-masked) gates the function finds, and num_features is the 
    number of features specified in the argument file to calculate. 

-Y::Matrix{Bool} : IF HAS_INTERACTIVE_QC == true, will return Y, array containing 1 if a datapoint was retained 
    during interactive QC, and 0 otherwise. Dimensioned as (num_gates x 1)

-INDEXER::Vector{Bool} : Based on remove_variable as described above, contains boolean array specifying
            where the scan contains valid data and where it does not. Will also contain `false` where 
            values in `feature_mask` are false.
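
The way the indexer combines missing data with the two basic thresholds can be sketched as follows. This is an illustration of the documented rules (signal quality at or below its threshold is removed, PGG at or above its threshold is removed), not Ronin's implementation. The helper `build_indexer` and the sample values are hypothetical. The sketch assumes the signal quality and PGG fields contain no missing values and leaves out `feature_mask`.

```julia
# Illustrative sketch, not Ronin's implementation.
function build_indexer(raw, sig_quality, pgg;
                       sig_quality_threshold::Float32 = 0.2f0,
                       pgg_threshold::Float32 = 1.0f0)
    has_data   = .!ismissing.(raw)                      # gates present in the raw variable
    good_sig   = sig_quality .> sig_quality_threshold   # at or below the threshold is removed
    not_ground = pgg .< pgg_threshold                   # at or above the threshold is removed
    return has_data .& good_sig .& not_ground
end

raw = Union{Missing, Float32}[5.0f0, missing, 3.0f0, 8.0f0]
ncp = Float32[0.9, 0.8, 0.2, 0.7]
pgg = Float32[0.0, 0.0, 0.0, 1.0]
println(build_indexer(raw, ncp, pgg))
```

Only the first gate survives here: the second has no raw data, the third sits exactly at the signal quality threshold, and the fourth sits exactly at the PGG threshold.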