Basic Workflow Walkthrough
RadarQC is a package that utilizes a methodology developed by Dr. Alex DesRosiers and Dr. Michael Bell to remove non-meteorological gates from Doppler radar scans by leveraging machine learning techniques. In its current form, it contains functionality to derive a set of features from input radar data, use these features to train a Random Forest classification model, and apply this model to the raw fields contained within the radar scans. It also has some model evaluation ability. The beginning of this guide will walk through a basic workflow to train a model starting from scratch.
API / Function References
Model Configuration
Ronin.ModelConfig — Type
Struct used to store configuration information for a given model
Required arguments
num_models::Int64: Number of ML models in the model chain. Can be one or more.
model_output_paths::Vector{String}: Vector containing paths to each model in the model chain. Should be the same length as the number of models.
met_probs::Vector{Tuple{Float32,Float32}}: Vector containing the decision range for a gate to be considered meteorological in each model in the chain. For example, if set to (.9, 1), more than 90% of trees in the random forest must assign a gate a label of meteorological for it to be considered meteorological. The range is exclusive on both ends. That is, for a gate to be classified as non-meteorological, it must have a probability LESS THAN the low threshold, and for a gate to be classified as meteorological it must have a probability GREATER THAN the high threshold. For multi-pass models, gates between these thresholds (inclusive) will be sent on to the next pass. Form is (low_threshold, high_threshold).
feature_output_paths::Vector{String}: Vector containing paths representing the locations to output calculated features to for each model in the chain.
input_path::String: Directory containing input radar data
task_mode::String: Whether to obtain feature tasks from a set of input files or a user-specified vector of strings. Planned to be implemented in a future release. For now, codebase behavior is agnostic to its value.
file_preprocessed::Vector{Bool}: For each model in the chain, contains a boolean value signifying whether the corresponding feature output path has already been processed. If true, will open the file at this path instead of re-calculating input features.
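The met_probs decision rule described above can be sketched in plain Julia. This is a minimal illustration, not Ronin's implementation: `classify_gate` and the sample thresholds are hypothetical, and the source does not say what happens to in-between gates on the final pass, so the sketch simply labels them `:next_pass`.

```julia
# Hypothetical helper (not part of Ronin): label one gate from the fraction of
# trees that voted "meteorological" and a (low, high) met_probs tuple.
function classify_gate(p::Float32, met_prob::Tuple{Float32,Float32})
    low, high = met_prob
    if p < low
        return :non_meteorological   # strictly below the low threshold
    elseif p > high
        return :meteorological       # strictly above the high threshold
    else
        return :next_pass            # between the thresholds, inclusive
    end
end

# Assumed thresholds and probabilities, for illustration only.
met_prob = (0.1f0, 0.9f0)
probs = Float32[0.05, 0.1, 0.5, 0.9, 0.95]
labels = [classify_gate(p, met_prob) for p in probs]
println(labels)
```

Gates whose probability equals a threshold exactly are neither strictly below nor strictly above it, so they fall into the inclusive middle band.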
Optional arguments
Input tasks and weights
The following arguments are only quasi-optional: one of them must be set.
task_list::Vector{String} = [""]
task_weights::Vector{Vector} = [[Matrix{Union{Float32, Missing}}(undef, 0,0)]]
Currently only `task_paths` is supported. It contains a vector of the same length as the number of models, with each entry being the path to a file containing the tasks for that pass. Future plans involve allowing a user to specify vectors of tasks in `task_list`.
`task_weights` must be a vector of vectors, with the first dimension the same length as the number of models in the chain. The second dimension must either have length 1, containing the default weight matrix `Matrix{Union{Float32, Missing}}(undef, 0,0)`, or be a secondary vector of matrixes, one matrix for each task in the pass. Sample weight matrixes are defined in RoninConstants.jl.
verbose::Bool = true: Whether to print out timing information, etc.
REMOVE_LOW_SIG_QUALITY::Bool = true: Whether to automatically remove gates that do not meet a basic Signal Quality threshold. The variable used to determine this is specified in SIG_QUALITY_VAR.
REMOVE_HIGH_PGG::Bool = true: Whether to automatically remove gates that do not meet a basic PGG threshold
HAS_INTERACTIVE_QC::Bool = false: Whether the radar data has already had interactive QC applied to it
QC_var::String = "VG": If radar data has interactive QC already applied, the name of a variable that the QC has been applied to
remove_var::String = "VV": Name of a raw variable in the radar data that can be used to determine the location of missing gates
FILL_VAL::Float32 = RoninConstants.FILL_VAL: Fill value for output cfradials
replace_missing::Bool = false: For spatial feature (AVG, STD, etc.) calculation, whether or not to replace MISSING gates in the mask area with FILL_VAL
write_out::Bool = true: Whether or not to write the calculated input features to disk, at the paths specified in feature_output_paths
QC_mask::Bool = false: For the first model in the chain, whether or not to mask gates considered for feature calculation using a mask specified by mask_names. More details are given elsewhere in the documentation.
mask_names::Vector{String} = [""]: List of names for masks in the model. Must be of the same length as the number of models in the chain. In the case of a model with QC_mask set to true, the first mask name in this vector should be the name of a field in all cfradial files that is dimensioned the same as the radar sweeps and contains missing where data is not to be considered, and floating-point values otherwise.
VARS_TO_QC::Vector{String} = ["VV", "ZZ"]: List of variables to apply QC to in order to obtain the mask for the next model in the chain
QC_SUFFIX::String: Suffix to append to a variable name once QC has been applied.
class_weights::String = "": Class weighting scheme to apply in the training of the RF model. Currently only "balanced" is implemented.
n_trees::Int = 21: Number of trees in the random forest
max_depth::Int = 14: Maximum depth of any one tree in the random forest
overwrite_output::Bool = false: If true, will remove/overwrite existing files when internal functionality attempts to write new data to them
SIG_QUALITY_THRESHOLD::Float32 = .2: If REMOVE_LOW_SIG_QUALITY is set to true, threshold at or below which to remove data.
PGG_THRESHOLD::Float32 = 1.: If REMOVE_HIGH_PGG is set to true, threshold at or above which to remove data.
SIGNAL_QUALITY_VAR::String = "NCP": Name of variable in cfradial file representing signal quality. Most commonly "NCP" or "SQI"
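The nesting required for `task_weights` can be hard to picture, so here is a small sketch using only Base Julia. The two-pass layout, the 3x3 uniform window, the task counts, and the `weights_shape_ok` helper are all assumptions made for illustration; real weight matrixes would come from RoninConstants.jl or the user.

```julia
# Element type of the default (placeholder) weight matrix shown above.
const W = Matrix{Union{Float32, Missing}}

# Single-pass chain relying on the default placeholder.
default_weights = [[W(undef, 0, 0)]]

# Hypothetical two-pass chain: pass 1 keeps the placeholder, pass 2 supplies
# one 3x3 uniform window per task (two tasks assumed for this pass).
uniform3 = W(fill(1f0, 3, 3))
two_pass_weights = [[W(undef, 0, 0)], [uniform3, uniform3]]

# Hypothetical check (not part of Ronin) of the shape rule described above:
# one inner vector per model, each either the lone placeholder or one matrix per task.
function weights_shape_ok(weights, num_models::Int, tasks_per_pass::Vector{Int})
    length(weights) == num_models || return false
    for (w, n) in zip(weights, tasks_per_pass)
        is_placeholder = length(w) == 1 && size(w[1]) == (0, 0)
        (is_placeholder || length(w) == n) || return false
    end
    return true
end

println(weights_shape_ok(two_pass_weights, 2, [2, 2]))
```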
Calculating Model Input Features
Ronin.calculate_features — Method
Function to process a set of cfradial files and produce input features for training/evaluating a model
Required arguments
input_loc::String: Path to input cfradial or directory of input cfradials
argument_file::String: Path to configuration file containing which features to calculate
output_file::String: Path to output calculated features to (generally ends in .h5)
HAS_INTERACTIVE_QC::Bool: Specifies whether or not the file(s) have already undergone an interactive QC procedure. If true, the function will also output a Y array used to verify where interactive QC removed gates. This array is formed by considering where gates with non-missing data in raw scans (specified by remove_variable) are set to missing after QC is performed.
Optional keyword arguments
verbose::Bool=false: If true, will print out timing information as each file is processed
REMOVE_LOW_SIG_QUALITY::Bool=false: If true, will ignore gates with Normalized Coherent Power/Signal Quality Index below a threshold specified in RQCFeatures.jl
SIG_QUALITY_THRESHOLD::Float32 = .2: Threshold at or below which to remove data
SIG_QUALITY_VAR::String = "NCP": Name of variable containing signal quality parameter
REMOVE_HIGH_PGG::Bool=false: If true, will ignore gates with Probability of Ground Gate (PGG) values at or above a threshold specified in RQCFeatures.jl
PGG_THRESHOLD: Threshold at or above which to remove data
QC_variable::String="VG": Name of variable in input NetCDF files that has been quality-controlled.
remove_variable::String="VV": Name of a raw variable in input NetCDF files. Used to determine where missing data exists in the input sweeps. Data at these locations will be removed from the outputted features.
replace_missing::Bool=false: Whether or not to replace MISSING values with FILL_VAL in spatial parameter calculations
write_out::Bool=true: Whether or not to write features out to file
return_idxer::Bool = false: If true, will return IDXER, where IDXER is a
weight_matrixes::Vector{Matrix{Union{Missing, Float32}}} = [(undef, 0,0)]: Vector containing a weight matrix for every task in the argument file. For non-spatial parameters, the weights are discarded, and so dummy/placeholder matrixes may be used.
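How the Y array and the gate indexer relate to the raw and quality-controlled variables can be shown on a toy sweep. This is a sketch under stated assumptions, not Ronin's code: the arrays are made-up stand-ins for `remove_variable`, `QC_variable`, and the signal quality field, and the PGG check is omitted for brevity.

```julia
# Toy flattened sweep; `missing` marks gates with no data. All values are illustrative.
raw = Union{Float32, Missing}[1.0, 2.0, missing, 4.0, 5.0]          # stands in for remove_variable ("VV")
qc  = Union{Float32, Missing}[1.0, missing, missing, 4.0, missing]  # stands in for QC_variable ("VG")
ncp = Float32[0.9, 0.8, 0.7, 0.1, 0.6]                              # signal quality per gate
threshold = 0.2f0                                                   # data at or below this is removed

# A gate is kept when raw data exists and its signal quality is above the threshold.
indexer = .!ismissing.(raw) .& (ncp .> threshold)

# Y is true where interactive QC retained the gate and false where it removed it,
# evaluated only on the gates that were kept.
Y = .!ismissing.(qc[indexer])

println(indexer)
println(Y)
```

In this toy case the fourth gate has raw data but fails the signal quality check, so it is dropped from both the indexer and Y.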
Ronin.split_training_testing! — Function
Function to split a given directory or set of directories into training and testing files using the configuration described in DesRosiers and Bell 2023. This function assumes that input directories only contain cfradial files that follow standard naming conventions, and are thus implicitly chronologically ordered. The function operates by first dividing file names into training and testing sets following an 80/20 training/testing split, and subsequently softlinking each file to the training and testing directories. Attempts to avoid temporal autocorrelation while maximizing variance by dividing each case into several different training/testing sections.
An important note: always use absolute paths, as relative paths will cause issues with the symlinks.
Required Arguments:
DIR_PATHS::Vector{String}: List of directories containing cfradials to be used for model training/testing. Useful if input data is split into several different cases.
TRAINING_PATH::String: Directory to softlink files designated for training into.
TESTING_PATH::String: Directory to softlink files designated for testing into.
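The idea of splitting each case into several training/testing sections can be sketched as follows. The exact sectioning Ronin uses is not given here, so `split_case`, the number of sections, and the file names are assumptions chosen only to illustrate an 80/20 split over chronologically sorted names.

```julia
# Hypothetical helper (not part of Ronin): cut a sorted file list into contiguous
# blocks and send the tail of each block to the testing set.
function split_case(files::Vector{String}; n_sections::Int = 3, train_frac::Float64 = 0.8)
    sorted = sort(files)                      # standard names sort chronologically
    training, testing = String[], String[]
    block_len = max(1, cld(length(sorted), n_sections))
    for block in Iterators.partition(sorted, block_len)
        n_train = round(Int, train_frac * length(block))
        append!(training, block[1:n_train])
        append!(testing, block[n_train+1:end])
    end
    return training, testing
end

# Made-up file names for illustration.
files = ["cfrad.$(lpad(i, 3, '0')).nc" for i in 1:30]
training, testing = split_case(files)
println(length(training), " training files, ", length(testing), " testing files")
```

Each name in the two lists could then be linked into the target directory with Base's `symlink(target, link)`, using absolute paths as noted above.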
Ronin.train_model — Method
Function to train a random forest model using a precalculated set of input and output features (usually output from calculate_features). Returns nothing.
Required arguments
input_h5::String: Location of input features/targets. Input features are expected to have the name "X", and targets the name "Y". This should be taken care of automatically if they are outputs from calculate_features.
model_location::String: Path to save the trained model out to. Typically should end in .jld2
Optional keyword arguments
verify::Bool = false: Whether or not to output a separate .h5 file containing the trained model's predictions on the training set (Y_PREDICTED) as well as the targets for the training set (Y_ACTUAL)
verify_out::String="model_verification.h5": If verify is true, the location to output this verification to.
col_subset = `:`: Set of columns from input_h5 to train the model on. Useful if one wishes to train a model while excluding some features from a training set.
row_subset = `:`: Set of rows from input_h5 to train on.
n_trees::Int = 21: Number of trees in the Random Forest ensemble
max_depth::Int = 14: Maximum node depth in each tree in the RF ensemble
class_weights::Vector{Float32} = Vector{Float32}([1.,2.]): Vector of class weights to apply to each observation. Should contain one entry per sample in the input data files.
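A per-observation weight vector of the kind `class_weights` expects could be built as below. This sketch uses the common "balanced" heuristic, n_samples / (n_classes * class_count); whether Ronin's "balanced" scheme uses exactly this formula is an assumption, and `balanced_weights` is a hypothetical helper.

```julia
# Hypothetical helper (not part of Ronin): one weight per observation, with each
# class weighted inversely to its frequency.
function balanced_weights(y::AbstractVector{Bool})
    n = length(y)
    n_met = count(y)          # meteorological gates (true)
    n_non = n - n_met         # non-meteorological gates (false)
    (n_met > 0 && n_non > 0) || error("both classes must be present")
    w_met = Float32(n / (2 * n_met))
    w_non = Float32(n / (2 * n_non))
    return [yi ? w_met : w_non for yi in y]
end

# Toy targets: three meteorological gates and one non-meteorological gate.
y = Bool[1, 1, 1, 0]
w = balanced_weights(y)
println(w)
```

With this heuristic the weights sum to the number of samples, so the rarer class contributes as much total weight as the more common one.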
Ronin.remove_validation — Function
Function used to remove a given subset of the rows from a feature set so that they may be used for model validation/tuning.
Currently configured to utilize the 90/10 split described in DesRosiers and Bell 2023.
Required arguments
input_dataset::String: Path to h5 files containing model features
Optional keyword arguments
training_output::String = "train_no_validation_set.h5": Path to output training features with validation removed to
validation_output::String = "validation.h5": Path to output validation features to
remove_original::Bool = true: Whether or not to remove the original file described by the input_dataset path.
Ronin.get_feature_importance — Method
Uses L1 regression with a variety of λ penalty values to determine the most useful features for
input to the random forest model.
Required Input
input_file_path::String: Path to .h5 file containing model training features under the ["X"] parameter, and model targets under the ["Y"] parameter. Also expects the h5 file to contain an attribute known as Parameters containing abbreviations for the feature types.
λs::Vector{Float32}: Vector of values used to vary the strength of the penalty term in the regularization.
Optional Keyword Arguments
pred_threshold::Float32: Minimum confidence level for binary classifier when predicting
Returns
Returns a DataFrame with each row containing info about a regression for a specific λ, the values of the regression coefficients for each input feature, and the Root Mean Square Error of the resultant regression.
Applying a trained model to data and evaluating it
Ronin.QC_scan — Function
QC_scan(input_cfrad::String, features::Matrix{Float32}, indexer::Vector{Bool}, config::ModelConfig, iter::Int64)
QC_scan(config::ModelConfig)
Applies trained composite model to data within a scan or set of scans. Will set gates the model deems to be non-meteorological to MISSING, including gates that do not meet initial basic quality control thresholds. Wrapper around composite_prediction.
Returns: None
QC_scan(config::ModelConfig, filepath::String, predictions::Vector{Bool}, init_idxer::Vector{Bool})
Internal function to apply QC to a scan specified by `filepath` using the predictions/indexer specified
by `predictions` and `init_idxer`. Generally used in the context of a multi-pass model.
`config::ModelConfig`
Ronin.characterize_misclassified_gates — Function
characterize_misclassified_gates(config::ModelConfig; model_pretrained::Bool = true, features_precalculated::Bool = true)
Function used to apply composite model to a set of gates, returning information about gate classifications and their associated input features
Required inputs
config::ModelConfig: Model configuration object containing setup information.
Optional Inputs
model_pretrained::Bool = true: Model training in this function is not currently implemented; setting this to false with untrained models will result in errors.
features_precalculated::Bool = true: Whether or not the input features for the model have already been written to disk.
Not currently implemented.
Returns
Vector of DataFrames (one DataFrame for each model "pass"). DataFrames will only contain information about gates receiving their final classification during that pass of the model. That is, if a gate exceeds the met_probs thresholds and is not passed on to the next pass, it will be represented in the DataFrame corresponding to that present pass of the model.
Predicting using a composite model
Ronin.train_multi_model — Function
train_multi_model(config::ModelConfig)
All-in-one function to take in a set of radar data, calculate input features, and train a chain of random forest models for meteorological/non-meteorological gate identification.
Required arguments
config::ModelConfig: Struct containing configuration info for model training
Returns: None
Ronin.composite_prediction — Function
composite_prediction(config::ModelConfig; write_features_out::Bool=false, feature_outfile::String="placeholder.h5", return_probs::Bool=false)
Passes feature data through a model or series of models and returns model classifications. Applies configuration such as masking and basic QC (high PGG/low NCP) specified by config.
Optional keyword arguments
write_predictions_out::Bool = false: If true, will write the predictions to disk
prediction_outfile::String = "model_predictions.h5": Location to write predictions to on disk
return_probs::Bool = false: If set to true, will return probability of meteorological gate for all gates. More detail below.
QC_mode::Bool = false: If set to true, the function will instead be used to apply quality control to a (set of) scan(s)
Returns
predictions::Vector{Bool}: Model classifications for gates that passed basic quality control thresholds
values::BitVector: Verification gates corresponding to predictions
init_idxers::Vector{Vector{Float32}}: Information about where original radar data did/did not meet basic quality control thresholds. Each vector contains a flattened vector describing whether or not a given gate was predicted on.
total_met_probs::Vector{Float32}: If keyword argument return_probs is set to true, then `total_met_probs` will be returned. Each entry in this vector corresponds to the gate represented by predictions and values, and denotes the fraction of trees in the random forest that classified the gate as meteorological.
All values returned will be only those that passed the quality control checks in the first pass of the model (minimum NCP / PGG thresholds). To reconstruct a scan, the user needs the values in the returned indexers.
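Reconstructing a scan from the returned indexer can be sketched with logical indexing. The values below are toy stand-ins for composite_prediction output, `expand_predictions` is a hypothetical helper, and the indexer is treated as a Bool vector even though the documented element type is Float32. The sweep dimensions and flattening order used in the final reshape are also assumptions.

```julia
# Toy stand-ins for composite_prediction output.
indexer = Bool[1, 0, 1, 1, 0, 1]     # true where a gate passed basic QC and was predicted on
predictions = Bool[1, 0, 1, 1]       # one entry per `true` in indexer

# Hypothetical helper (not part of Ronin): scatter predictions back onto the full
# flattened scan, leaving `missing` where no prediction was made.
function expand_predictions(predictions::AbstractVector{Bool}, indexer::AbstractVector{Bool})
    count(indexer) == length(predictions) ||
        throw(DimensionMismatch("predictions must have one entry per true value in indexer"))
    full = Vector{Union{Bool, Missing}}(missing, length(indexer))
    full[indexer] = predictions
    return full
end

full = expand_predictions(predictions, indexer)

# Assumed sweep shape (3 x 2) and column-major flattening, purely for illustration.
grid = reshape(full, 3, 2)
println(grid)
```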
Non-user facing
Ronin.get_task_params — Function
Function to parse a given task list. Also performs checks to ensure that the specified tasks can be performed on the specified CFRad file.
Parses input parameter file for use in outputting feature names to HDF5 file as attributes. NOTE: Cfradial-unaware. If one of the variables is specified incorrectly in the parameter file, will cause errors
Passthrough when tasks are already provided as a vector of strings
Ronin.process_single_file — Function
Wrapper version of process_single_file that allows the user to specify a vector of weight matrixes. In this case, the tasks to complete are also passed as a vector. The weight_matrixes also implicitly define the window size.
Returns:
-X::Matrix{Float32}: Matrix that is dimensioned (num_gates x num_features) where num_gates is the number of valid (non-missing, meeting NCP/PGG thresholds, non-masked) gates the function finds, and num_features is the number of features specified in the argument file to calculate.
-Y::Matrix{Bool}: If HAS_INTERACTIVE_QC == true, will return Y, an array containing 1 if a datapoint was retained during interactive QC, and 0 otherwise. Dimensioned as (num_gates x 1)
-INDEXER::Vector{Bool}: Based on remove_variable as described above, contains a boolean array specifying where the scan contains valid data and where it does not. Will also contain false where values in `feature_mask` are false.
Driver function that calculates a set of features from a single CFRadial file. Features are specified in the file located at argfile_path.
Returns a tuple (X, Y, indexer), where X is the feature matrix, Y is a matrix containing the verification
(1 where human QC determined the gate was meteorological, 0 where it was non-meteorological),
and indexer is a vector of booleans describing which gates met basic quality control thresholds and are thus represented in the X and Y matrixes.
Weight matrixes are specified in the file header, or passed as an explicit argument.
Required arguments
cfrad::NCDataset: Input NCDataset containing radar scan variables
tasks::Vector{String}: Vector of input features to calculate
Optional keyword arguments
HAS_INTERACTIVE_QC::Bool = false: If the scan has already had a human apply quality control to it, set to true. Otherwise, false
REMOVE_LOW_SIG_QUALITY::Bool = false: Whether or not to ignore gates that do not meet a minimum NCP/SQI threshold. If true, these gates will be set to false in indexer, and features/verification will not be calculated for them.
SIG_QUALITY_THRESHOLD::Float32 = .2: Threshold at or below which to remove data
SIG_QUALITY_VAR: Name of variable in cfradials containing information about signal quality
REMOVE_HIGH_PGG::Bool = false: Whether or not to ignore gates that exceed a given Probability of Ground Gate (PGG) threshold. If true, these gates will be set to false in indexer, and features/verification will not be calculated for them.
PGG_THRESHOLD: Threshold at or above which to remove data
QC_variable::String = "VG": Name of a variable in the input CFRadial file that has had QC applied to it already. Used to calculate the verification Y matrix.
remove_variable::String = "VV": Name of a raw variable in the input CFRadial file that will be used to determine where missing gates exist in the sweep.
replace_missing::Bool = false: For spatial parameters, whether or not to replace missing values with FILL_VAL
Returns:
-X::Matrix{Float32}: Matrix that is dimensioned (num_gates x num_features) where num_gates is the number of valid
(non-missing, meeting NCP/PGG thresholds, non-masked) gates the function finds, and num_features is the
number of features specified in the argument file to calculate.
-Y::Matrix{Bool}: If HAS_INTERACTIVE_QC == true, will return Y, an array containing 1 if a datapoint was retained
during interactive QC, and 0 otherwise. Dimensioned as (num_gates x 1)
-INDEXER::Vector{Bool}: Based on remove_variable as described above, contains a boolean array specifying
where the scan contains valid data and where it does not. Will also contain `false` where
values in `feature_mask` are false.