Outlier stages
==============

Outlier stages are used to record the number of outliers present in a
given dataset over a specified duration. The DHIS2 Data Quality App can
be used to identify individual outliers, but with the DQ Workbench, you
can summarize the number of outliers and store this as a data value in
DHIS2. This allows you to track the number of outliers over time, and to
visualize this data in the Data Visualizer app or in dashboards.

Defining a new outlier stage
----------------------------

To define a new stage, press the *DQ monitor* link in the left menu, and
then click *New outlier stage*. You will be presented with a form to
fill in the details of the outlier stage.

The form consists of the following fields:

``Stage name``
   The name of the outlier stage.

``Org unit level``
   This is the level where the outlier analysis will be started. You may
   need to experiment with this setting to find the best fit for your
   data. The outlier analysis will be performed on all org units at the
   specified level and below. If you set the level to 1, a single
   request will be made to the DHIS2 API for all org units. This may
   result in a set of results which is too large to be returned by the
   API. If you set the level to 2, the outlier analysis will make a
   request for each org unit at level 2 (including all of its children).
   Thus, you can set the org unit level at a lower level to increase the
   number of requests made to the DHIS2 API (which in turn should reduce
   the overall number of individual outlier results returned), but this
   will also increase the time it takes to run the outlier analysis.

``Duration``
   The duration of the outlier stage. This is used to determine the start
   and end dates of the outlier analysis relative to the current date.
   The duration can be specified in days, weeks, months, or years with
   the format ``12 months``, ``3 weeks``, ``7 days``, etc. The type of
   period used here should match the period type of the data set used
   for the outlier analysis. For example, if the data set uses a monthly
   period type, you should use a duration of ``12 months`` to analyze the
   last 12 months of data. If you use a duration of ``3 weeks``, the
   outlier analysis will only analyze the last 3 weeks of data, which
   may not be sufficient to detect outliers in a monthly data set.

``Data start date offset``
   The data start date offset is used to determine the start date of the
   data which will be used for the outlier analysis.

``Data end date offset``
   The data end date offset is used to determine the end date of the data
   which will be used for the outlier analysis. You may want to exclude
   the current month from the outlier analysis, in which case you can set
   the data end date offset to ``1 month``. This will ensure that the
   outlier analysis only uses data from the previous months, and not the
   current month. This is useful if you want to analyze the data for the
   previous months and not include the current month, as the data for the
   current month may not be complete yet.

``Data set``
   The data set to use for the outlier analysis.

``Algorithm``
   The algorithm to use for the outlier analysis. The available
   algorithms are:

   - ``MOD_Z_SCORE``: Modified Z-score method.
   - ``MIN_MAX``: Determines outliers based on the minimum and maximum
     values.
   - ``Z_SCORE``: Z-score method.
   - ``INVALID_NUMERIC``: Identifies invalid numeric values as outliers.
     These typically correspond to very large values which cannot be
     cast to numbers in the PostgreSQL database.

``Threshold``
   The threshold to use for the outlier analysis. This parameter is only
   relevant for the ``MOD_Z_SCORE`` and ``Z_SCORE`` algorithms. When the
   Z-score is above the threshold, the value is considered an outlier.

``Destination data element``
   The data element used to store the number of outliers detected by the
   outlier stage.

``Active``
   Whether the outlier stage is active or not. If the outlier stage is
   not active, it will be excluded when running the outlier analysis
   with the command line script.