validation

`measurement.validation` ¤

Classes¤

`Measurement` ¤

Bases: MeasurementBase

Class to store the measurements and their validation status.

This class holds the value of a given variable and station at a specific time, as well as auxiliary information such as the maximum and minimum values, the depth and, for vector quantities, the direction. Each of these has a raw version that keeps a backup of the original data, should it change at any point.

It also includes flags to monitor its validation status, whether the data is active (and can therefore be used for reporting) and whether it has actually been used for reporting.

Attributes:

Name	Type	Description
`depth`	`int`	Depth of the measurement.
`direction`	`Decimal`	Direction of the measurement, useful for vector quantities.
`raw_value`	`Decimal`	Original value of the measurement.
`raw_maximum`	`Decimal`	Original maximum value of the measurement.
`raw_minimum`	`Decimal`	Original minimum value of the measurement.
`raw_direction`	`Decimal`	Original direction of the measurement.
`raw_depth`	`int`	Original depth of the measurement.
`is_validated`	`bool`	Flag to indicate if the measurement has been validated.
`is_active`	`bool`	Flag to indicate if the measurement is active. An inactive measurement is not used for reporting.

Attributes¤

`overwritten: bool` `property` ¤

Indicates if any of the values associated to the entry have been overwritten.

Returns:

Name	Type	Description
`bool`	`bool`	True if any raw field is different to the corresponding standard field.

`raws: tuple[str, ...]` `property` ¤

Return the raw fields of the measurement.

Returns:

Type	Description
`tuple[str, ...]`	Tuple with the names of the raw fields of the measurement.
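
The implementation of these two properties is not shown on this page. As a minimal sketch of the behaviour described above, assuming a plain dataclass in place of the Django model and only two value/raw pairs (`MeasurementSketch` is a hypothetical name, not part of the package):

```python
from dataclasses import dataclass, fields
from decimal import Decimal
from typing import Optional


@dataclass
class MeasurementSketch:
    """Hypothetical stand-in for the model, reduced to two value/raw pairs."""

    value: Optional[Decimal] = None
    maximum: Optional[Decimal] = None
    raw_value: Optional[Decimal] = None
    raw_maximum: Optional[Decimal] = None

    @property
    def raws(self) -> tuple[str, ...]:
        # Names of every field whose name starts with "raw_".
        return tuple(f.name for f in fields(self) if f.name.startswith("raw_"))

    @property
    def overwritten(self) -> bool:
        # True if any raw field differs from its standard counterpart.
        return any(
            getattr(self, raw) != getattr(self, raw.removeprefix("raw_"))
            for raw in self.raws
        )


entry = MeasurementSketch(value=Decimal("1.5"), raw_value=Decimal("1.5"))
unchanged = entry.overwritten  # value and raw_value still match
entry.value = Decimal("2.0")
changed = entry.overwritten  # value now differs from raw_value
```

In this sketch `unchanged` is `False` and `changed` is `True`; the real model may compute either property differently.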

Functions¤

`clean()` ¤

Check consistency of validation and reporting, and back up values.

Source code in measurement\models.py

def clean(self) -> None:
    """Check consistency of validation and reporting, and back up values."""
    # Check consistency of validation
    if not self.is_validated and not self.is_active:
        raise ValidationError("Only validated entries can be declared as inactive.")

    # Backup values to raws, if needed
    for r in self.raws:
        value = getattr(self, r.removeprefix("raw_"))
        if value and not getattr(self, r):
            setattr(self, r, value)
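
The backup loop above can be tried in isolation. The following is a sketch only, assuming a `SimpleNamespace` in place of the model and a hypothetical `RAWS` tuple listing two of the raw fields. It shows that a raw field is filled once and then left alone, and that a falsy value such as `Decimal("0")` is not copied by the truthiness check:

```python
from decimal import Decimal
from types import SimpleNamespace

# Assumed subset of the raw field names, for illustration only.
RAWS = ("raw_value", "raw_maximum")


def backup_to_raws(entry) -> None:
    """Copy each standard field to its raw field if the raw field is empty."""
    for r in RAWS:
        value = getattr(entry, r.removeprefix("raw_"))
        if value and not getattr(entry, r):
            setattr(entry, r, value)


entry = SimpleNamespace(
    value=Decimal("12.3"),
    maximum=Decimal("0"),
    raw_value=None,
    raw_maximum=None,
)
backup_to_raws(entry)
first_backup = entry.raw_value  # copied from value
zero_backup = entry.raw_maximum  # Decimal("0") is falsy, so nothing is copied

entry.value = Decimal("15.0")  # a later correction of the value
backup_to_raws(entry)
second_backup = entry.raw_value  # the original backup is preserved
```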

`Variable` ¤

Bases: PermissionsBase

A variable with a physical meaning.

Examples include precipitation, wind speed, wind direction and soil moisture. Each variable includes the associated unit, as well as metadata that helps identify what is a reasonable value for the data, flag outliers and support the validation process.

The nature of the variable can be one of the following:

sum: Cumulative value over a period of time.
average: Average value over a period of time.
value: One-off value.

Attributes:

Name	Type	Description
`variable_id`	`AutoField`	Primary key.
`variable_code`	`CharField`	Code of the variable, e.g. airtemperature.
`name`	`CharField`	Human-readable name of the variable, e.g. Air temperature.
`unit`	`ForeignKey`	Unit of the variable.
`maximum`	`DecimalField`	Maximum value allowed for the variable.
`minimum`	`DecimalField`	Minimum value allowed for the variable.
`diff_error`	`DecimalField`	If two sequential values in the time-series data of this variable differ by more than this value, the validation process can mark this with an error flag.
`outlier_limit`	`DecimalField`	The statistical deviation for defining outliers, in times the standard deviation (sigma).
`null_limit`	`DecimalField`	The max % of null values (missing, caused by e.g. equipment malfunction) allowed for hourly, daily, monthly data. Cumulative values are not deemed trustworthy if the number of missing values in a given period is greater than the null_limit.
`nature`	`CharField`	Nature of the variable, e.g. if it represents a one-off value, the average over a period of time or the cumulative value over a period.

Attributes¤

`is_cumulative: bool` `property` ¤

Return True if the nature of the variable is sum.

Functions¤

`__str__()` ¤

Return the string representation of the object.

Source code in variable\models.py

def __str__(self) -> str:
    """Return the string representation of the object."""
    return str(self.name)

`clean()` ¤

Validate the model fields.

Source code in variable\models.py

def clean(self) -> None:
    """Validate the model fields."""
    if self.maximum < self.minimum:
        raise ValidationError(
            {
                "maximum": "The maximum value must be greater than the minimum "
                "value."
            }
        )
    if not self.variable_code.isidentifier():
        raise ValidationError(
            {
                "variable_code": "The variable code must be a valid Python "
                "identifier. Only letters, numbers and underscores are allowed, and"
                " it cannot start with a number."
            }
        )
    return super().clean()
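
The identifier check relies on the built-in `str.isidentifier()`. A short illustration with made-up candidate codes shows which ones would pass. Note that `isidentifier()` also accepts Python keywords such as `class`, so the check is slightly more permissive than the error message suggests:

```python
# Hypothetical candidate codes, chosen only to illustrate the check.
candidates = [
    "airtemperature",
    "wind_speed",
    "2m_temperature",
    "air-temperature",
    "soil moisture",
    "class",
]

# Map each candidate to whether it is a valid Python identifier.
validity = {code: code.isidentifier() for code in candidates}

accepted = [code for code, ok in validity.items() if ok]
rejected = [code for code, ok in validity.items() if not ok]
```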

`get_absolute_url()` ¤

Get the absolute URL of the object.

Source code in variable\models.py

def get_absolute_url(self) -> str:
    """Get the absolute URL of the object."""
    return reverse("variable:variable_detail", kwargs={"pk": self.pk})

Functions¤

`flag_suspicious_daily_count(data, null_limit)` ¤

Finds suspicious records count for daily data.

Parameters:

Name	Type	Description	Default
`data`	`Series`	The count of records per day.	required
`null_limit`	`Decimal`	The percentage of null data allowed.	required

Returns:

Type	Description
`DataFrame`	A dataframe with the suspicious data.

Source code in measurement\validation.py

def flag_suspicious_daily_count(data: pd.Series, null_limit: Decimal) -> pd.DataFrame:
    """Finds suspicious records count for daily data.

    Args:
        data: The count of records per day.
        null_limit: The percentage of null data allowed.
    Returns:
        A dataframe with the suspicious data.
    """
    expected_data_count = data.mode().iloc[0]

    suspicious = pd.DataFrame(index=data.index)
    suspicious["daily_count_fraction"] = (data / expected_data_count).round(2)

    suspicious["suspicious_daily_count"] = (
        suspicious["daily_count_fraction"] < 1 - float(null_limit) / 100
    ) | (suspicious["daily_count_fraction"] > 1)

    return suspicious
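
To make the arithmetic concrete, here is a plain-Python restatement of the same rule. It is a sketch, not the pandas implementation above: it assumes the counts have a single most common value and uses `statistics.mode` for it. The function name and the sample counts are made up.

```python
from decimal import Decimal
from statistics import mode


def suspicious_daily_count(counts: list[int], null_limit: Decimal) -> list[tuple[float, bool]]:
    """Return (fraction, suspicious) for each daily count."""
    expected = mode(counts)  # most common count is taken as the expected one
    threshold = 1 - float(null_limit) / 100
    result = []
    for count in counts:
        fraction = round(count / expected, 2)
        # Too few records (below the threshold) or too many (above 1) is suspicious.
        result.append((fraction, fraction < threshold or fraction > 1))
    return result


# Five days of 5-minute data (288 records expected per day), null_limit of 10 %.
flags = suspicious_daily_count([288, 288, 288, 250, 300], Decimal("10"))
```

With these inputs the day with 250 records has a fraction of 0.87, below the 0.9 threshold, and the day with 300 records has a fraction of 1.04, above 1, so both are flagged.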

`flag_suspicious_data(data, maximum, minimum, allowed_difference)` ¤

Finds suspicious data in the database.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	The dataframe with the data to be evaluated.	required
`maximum`	`Decimal`	The maximum allowed value.	required
`minimum`	`Decimal`	The minimum allowed value.	required
`allowed_difference`	`Decimal`	The allowed difference between the measurements.	required

Returns:

Type	Description
`DataFrame`	A dataframe with the suspicious data.

Source code in measurement\validation.py

def flag_suspicious_data(
    data: pd.DataFrame,
    maximum: Decimal,
    minimum: Decimal,
    allowed_difference: Decimal,
) -> pd.DataFrame:
    """Finds suspicious data in the database.

    Args:
        data: The dataframe with the data to be evaluated.
        maximum: The maximum allowed value.
        minimum: The minimum allowed value.
        allowed_difference: The allowed difference between the measurements.

    Returns:
        A dataframe with the suspicious data.
    """
    time_lapse = flag_time_lapse_status(data)
    value_difference = flag_value_difference(data, allowed_difference)
    value_limits = flag_value_limits(data, maximum, minimum)
    return pd.concat([time_lapse, value_difference, value_limits], axis=1)

`flag_time_lapse_status(data)` ¤

Flags if period of the time entries is correct.

It is assumed that the first entry is correct. A tolerance of 2% of the period is used when deciding on the suspicious status. The period is the mode of the time differences.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	The dataframe with the data.	required

Returns:

Type	Description
`Series`	A series with the status of the time lapse.

Source code in measurement\validation.py

def flag_time_lapse_status(data: pd.DataFrame) -> pd.Series:
    """Flags if period of the time entries is correct.

    It is assumed that the first entry is correct. A tolerance of 2% of the period
    is used when deciding on the suspicious status. The period is the mode of the
    time differences.

    Args:
        data: The dataframe with the data.

    Returns:
        A series with the status of the time lapse.
    """
    period = data.time.diff().mode().iloc[0]
    flags = pd.DataFrame(index=data.index, columns=["suspicious_time_lapse"])
    low = pd.Timedelta(f"{period}min") * (1 - 0.02)
    high = pd.Timedelta(f"{period}min") * (1 + 0.02)
    flags["suspicious_time_lapse"] = ~data.time.diff().between(
        low, high, inclusive="both"
    )
    flags["suspicious_time_lapse"].iloc[0] = False
    return flags
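
The tolerance rule described above can be restated without pandas. The sketch below uses standard-library `datetime` objects and a hypothetical helper name; it mirrors the description (mode of the differences, 2% tolerance, first entry never flagged) rather than reproducing the function line by line.

```python
from datetime import datetime, timedelta
from statistics import mode


def suspicious_time_lapse(times: list[datetime], tolerance: float = 0.02) -> list[bool]:
    """Flag entries whose gap to the previous entry is outside the tolerance."""
    diffs = [later - earlier for earlier, later in zip(times, times[1:])]
    period = mode(diffs)  # most common gap is taken as the period
    low = period * (1 - tolerance)
    high = period * (1 + tolerance)
    # The first entry has no previous entry and is assumed to be correct.
    return [False] + [not (low <= diff <= high) for diff in diffs]


start = datetime(2024, 1, 1)
# Entries every 5 minutes, with one missing record between 00:10 and 00:20.
times = [start + timedelta(minutes=m) for m in (0, 5, 10, 20, 25)]
flags = suspicious_time_lapse(times)
```

Only the entry at 00:20 is flagged, because its 10-minute gap falls outside the 4.9 to 5.1 minute window.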

`flag_value_difference(data, allowed_difference)` ¤

Flags if the difference in value between consecutive measurements is within the allowed limit.

It is assumed that the first entry is correct.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	The dataframe with the data.	required
`allowed_difference`	`Decimal`	The allowed difference between the measurements.	required

Returns:

Type	Description
`Series`	A series with the status of the value.

Source code in measurement\validation.py

def flag_value_difference(data: pd.DataFrame, allowed_difference: Decimal) -> pd.Series:
    """Flags if the difference in value between consecutive measurements is within
    the allowed limit.

    It is assumed that the first entry is correct.

    Args:
        data: The dataframe with the data.
        allowed_difference: The allowed difference between the measurements.

    Returns:
        A series with the status of the value.
    """
    flags = pd.DataFrame(index=data.index, columns=["suspicious_value_difference"])
    flags["suspicious_value_difference"] = data["value"].diff().abs() > float(
        allowed_difference
    )
    flags["suspicious_value_difference"].iloc[0] = False
    return flags
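
As a small worked example of the rule, the following plain-Python sketch compares each value with the previous one. The helper name and the sample values are made up, and the pandas implementation above remains the reference.

```python
from decimal import Decimal


def suspicious_value_difference(values: list[float], allowed_difference: Decimal) -> list[bool]:
    """Flag entries that jump by more than the allowed difference."""
    limit = float(allowed_difference)
    flags = [False]  # the first entry is assumed to be correct
    for previous, current in zip(values, values[1:]):
        flags.append(abs(current - previous) > limit)
    return flags


# Only the jump from 10.5 to 14.0 exceeds the allowed difference of 2.
flags = suspicious_value_difference([10.0, 10.5, 14.0, 13.8], Decimal("2"))
```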

`flag_value_limits(data, maximum, minimum)` ¤

Flags if the values, maxima and minima of the measurements are outside the allowed limits.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	The dataframe with the data.	required
`maximum`	`Decimal`	The maximum allowed value.	required
`minimum`	`Decimal`	The minimum allowed value.	required

Returns:

Type	Description
`DataFrame`	A dataframe with suspicious columns indicating a problem.

Source code in measurement\validation.py

def flag_value_limits(
    data: pd.DataFrame, maximum: Decimal, minimum: Decimal
) -> pd.DataFrame:
    """Flags if the values, maxima and minima of the measurements are outside the
    allowed limits.

    Args:
        data: The dataframe with the data.
        maximum: The maximum allowed value.
        minimum: The minimum allowed value.

    Returns:
        A dataframe with suspicious columns indicating a problem.
    """
    flags = pd.DataFrame(index=data.index)
    flags["suspicious_value_limits"] = (data["value"] < minimum) | (
        data["value"] > maximum
    )
    if "maximum" in data.columns:
        flags["suspicious_maximum_limits"] = (data["maximum"] < minimum) | (
            data["maximum"] > maximum
        )
    if "minimum" in data.columns:
        flags["suspicious_minimum_limits"] = (data["minimum"] < minimum) | (
            data["minimum"] > maximum
        )
    return flags
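
The comparison uses strict inequalities, so a value exactly equal to a limit is not flagged. A minimal sketch with a hypothetical helper and made-up values:

```python
from decimal import Decimal


def outside_limits(values: list[Decimal], maximum: Decimal, minimum: Decimal) -> list[bool]:
    """Flag values strictly below the minimum or strictly above the maximum."""
    return [value < minimum or value > maximum for value in values]


values = [Decimal("-5"), Decimal("12"), Decimal("35"), Decimal("40")]
# 35 sits exactly on the maximum and is therefore accepted.
flags = outside_limits(values, maximum=Decimal("35"), minimum=Decimal("0"))
```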

`generate_daily_summary(data, suspicious, null_limit, is_cumulative)` ¤

Generates a daily report of the data.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	The dataframe with the data to be evaluated.	required
`suspicious`	`DataFrame`	The dataframe with the suspicious data.	required
`null_limit`	`Decimal`	The percentage of null data allowed.	required
`is_cumulative`	`bool`	If the data is cumulative and should be aggregated by sum.	required

Returns:

Type	Description
`DataFrame`	A dataframe with the daily report.

Source code in measurement\validation.py

def generate_daily_summary(
    data: pd.DataFrame,
    suspicious: pd.DataFrame,
    null_limit: Decimal,
    is_cumulative: bool,
) -> pd.DataFrame:
    """Generates a daily report of the data.

    Args:
        data: The dataframe with the data to be evaluated.
        suspicious: The dataframe with the suspicious data.
        null_limit: The percentage of null data allowed.
        is_cumulative: If the data is cumulative and should be aggregated by sum.

    Returns:
        A dataframe with the daily report.
    """
    report = pd.DataFrame(index=data.time.dt.date.unique())

    # Group the data by day and calculate the mean or sum
    datagroup = data.groupby(data.time.dt.date)
    report["value"] = (
        datagroup["value"].sum() if is_cumulative else datagroup["value"].mean()
    )

    if "maximum" in data.columns:
        report["maximum"] = datagroup["maximum"].max()
    if "minimum" in data.columns:
        report["minimum"] = datagroup["minimum"].min()

    # Count the number of entries per day and flag the suspicious ones
    count_report = flag_suspicious_daily_count(datagroup["value"].count(), null_limit)

    # Group the suspicious data by day and calculate the sum
    suspiciousgroup = suspicious.groupby(data.time.dt.date)
    suspicious_report = suspiciousgroup.sum().astype(int)
    suspicious_report["total_suspicious_entries"] = suspicious_report.sum(axis=1)

    # Put together the final report
    report = pd.concat([report, suspicious_report, count_report], axis=1)
    report = report.sort_index().reset_index().rename(columns={"index": "date"})
    report.date = pd.to_datetime(report.date)
    return report
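
The effect of `is_cumulative` on the daily value can be seen in a reduced, pandas-free sketch. It covers only the grouping and aggregation step, not the suspicious counts; the helper name and records are made up.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean


def daily_values(records: list[tuple[datetime, float]], is_cumulative: bool) -> dict[date, float]:
    """Group (time, value) records by day and aggregate by sum or mean."""
    by_day = defaultdict(list)
    for time, value in records:
        by_day[time.date()].append(value)
    aggregate = sum if is_cumulative else mean
    return {day: aggregate(values) for day, values in sorted(by_day.items())}


records = [
    (datetime(2024, 1, 1, 0), 1.0),
    (datetime(2024, 1, 1, 12), 3.0),
    (datetime(2024, 1, 2, 0), 2.0),
]
totals = daily_values(records, is_cumulative=True)  # e.g. precipitation
averages = daily_values(records, is_cumulative=False)  # e.g. air temperature
```

For 1 January the cumulative result is 4.0 while the average is 2.0; for 2 January both are 2.0.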

`generate_validation_report(station, variable, start_time, end_time, maximum, minimum, is_validated=False)` ¤

Generates a report of the data.

Parameters:

Name	Type	Description	Default
`station`	`str`	Station of interest.	required
`variable`	`str`	Variable of interest.	required
`start_time`	`str`	Start time.	required
`end_time`	`str`	End time.	required
`maximum`	`Decimal`	The maximum allowed value.	required
`minimum`	`Decimal`	The minimum allowed value.	required
`is_validated`	`bool`	Whether to retrieve validated or non-validated data.	`False`

Returns:

Type	Description
`tuple[DataFrame, DataFrame]`	A tuple with the summary report and the granular report.

Source code in measurement\validation.py

def generate_validation_report(
    station: str,
    variable: str,
    start_time: str,
    end_time: str,
    maximum: Decimal,
    minimum: Decimal,
    is_validated: bool = False,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Generates a report of the data.

    Args:
        station: Station of interest.
        variable: Variable of interest.
        start_time: Start time.
        end_time: End time.
        maximum: The maximum allowed value.
        minimum: The minimum allowed value.
        is_validated: Whether to retrieve validated or non-validated data.

    Returns:
        A tuple with the summary report and the granular report.
    """
    var = Variable.objects.get(variable_code=variable)

    data = get_data_to_validate(station, variable, start_time, end_time, is_validated)
    if data.empty:
        return pd.DataFrame(), pd.DataFrame()

    suspicious = flag_suspicious_data(data, maximum, minimum, var.diff_error)
    summary = generate_daily_summary(
        data, suspicious, var.null_limit, var.is_cumulative
    )
    granular = pd.concat([data, suspicious], axis=1)
    return summary, granular

`get_data_to_validate(station, variable, start_time, end_time, is_validated=False)` ¤

Retrieves data to be validated.

Parameters:

Name	Type	Description	Default
`station`	`str`	Station of interest.	required
`variable`	`str`	Variable of interest.	required
`start_time`	`str`	Start time.	required
`end_time`	`str`	End time.	required
`is_validated`	`bool`	Whether to retrieve validated or non-validated data.	`False`

Returns:

Type	Description
`DataFrame`	A dataframe with the measurements to be validated, sorted by time.

Source code in measurement\validation.py

def get_data_to_validate(
    station: str,
    variable: str,
    start_time: str,
    end_time: str,
    is_validated: bool = False,
) -> pd.DataFrame:
    """Retrieves data to be validated.

    Args:
        station: Station of interest.
        variable: Variable of interest.
        start_time: Start time.
        end_time: End time.
        is_validated: Whether to retrieve validated or non-validated data.

    Returns:
        A dataframe with the measurements to be validated, sorted by time.
    """
    tz = timezone.get_current_timezone()
    start_time_ = datetime.strptime(start_time, "%Y-%m-%d").replace(tzinfo=tz)
    end_time_ = datetime.strptime(end_time, "%Y-%m-%d").replace(tzinfo=tz)

    df = pd.DataFrame.from_records(
        Measurement.objects.filter(
            station__station_code=station,
            variable__variable_code=variable,
            time__date__range=(start_time_.date(), end_time_.date()),
            is_validated=is_validated,
        ).values()
    )

    if df.empty:
        return df

    df["time"] = df["time"].dt.tz_convert(tz)
    return df.sort_values("time")
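
The date handling can be illustrated without Django. The sketch below assumes a fixed UTC-5 offset in place of `timezone.get_current_timezone()` and a hypothetical `in_range` helper that approximates the `time__date__range` lookup by comparing local dates. Both ends of the range are inclusive.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offset, standing in for Django's current timezone.
tz = timezone(timedelta(hours=-5))


def parse_day(text: str) -> datetime:
    """Parse a YYYY-MM-DD string as midnight in the assumed timezone."""
    return datetime.strptime(text, "%Y-%m-%d").replace(tzinfo=tz)


start = parse_day("2024-01-01")
end = parse_day("2024-01-03")


def in_range(moment: datetime) -> bool:
    """Approximate the date-range filter: compare dates in the local timezone."""
    return start.date() <= moment.astimezone(tz).date() <= end.date()


last_minute = in_range(datetime(2024, 1, 3, 23, 59, tzinfo=tz))
next_day = in_range(datetime(2024, 1, 4, 0, 0, tzinfo=tz))
# 02:00 UTC on 4 January is still 21:00 on 3 January at UTC-5.
utc_evening = in_range(datetime(2024, 1, 4, 2, 0, tzinfo=timezone.utc))
```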

`reset_validated_days(station, variable, start_date, end_date)` ¤

Resets validation and active status for the selected data.

It also deletes the associated report data.

TODO: should this also reset any modified value, minimum or maximum entries?

Parameters:

Name	Type	Description	Default
`station`	`str`	Station code	required
`variable`	`str`	Variable code	required
`start_date`	`str`	Start date	required
`end_date`	`str`	End date	required

Source code in measurement\validation.py

def reset_validated_days(
    station: str, variable: str, start_date: str, end_date: str
) -> None:
    """Resets validation and active status for the selected data.

    It also deletes the associated report data.

    TODO: should this also reset any modified value, minimum or maximum entries?

    Args:
        station (str): Station code
        variable (str): Variable code
        start_date (str): Start date
        end_date (str): End date
    """
    tz = timezone.get_current_timezone()

    # To update we use the exact date range.
    start_date_ = datetime.strptime(start_date, "%Y-%m-%d").replace(tzinfo=tz)
    end_date_ = datetime.strptime(end_date, "%Y-%m-%d").replace(tzinfo=tz)
    Measurement.objects.filter(
        station__station_code=station,
        variable__variable_code=variable,
        time__date__range=(start_date_.date(), end_date_.date()),
    ).update(is_validated=False, is_active=True)

    # To remove reports we use an extended date range to include the whole month.
    start_date_, end_date_ = reporting.reformat_dates(start_date, end_date)
    reporting.remove_report_data_in_range(station, variable, start_date_, end_date_)
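
The source of `reporting.reformat_dates` is not shown here. Purely as an illustration of what "extending a date range to include the whole month" could look like, the following sketch uses a hypothetical `month_bounds` helper; the real function may take different arguments or return different types.

```python
import calendar
from datetime import date


def month_bounds(start: date, end: date) -> tuple[date, date]:
    """Hypothetical helper: widen a range to the first and last day of its months."""
    first = start.replace(day=1)
    last_day = calendar.monthrange(end.year, end.month)[1]
    return first, end.replace(day=last_day)


# 10 February to 5 March 2024 widens to 1 February to 31 March 2024.
bounds = month_bounds(date(2024, 2, 10), date(2024, 3, 5))
```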

`reset_validated_entries(ids)` ¤

Resets validation and activation status for the selected data.

TODO: should this also reset any modified value, minimum or maximum entries?

Parameters:

Name	Type	Description	Default
`ids`	`list`	List of measurement ids to reset.	required

Source code in measurement\validation.py

def reset_validated_entries(ids: list) -> None:
    """Resets validation and activation status for the selected data.

    TODO: should this also reset any modified value, minimum or maximum entries?

    Args:
        ids (list): List of measurement ids to reset.
    """
    times: list[datetime] = []
    for _id in ids:
        current = Measurement.objects.get(id=_id)
        current.is_validated = False
        current.is_active = True
        current.save()
        times.append(current.time)

    station = current.station.station_code
    variable = current.variable.variable_code
    start_time, end_time = reporting.reformat_dates(
        to_local_time(min(times)).strftime("%Y-%m-%d"),
        to_local_time(max(times)).strftime("%Y-%m-%d"),
    )

    reporting.remove_report_data_in_range(station, variable, start_time, end_time)

`save_validated_days(data)` ¤

Saves the validated days to the database and launches the report calculation.

Only the data that is flagged as "validate?" will be saved. The only updated field is is_active. To update the value, maximum or minimum, use save_validated_entries.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	The dataframe with the validated data.	required

Source code in measurement\validation.py

def save_validated_days(data: pd.DataFrame) -> None:
    """Saves the validated days to the database and launches the report calculation.

    Only the data that is flagged as "validate?" will be saved. The only updated field
    is is_active. To update the value, maximum or minimum, use save_validated_entries.

    Args:
        data: The dataframe with the validated data.
    """
    tz = timezone.get_current_timezone()
    validate = data[data["validate?"]]
    for _, row in validate.iterrows():
        day = datetime.strptime(row["date"], "%Y-%m-%d").replace(tzinfo=tz)
        Measurement.objects.filter(
            station__station_code=row["station"],
            variable__variable_code=row["variable"],
            time__date=day.date(),
        ).update(is_validated=True, is_active=not row["deactivate?"])

    station = validate["station"].iloc[0]
    variable = validate["variable"].iloc[0]
    start_time = validate["date"].min()
    end_time = validate["date"].max()

    try:
        reporting.launch_reports_calculation(station, variable, start_time, end_time)
    except Exception as e:
        reset_validated_days(station, variable, start_time, end_time)
        raise e
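
The function follows a save, then compute, then roll back on failure pattern. A generic sketch of that pattern, with hypothetical callables standing in for the database update, `reporting.launch_reports_calculation` and `reset_validated_days`:

```python
from typing import Callable


def save_with_rollback(
    save: Callable[[], None],
    launch_reports: Callable[[], None],
    rollback: Callable[[], None],
) -> None:
    """Save first; if the follow-up step fails, undo the save and re-raise."""
    save()
    try:
        launch_reports()
    except Exception:
        rollback()
        raise


log: list[str] = []


def failing_reports() -> None:
    raise RuntimeError("report calculation failed")


try:
    save_with_rollback(
        lambda: log.append("saved"),
        failing_reports,
        lambda: log.append("rolled back"),
    )
except RuntimeError as error:
    log.append(f"re-raised: {error}")
```

The caller still sees the original exception, but the validation flags are reset before it propagates.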

`save_validated_entries(data)` ¤

Saves the validated data to the database.

Only the data that is flagged as "validate?" will be saved. Possible updated fields are: value, maximum, minimum and is_active.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	The dataframe with the validated data.	required

Source code in measurement\validation.py

def save_validated_entries(data: pd.DataFrame) -> None:
    """Saves the validated data to the database.

    Only the data that is flagged as "validate?" will be saved. Possible updated fields
    are: value, maximum, minimum and is_active.

    Args:
        data: The dataframe with the validated data.
    """
    times: list[datetime] = []
    for _, row in data[data["validate?"]].iterrows():
        current = Measurement.objects.get(id=row["id"])
        times.append(current.time)

        update = {"is_validated": True, "is_active": not row["deactivate?"]}
        if current.value != row["value"]:
            update["value"] = row["value"]
        if "maximum" in row and current.maximum != row["maximum"]:
            update["maximum"] = row["maximum"]
        if "minimum" in row and current.minimum != row["minimum"]:
            update["minimum"] = row["minimum"]

        Measurement.objects.filter(id=row["id"]).update(**update)

    tz = timezone.get_current_timezone()
    station = current.station.station_code
    variable = current.variable.variable_code
    start_time = min(times).astimezone(tz).strftime("%Y-%m-%d")
    end_time = max(times).astimezone(tz).strftime("%Y-%m-%d")

    try:
        reporting.launch_reports_calculation(station, variable, start_time, end_time)
    except Exception as e:
        ids = data[data["validate?"]]["id"].tolist()
        reset_validated_entries(ids)
        raise e
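
The update dictionary only carries the fields that actually changed. The sketch below isolates that step with plain dictionaries in place of the model instance and the dataframe row. `build_update` is a hypothetical helper, and unlike the code above it also guards `value` with an `in` check.

```python
from decimal import Decimal


def build_update(current: dict, row: dict) -> dict:
    """Collect the validation flags plus any field whose value has changed."""
    update = {"is_validated": True, "is_active": not row["deactivate?"]}
    for field in ("value", "maximum", "minimum"):
        if field in row and current[field] != row[field]:
            update[field] = row[field]
    return update


current = {
    "value": Decimal("10.0"),
    "maximum": Decimal("12.0"),
    "minimum": Decimal("8.0"),
}
# The row keeps the value, changes the maximum and has no minimum column.
row = {"value": Decimal("10.0"), "maximum": Decimal("11.5"), "deactivate?": False}
update = build_update(current, row)
```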
