validation

measurement.validation ¤

Classes¤

Measurement ¤

Bases: MeasurementBase

Class to store the measurements and their validation status.

This class holds the value of a given variable and station at a specific time, as well as auxiliary information such as maximum and minimum values, depth, and direction (for vector quantities). Each of these has a raw version that keeps a backup of the original data, should it change at any point.

It also includes flags that record its validation status, whether the data is active (and can therefore be used for reporting), and whether it has actually been used for reporting.

Attributes:

Name Type Description
depth int

Depth of the measurement.

direction Decimal

Direction of the measurement, useful for vector quantities.

raw_value Decimal

Original value of the measurement.

raw_maximum Decimal

Original maximum value of the measurement.

raw_minimum Decimal

Original minimum value of the measurement.

raw_direction Decimal

Original direction of the measurement.

raw_depth int

Original depth of the measurement.

is_validated bool

Flag to indicate if the measurement has been validated.

is_active bool

Flag to indicate if the measurement is active. An inactive measurement is not used for reporting.

Attributes¤
overwritten: bool property ¤

Indicates if any of the values associated to the entry have been overwritten.

Returns:

Name Type Description
bool bool

True if any raw field is different to the corresponding standard field.

raws: tuple[str, ...] property ¤

Return the raw fields of the measurement.

Returns:

Type Description
tuple[str, ...]

Tuple with the names of the raw fields of the measurement.
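
As a rough illustration of how `raws` and `overwritten` relate, the sketch below uses a plain dataclass instead of the Django model. The class name `MeasurementSketch` and its reduced set of fields are assumptions made for this example; the real model may implement both properties differently.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class MeasurementSketch:
    """Hypothetical stand-in for the model, with only two value/raw pairs."""

    value: Optional[Decimal] = None
    maximum: Optional[Decimal] = None
    raw_value: Optional[Decimal] = None
    raw_maximum: Optional[Decimal] = None

    @property
    def raws(self) -> tuple[str, ...]:
        # Names of every field whose name starts with "raw_".
        return tuple(
            name for name in self.__dataclass_fields__ if name.startswith("raw_")
        )

    @property
    def overwritten(self) -> bool:
        # True if any raw field differs from its standard counterpart.
        return any(
            getattr(self, raw) != getattr(self, raw.removeprefix("raw_"))
            for raw in self.raws
        )


entry = MeasurementSketch(value=Decimal("1.5"), raw_value=Decimal("1.5"))
unchanged = entry.overwritten  # raw and standard fields still agree
entry.value = Decimal("2.0")  # simulate a manual correction
changed = entry.overwritten  # value no longer matches raw_value
```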

Functions¤
clean() ¤

Check consistency of validation and reporting, and back up values.

Source code in measurement/models.py
def clean(self) -> None:
    """Check consistency of validation and reporting, and back up values."""
    # Check consistency of validation
    if not self.is_validated and not self.is_active:
        raise ValidationError("Only validated entries can be declared as inactive.")

    # Backup values to raws, if needed
    for r in self.raws:
        value = getattr(self, r.removeprefix("raw_"))
        if value and not getattr(self, r):
            setattr(self, r, value)
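
The behaviour of `clean()` can be tried outside Django with a small stand-in object. In this sketch the `Entry` class, its two raw fields and the local `ValidationError` are assumptions for illustration only; the body of `clean()` repeats the logic of the listing above. It also shows a consequence of the truthiness test: a value of `0` is not copied to its raw field.

```python
from decimal import Decimal


class ValidationError(Exception):
    """Stand-in for django.core.exceptions.ValidationError."""


class Entry:
    """Hypothetical non-Django object with the attributes clean() relies on."""

    raws = ("raw_value", "raw_depth")

    def __init__(self, value=None, depth=None, is_validated=False, is_active=True):
        self.value, self.depth = value, depth
        self.raw_value, self.raw_depth = None, None
        self.is_validated, self.is_active = is_validated, is_active

    def clean(self) -> None:
        # An entry that is not validated cannot be inactive.
        if not self.is_validated and not self.is_active:
            raise ValidationError("Only validated entries can be declared as inactive.")
        # Copy each standard field into its raw field if the raw field is empty.
        for r in self.raws:
            value = getattr(self, r.removeprefix("raw_"))
            if value and not getattr(self, r):
                setattr(self, r, value)


entry = Entry(value=Decimal("12.3"), depth=0)
entry.clean()
# raw_value now mirrors value; raw_depth stays None because depth == 0 is falsy.

try:
    Entry(value=Decimal("1"), is_validated=False, is_active=False).clean()
    rejected = False
except ValidationError:
    rejected = True
```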

Variable ¤

Bases: PermissionsBase

A variable with a physical meaning.

Examples are precipitation, wind speed, wind direction and soil moisture. Each variable includes its associated unit, as well as metadata that helps identify reasonable values for the data, flag outliers and support the validation process.

The nature of the variable can be one of the following:

  • sum: Cumulative value over a period of time.
  • average: Average value over a period of time.
  • value: One-off value.

Attributes:

Name Type Description
variable_id AutoField

Primary key.

variable_code CharField

Code of the variable, e.g. airtemperature.

name CharField

Human-readable name of the variable, e.g. Air temperature.

unit ForeignKey

Unit of the variable.

maximum DecimalField

Maximum value allowed for the variable.

minimum DecimalField

Minimum value allowed for the variable.

diff_error DecimalField

If two sequential values in the time-series data of this variable differ by more than this value, the validation process can mark this with an error flag.

outlier_limit DecimalField

The statistical deviation for defining outliers, in times the standard deviation (sigma).

null_limit DecimalField

The max % of null values (missing, caused by e.g. equipment malfunction) allowed for hourly, daily, monthly data. Cumulative values are not deemed trustworthy if the number of missing values in a given period is greater than the null_limit.

nature CharField

Nature of the variable, e.g. whether it represents a one-off value, the average over a period of time or the cumulative value over a period.

Attributes¤
is_cumulative: bool property ¤

Return True if the nature of the variable is sum.
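
The flag matters when values are aggregated: cumulative variables are summed, the rest are averaged. The following is a minimal sketch of that choice; the `aggregate` helper and the sample values are hypothetical and not part of the module.

```python
from statistics import mean


def aggregate(values: list[float], nature: str) -> float:
    """Hypothetical helper: sum cumulative variables, average the others."""
    is_cumulative = nature == "sum"
    return sum(values) if is_cumulative else mean(values)


rain_mm = aggregate([0.0, 1.5, 2.5], nature="sum")  # total over the period
air_temp = aggregate([10.0, 12.0, 14.0], nature="average")  # mean over the period
```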

Functions¤
__str__() ¤

Return the string representation of the object.

Source code in variable/models.py
def __str__(self) -> str:
    """Return the string representation of the object."""
    return str(self.name)
clean() ¤

Validate the model fields.

Source code in variable/models.py
def clean(self) -> None:
    """Validate the model fields."""
    if self.maximum < self.minimum:
        raise ValidationError(
            {
                "maximum": "The maximum value must be greater than the minimum "
                "value."
            }
        )
    if not self.variable_code.isidentifier():
        raise ValidationError(
            {
                "variable_code": "The variable code must be a valid Python "
                "identifier. Only letters, numbers and underscores are allowed, and"
                " it cannot start with a number."
            }
        )
    return super().clean()
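
To see which codes the identifier check accepts, the short sketch below applies `str.isidentifier()` to a few sample codes (the codes themselves are made up). Note that reserved words such as `class` pass `isidentifier()`; the additional `keyword.iskeyword` test is shown only for illustration and is not part of the model's validation.

```python
import keyword

codes = ["airtemperature", "wind_speed", "2m_temperature", "soil-moisture", "class"]

# Codes that would pass the check in clean().
accepted = [code for code in codes if code.isidentifier()]

# Accepted codes that are also Python reserved words.
reserved = [code for code in accepted if keyword.iskeyword(code)]
```
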
get_absolute_url() ¤

Get the absolute URL of the object.

Source code in variable/models.py
def get_absolute_url(self) -> str:
    """Get the absolute URL of the object."""
    return reverse("variable:variable_detail", kwargs={"pk": self.pk})

Functions¤

flag_suspicious_daily_count(data, null_limit) ¤

Flags the days whose number of records is suspicious.

Parameters:

Name Type Description Default
data Series

The count of records per day.

required
null_limit Decimal

The percentage of null data allowed.

required

Returns:

Type Description
DataFrame

A dataframe with the suspicious data.

Source code in measurement/validation.py
def flag_suspicious_daily_count(data: pd.Series, null_limit: Decimal) -> pd.DataFrame:
    """Finds suspicious records count for daily data.

    Args:
        data: The count of records per day.
        null_limit: The percentage of null data allowed.
    Returns:
        A dataframe with the suspicious data.
    """
    expected_data_count = data.mode().iloc[0]

    suspicious = pd.DataFrame(index=data.index)
    suspicious["daily_count_fraction"] = (data / expected_data_count).round(2)

    suspicious["suspicious_daily_count"] = (
        suspicious["daily_count_fraction"] < 1 - float(null_limit) / 100
    ) | (suspicious["daily_count_fraction"] > 1)

    return suspicious
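
A worked example may help to read the thresholds. The function body is repeated from the listing above so the snippet runs on its own; the daily counts and the 10% `null_limit` are invented values.

```python
from decimal import Decimal

import pandas as pd


def flag_suspicious_daily_count(data: pd.Series, null_limit: Decimal) -> pd.DataFrame:
    # The most common daily count is taken as the expected one.
    expected_data_count = data.mode().iloc[0]
    suspicious = pd.DataFrame(index=data.index)
    suspicious["daily_count_fraction"] = (data / expected_data_count).round(2)
    # Too few records (below 1 - null_limit) or too many (above 1) are flagged.
    suspicious["suspicious_daily_count"] = (
        suspicious["daily_count_fraction"] < 1 - float(null_limit) / 100
    ) | (suspicious["daily_count_fraction"] > 1)
    return suspicious


days = ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"]
counts = pd.Series([288, 288, 250, 288, 300], index=days)
result = flag_suspicious_daily_count(counts, Decimal("10"))
flags = result["suspicious_daily_count"].tolist()
```

Here the expected count is 288, so the day with 250 records (fraction 0.87, below 0.90) and the day with 300 records (fraction above 1) are the two flagged days.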

flag_suspicious_data(data, maximum, minimum, allowed_difference) ¤

Finds suspicious data in the database.

Parameters:

Name Type Description Default
data DataFrame

The dataframe with the data to be evaluated.

required
maximum Decimal

The maximum allowed value.

required
minimum Decimal

The minimum allowed value.

required
allowed_difference Decimal

The allowed difference between the measurements.

required

Returns:

Type Description
DataFrame

A dataframe with the suspicious data.

Source code in measurement/validation.py
def flag_suspicious_data(
    data: pd.DataFrame,
    maximum: Decimal,
    minimum: Decimal,
    allowed_difference: Decimal,
) -> pd.DataFrame:
    """Finds suspicious data in the database.

    Args:
        data: The dataframe with the data to be evaluated.
        maximum: The maximum allowed value.
        minimum: The minimum allowed value.
        allowed_difference: The allowed difference between the measurements.

    Returns:
        A dataframe with the suspicious data.
    """
    time_lapse = flag_time_lapse_status(data)
    value_difference = flag_value_difference(data, allowed_difference)
    value_limits = flag_value_limits(data, maximum, minimum)
    return pd.concat([time_lapse, value_difference, value_limits], axis=1)

flag_time_lapse_status(data) ¤

Flags whether the interval between consecutive time entries is correct.

It is assumed that the first entry is correct. A tolerance of 2% of the period is used when deciding on the suspicious status. The period is the mode of the time differences.

Parameters:

Name Type Description Default
data DataFrame

The dataframe with the data.

required

Returns:

Type Description
Series

A series with the status of the time lapse.

Source code in measurement/validation.py
def flag_time_lapse_status(data: pd.DataFrame) -> pd.Series:
    """Flags if period of the time entries is correct.

    It is assumed that the first entry is correct. A tolerance of 2% of the period
    is used when deciding on the suspicious status. The period is the mode of the
    time differences.

    Args:
        data: The dataframe with the data.

    Returns:
        A series with the status of the time lapse.
    """
    period = data.time.diff().mode().iloc[0]
    flags = pd.DataFrame(index=data.index, columns=["suspicious_time_lapse"])
    low = pd.Timedelta(f"{period}min") * (1 - 0.02)
    high = pd.Timedelta(f"{period}min") * (1 + 0.02)
    flags["suspicious_time_lapse"] = ~data.time.diff().between(
        low, high, inclusive="both"
    )
    flags["suspicious_time_lapse"].iloc[0] = False
    return flags
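
The tolerance rule can be illustrated on a toy series. This is a simplified sketch rather than the function above: it works on a plain `Series` of timestamps and scales the `Timedelta` period directly, both of which are assumptions made to keep the example short.

```python
import pandas as pd

times = pd.Series(
    pd.to_datetime(
        [
            "2023-01-01 00:00",
            "2023-01-01 00:05",
            "2023-01-01 00:10",
            "2023-01-01 00:20",  # a 10-minute gap in a 5-minute series
            "2023-01-01 00:25",
        ]
    )
)

step = times.diff()  # time elapsed since the previous entry
period = step.mode().iloc[0]  # most common step, here 5 minutes
low, high = period * 0.98, period * 1.02  # 2% tolerance band

# Anything outside the band is suspicious; the first entry is assumed correct.
suspicious = ~step.between(low, high, inclusive="both")
suspicious.iloc[0] = False
```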

flag_value_difference(data, allowed_difference) ¤

Flags whether the difference in value between consecutive measurements is within the allowed limit.

It is assumed that the first entry is correct.

Parameters:

Name Type Description Default
data DataFrame

The dataframe with the data.

required
allowed_difference Decimal

The allowed difference between the measurements.

required

Returns:

Type Description
Series

A series with the status of the value.

Source code in measurement/validation.py
def flag_value_difference(data: pd.DataFrame, allowed_difference: Decimal) -> pd.Series:
    """Flags if the differences in value of the measurements is correct.

    It is assumed that the first entry is correct.

    Args:
        data: The dataframe with the data.
        allowed_difference: The allowed difference between the measurements.

    Returns:
        A series with the status of the value.
    """
    flags = pd.DataFrame(index=data.index, columns=["suspicious_value_difference"])
    flags["suspicious_value_difference"] = data["value"].diff().abs() > float(
        allowed_difference
    )
    flags["suspicious_value_difference"].iloc[0] = False
    return flags
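
As a small illustration of the same comparison, the sketch below flags jumps larger than an allowed difference of 2 in an invented series of values. It operates on a bare `Series` instead of the dataframe the function expects.

```python
from decimal import Decimal

import pandas as pd

values = pd.Series([10.0, 10.4, 15.0, 15.2])
allowed_difference = Decimal("2")  # invented limit

# Absolute change with respect to the previous entry, compared to the limit.
jump = values.diff().abs() > float(allowed_difference)
jump.iloc[0] = False  # the first entry has no predecessor and is assumed correct
```

Only the third entry is flagged, since it differs from the previous one by 4.6.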

flag_value_limits(data, maximum, minimum) ¤

Flags whether the values, maxima and minima of the measurements are within the allowed limits.

Parameters:

Name Type Description Default
data DataFrame

The dataframe with the data.

required
maximum Decimal

The maximum allowed value.

required
minimum Decimal

The minimum allowed value.

required

Returns:

Type Description
DataFrame

A dataframe with suspicious columns indicating a problem.

Source code in measurement/validation.py
def flag_value_limits(
    data: pd.DataFrame, maximum: Decimal, minimum: Decimal
) -> pd.DataFrame:
    """Flags if the values and limits of the measurements are within limits.

    Args:
        data: The dataframe with the data.
        maximum: The maximum allowed value.
        minimum: The minimum allowed value.

    Returns:
        A dataframe with suspicious columns indicating a problem.
    """
    flags = pd.DataFrame(index=data.index)
    flags["suspicious_value_limits"] = (data["value"] < minimum) | (
        data["value"] > maximum
    )
    if "maximum" in data.columns:
        flags["suspicious_maximum_limits"] = (data["maximum"] < minimum) | (
            data["maximum"] > maximum
        )
    if "minimum" in data.columns:
        flags["suspicious_minimum_limits"] = (data["minimum"] < minimum) | (
            data["minimum"] > maximum
        )
    return flags
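
The sketch below reproduces the limit checks in a compact loop on invented data, to show which output columns appear. The limits of -10 and 50 and the sample rows are assumptions, and the limits are converted to `float` here to sidestep mixing `Decimal` with float columns.

```python
from decimal import Decimal

import pandas as pd

data = pd.DataFrame(
    {"value": [5.0, 45.0, -12.0], "maximum": [6.0, 51.0, -10.0]}
)
maximum, minimum = float(Decimal("50")), float(Decimal("-10"))

flags = pd.DataFrame(index=data.index)
for column in ("value", "maximum", "minimum"):
    # Optional columns are only checked when present in the data.
    if column in data.columns:
        flags[f"suspicious_{column}_limits"] = (data[column] < minimum) | (
            data[column] > maximum
        )
```

Because the sample data has no `minimum` column, the result only contains `suspicious_value_limits` and `suspicious_maximum_limits`.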

generate_daily_summary(data, suspicious, null_limit, is_cumulative) ¤

Generates a daily report of the data.

Parameters:

Name Type Description Default
data DataFrame

The dataframe with the data to be evaluated.

required
suspicious DataFrame

The dataframe with the suspicious data.

required
null_limit Decimal

The percentage of null data allowed.

required
is_cumulative bool

If the data is cumulative and should be aggregated by sum.

required

Returns:

Type Description
DataFrame

A dataframe with the daily report.

Source code in measurement/validation.py
def generate_daily_summary(
    data: pd.DataFrame,
    suspicious: pd.DataFrame,
    null_limit: Decimal,
    is_cumulative: bool,
) -> pd.DataFrame:
    """Generates a daily report of the data.

    Args:
        data: The dataframe with the data to be evaluated.
        suspicious: The dataframe with the suspicious data.
        null_limit: The percentage of null data allowed.
        is_cumulative: If the data is cumulative and should be aggregated by sum.

    Returns:
        A dataframe with the daily report.
    """
    report = pd.DataFrame(index=data.time.dt.date.unique())

    # Group the data by day and calculate the mean or sum
    datagroup = data.groupby(data.time.dt.date)
    report["value"] = (
        datagroup["value"].sum() if is_cumulative else datagroup["value"].mean()
    )

    if "maximum" in data.columns:
        report["maximum"] = datagroup["maximum"].max()
    if "minimum" in data.columns:
        report["minimum"] = datagroup["minimum"].min()

    # Count the number of entries per day and flag the suspicious ones
    count_report = flag_suspicious_daily_count(datagroup["value"].count(), null_limit)

    # Group the suspicious data by day and calculate the sum
    suspiciousgroup = suspicious.groupby(data.time.dt.date)
    suspicious_report = suspiciousgroup.sum().astype(int)
    suspicious_report["total_suspicious_entries"] = suspicious_report.sum(axis=1)

    # Put together the final report
    report = pd.concat([report, suspicious_report, count_report], axis=1)
    report = report.sort_index().reset_index().rename(columns={"index": "date"})
    report.date = pd.to_datetime(report.date)
    return report
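
The grouping steps can be followed on a tiny dataset. This sketch covers only the daily aggregation and the per-day count of suspicious entries, not the full report; the timestamps, values and the single flag column are invented.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2023-01-01 00:00", "2023-01-01 12:00", "2023-01-02 00:00"]
        ),
        "value": [1.0, 3.0, 5.0],
    }
)
suspicious = pd.DataFrame(
    {"suspicious_value_limits": [False, True, False]}, index=data.index
)

# Group the measurements by calendar day.
by_day = data.groupby(data.time.dt.date)
daily_mean = by_day["value"].mean()  # used for non-cumulative variables
daily_sum = by_day["value"].sum()  # used for cumulative variables

# Count the flagged entries per day, then add a total across all flag columns.
flag_counts = suspicious.groupby(data.time.dt.date).sum().astype(int)
flag_counts["total_suspicious_entries"] = flag_counts.sum(axis=1)
```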

generate_validation_report(station, variable, start_time, end_time, maximum, minimum, is_validated=False) ¤

Generates a report of the data.

Parameters:

Name Type Description Default
station str

Station of interest.

required
variable str

Variable of interest.

required
start_time str

Start time.

required
end_time str

End time.

required
maximum Decimal

The maximum allowed value.

required
minimum Decimal

The minimum allowed value.

required
is_validated bool

Whether to retrieve validated or non-validated data.

False

Returns:

Type Description
tuple[DataFrame, DataFrame]

A tuple with the summary report and the granular report.

Source code in measurement/validation.py
def generate_validation_report(
    station: str,
    variable: str,
    start_time: str,
    end_time: str,
    maximum: Decimal,
    minimum: Decimal,
    is_validated: bool = False,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Generates a report of the data.

    Args:
        station: Station of interest.
        variable: Variable of interest.
        start_time: Start time.
        end_time: End time.
        maximum: The maximum allowed value.
        minimum: The minimum allowed value.
        is_validated: Whether to retrieve validated or non-validated data.

    Returns:
        A tuple with the summary report and the granular report.
    """
    var = Variable.objects.get(variable_code=variable)

    data = get_data_to_validate(station, variable, start_time, end_time, is_validated)
    if data.empty:
        return pd.DataFrame(), pd.DataFrame()

    suspicious = flag_suspicious_data(data, maximum, minimum, var.diff_error)
    summary = generate_daily_summary(
        data, suspicious, var.null_limit, var.is_cumulative
    )
    granular = pd.concat([data, suspicious], axis=1)
    return summary, granular

get_data_to_validate(station, variable, start_time, end_time, is_validated=False) ¤

Retrieves data to be validated.

Parameters:

Name Type Description Default
station str

Station of interest.

required
variable str

Variable of interest.

required
start_time str

Start time.

required
end_time str

End time.

required
is_validated bool

Whether to retrieve validated or non-validated data.

False

Returns:

Type Description
DataFrame

A dataframe with the measurements for the chosen period, sorted by time.

Source code in measurement/validation.py
def get_data_to_validate(
    station: str,
    variable: str,
    start_time: str,
    end_time: str,
    is_validated: bool = False,
) -> pd.DataFrame:
    """Retrieves data to be validated.

    Args:
        station: Station of interest.
        variable: Variable of interest.
        start_time: Start time.
        end_time: End time.
        is_validated: Whether to retrieve validated or non-validated data.

    Returns:
        A dataframe with the measurements for the chosen period, sorted by time.
    """
    tz = timezone.get_current_timezone()
    start_time_ = datetime.strptime(start_time, "%Y-%m-%d").replace(tzinfo=tz)
    end_time_ = datetime.strptime(end_time, "%Y-%m-%d").replace(tzinfo=tz)

    df = pd.DataFrame.from_records(
        Measurement.objects.filter(
            station__station_code=station,
            variable__variable_code=variable,
            time__date__range=(start_time_.date(), end_time_.date()),
            is_validated=is_validated,
        ).values()
    )

    if df.empty:
        return df

    df["time"] = df["time"].dt.tz_convert(tz)
    return df.sort_values("time")
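
The date handling can be sketched without a database. The example parses the two date strings as in the function and then selects timestamps whose local date falls in the inclusive range, which is how the `time__date__range` lookup is understood to behave when time zone support is enabled. The fixed UTC-5 offset and the sample timestamps are assumptions.

```python
from datetime import datetime, timedelta, timezone

tz = timezone(timedelta(hours=-5))  # assumed local offset for the example

start = datetime.strptime("2023-03-01", "%Y-%m-%d").replace(tzinfo=tz)
end = datetime.strptime("2023-03-03", "%Y-%m-%d").replace(tzinfo=tz)
date_range = (start.date(), end.date())

stamps = [
    datetime(2023, 3, 1, 4, 59, tzinfo=timezone.utc),  # 2023-02-28 23:59 local
    datetime(2023, 3, 1, 5, 0, tzinfo=timezone.utc),  # 2023-03-01 00:00 local
    datetime(2023, 3, 4, 4, 59, tzinfo=timezone.utc),  # 2023-03-03 23:59 local
]

# Keep the timestamps whose local calendar date lies within the range.
selected = [
    t for t in stamps if date_range[0] <= t.astimezone(tz).date() <= date_range[1]
]
```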

reset_validated_days(station, variable, start_date, end_date) ¤

Resets validation and active status for the selected data.

It also deletes the associated report data.

TODO: should this also reset any modified value, minimum or maximum entries?

Parameters:

Name Type Description Default
station str

Station code

required
variable str

Variable code

required
start_date str

Start date

required
end_date str

End date

required
Source code in measurement/validation.py
def reset_validated_days(
    station: str, variable: str, start_date: str, end_date: str
) -> None:
    """Resets validation and active status for the selected data.

    It also deletes the associated report data.

    TODO: should this also reset any modified value, minimum or maximum entries?

    Args:
        station (str): Station code
        variable (str): Variable code
        start_date (str): Start date
        end_date (str): End date
    """
    tz = timezone.get_current_timezone()

    # To update we use the exact date range.
    start_date_ = datetime.strptime(start_date, "%Y-%m-%d").replace(tzinfo=tz)
    end_date_ = datetime.strptime(end_date, "%Y-%m-%d").replace(tzinfo=tz)
    Measurement.objects.filter(
        station__station_code=station,
        variable__variable_code=variable,
        time__date__range=(start_date_.date(), end_date_.date()),
    ).update(is_validated=False, is_active=True)

    # To remove reports we use an extended date range to include the whole month.
    start_date_, end_date_ = reporting.reformat_dates(start_date, end_date)
    reporting.remove_report_data_in_range(station, variable, start_date_, end_date_)
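
The comment about extending the range to whole months can be pictured as follows. The `whole_months` helper is hypothetical: it is not `reporting.reformat_dates`, whose implementation and return types are not shown here, and only illustrates the idea of widening a date range to month boundaries.

```python
import calendar
from datetime import date, datetime


def whole_months(start_date: str, end_date: str) -> tuple[date, date]:
    """Hypothetical helper: widen a date range to cover complete months."""
    start = datetime.strptime(start_date, "%Y-%m-%d").date().replace(day=1)
    end = datetime.strptime(end_date, "%Y-%m-%d").date()
    # monthrange returns (weekday of first day, number of days in the month).
    last_day = calendar.monthrange(end.year, end.month)[1]
    return start, end.replace(day=last_day)


extended = whole_months("2023-01-15", "2023-02-10")
```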

reset_validated_entries(ids) ¤

Resets validation and activation status for the selected data.

TODO: should this also reset any modified value, minimum or maximum entries?

Parameters:

Name Type Description Default
ids list

List of measurement ids to reset.

required
Source code in measurement/validation.py
def reset_validated_entries(ids: list) -> None:
    """Resets validation and activation status for the selected data.

    TODO: should this also reset any modified value, minimum or maximum entries?

    Args:
        ids (list): List of measurement ids to reset.
    """
    times: list[datetime] = []
    for _id in ids:
        current = Measurement.objects.get(id=_id)
        current.is_validated = False
        current.is_active = True
        current.save()
        times.append(current.time)

    station = current.station.station_code
    variable = current.variable.variable_code
    start_time, end_time = reporting.reformat_dates(
        to_local_time(min(times)).strftime("%Y-%m-%d"),
        to_local_time(max(times)).strftime("%Y-%m-%d"),
    )

    reporting.remove_report_data_in_range(station, variable, start_time, end_time)

save_validated_days(data) ¤

Saves the validated days to the database and launches the report calculation.

Only the data that is flagged as "validate?" will be saved. The only updated field is is_active. To update the value, maximum or minimum, use save_validated_entries.

Parameters:

Name Type Description Default
data DataFrame

The dataframe with the validated data.

required
Source code in measurement/validation.py
def save_validated_days(data: pd.DataFrame) -> None:
    """Saves the validated days to the database and launches the report calculation.

    Only the data that is flagged as "validate?" will be saved. The only updated field
    is is_active. To update the value, maximum or minimum, use save_validated_entries.

    Args:
        data: The dataframe with the validated data.
    """
    tz = timezone.get_current_timezone()
    validate = data[data["validate?"]]
    for _, row in validate.iterrows():
        day = datetime.strptime(row["date"], "%Y-%m-%d").replace(tzinfo=tz)
        Measurement.objects.filter(
            station__station_code=row["station"],
            variable__variable_code=row["variable"],
            time__date=day.date(),
        ).update(is_validated=True, is_active=not row["deactivate?"])

    station = validate["station"].iloc[0]
    variable = validate["variable"].iloc[0]
    start_time = validate["date"].min()
    end_time = validate["date"].max()

    try:
        reporting.launch_reports_calculation(station, variable, start_time, end_time)
    except Exception as e:
        reset_validated_days(station, variable, start_time, end_time)
        raise e
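
The shape of the input dataframe is easiest to see with an example. The sketch below builds one with the columns the function reads and derives, without touching the database, the flags that would be written for each selected day. The station code `ST01` and the rows are invented.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "station": ["ST01", "ST01", "ST01"],
        "variable": ["airtemperature"] * 3,
        "date": ["2023-01-01", "2023-01-02", "2023-01-03"],
        "validate?": [True, False, True],
        "deactivate?": [False, False, True],
    }
)

# Only the rows flagged with "validate?" are processed.
validate = data[data["validate?"]]
updates = [
    {"date": row["date"], "is_validated": True, "is_active": not row["deactivate?"]}
    for _, row in validate.iterrows()
]

# Date range passed on to the report calculation.
start_time, end_time = validate["date"].min(), validate["date"].max()
```

As listed, the function reads the station and variable from the first selected row, so it appears to assume that at least one row is flagged and that all rows share the same station and variable.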

save_validated_entries(data) ¤

Saves the validated data to the database.

Only the data that is flagged as "validate?" will be saved. Possible updated fields are: value, maximum, minimum and is_active.

Parameters:

Name Type Description Default
data DataFrame

The dataframe with the validated data.

required
Source code in measurement/validation.py
def save_validated_entries(data: pd.DataFrame) -> None:
    """Saves the validated data to the database.

    Only the data that is flagged as "validate?" will be saved. Possible updated fields
    are: value, maximum, minimum and is_active.

    Args:
        data: The dataframe with the validated data.
    """
    times: list[datetime] = []
    for _, row in data[data["validate?"]].iterrows():
        current = Measurement.objects.get(id=row["id"])
        times.append(current.time)

        update = {"is_validated": True, "is_active": not row["deactivate?"]}
        if current.value != row["value"]:
            update["value"] = row["value"]
        if "maximum" in row and current.maximum != row["maximum"]:
            update["maximum"] = row["maximum"]
        if "minimum" in row and current.minimum != row["minimum"]:
            update["minimum"] = row["minimum"]

        Measurement.objects.filter(id=row["id"]).update(**update)

    tz = timezone.get_current_timezone()
    station = current.station.station_code
    variable = current.variable.variable_code
    start_time = min(times).astimezone(tz).strftime("%Y-%m-%d")
    end_time = max(times).astimezone(tz).strftime("%Y-%m-%d")

    try:
        reporting.launch_reports_calculation(station, variable, start_time, end_time)
    except Exception as e:
        ids = data[data["validate?"]]["id"].tolist()
        reset_validated_entries(ids)
        raise e
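
The per-row update logic can be sketched with plain dictionaries. The `build_update` helper is hypothetical and slightly simplified: it treats `value` like the optional `maximum` and `minimum` columns, whereas the function above always compares `value`.

```python
from decimal import Decimal


def build_update(current: dict, row: dict) -> dict:
    """Hypothetical helper: collect only the fields that need to change."""
    update = {"is_validated": True, "is_active": not row["deactivate?"]}
    for field in ("value", "maximum", "minimum"):
        # A field is written only if it is present and differs from the stored one.
        if field in row and current.get(field) != row[field]:
            update[field] = row[field]
    return update


current = {"value": Decimal("10.0"), "maximum": Decimal("12.0")}
row = {"value": Decimal("10.0"), "maximum": Decimal("11.5"), "deactivate?": False}
update = build_update(current, row)
```

In this example only `maximum` is added to the update, because `value` is unchanged and the row has no `minimum`.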