pyhydroqc Sensor Data QC: Single Site Example

Authors:
Owners:		This resource does not have an owner who is an active HydroShare user. Contact CUAHSI (help@cuahsi.org) for information on this resource.
Type:	Resource
Storage:	The size of this resource is 1.5 MB
Created:	Dec 09, 2020 at 4:21 a.m. (UTC)
Last updated:	Mar 08, 2022 at midnight (UTC) (Metadata update)
Published date:	Mar 08, 2022 at midnight (UTC)
DOI:	10.4211/hs.92f393cbd06b47c398bdd2bbb86887ac

Sharing Status:	Published
Views:	3328
Downloads:	202

Abstract

This resource contains an example script for using the software package pyhydroqc. pyhydroqc was developed to identify and correct anomalous values in time series data collected by in situ aquatic sensors. For more information, see the code repository: https://github.com/AmberSJones/pyhydroqc and the documentation: https://ambersjones.github.io/pyhydroqc/. The package may be installed from the Python Package Index.

This script applies the functions to data from a single site in the Logan River Observatory, which is included in the repository. The data collected in the Logan River Observatory are sourced at http://lrodata.usu.edu/tsa/ or on HydroShare: https://www.hydroshare.org/search/?q=logan%20river%20observatory.

Anomaly detection methods include ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short-Term Memory). These are time series regression methods that detect anomalies by comparing model estimates to sensor observations and labeling points as anomalous when the difference between the two exceeds a threshold. There are multiple possible approaches for applying LSTM for anomaly detection/correction:
- Vanilla LSTM: uses past values of a single variable to estimate the next value of that variable.
- Multivariate Vanilla LSTM: uses past values of multiple variables to estimate the next value for all variables.
- Bidirectional LSTM: uses past and future values of a single variable to estimate a value for that variable at the time step of interest.
- Multivariate Bidirectional LSTM: uses past and future values of multiple variables to estimate a value for all variables at the time step of interest.
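
The difference between the vanilla and bidirectional approaches comes down to which neighboring values are used as model input. The following is a minimal sketch of that windowing step only, not the pyhydroqc implementation. The function names, the `time_steps` parameter, and the synthetic series are assumptions made for illustration.

```python
import numpy as np

def vanilla_windows(series, time_steps):
    """Pair each target value with the `time_steps` values that precede it."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i - time_steps:i] for i in range(time_steps, len(series))])
    y = series[time_steps:]
    return X, y

def bidirectional_windows(series, time_steps):
    """Pair each target value with `time_steps` values before and after it."""
    series = np.asarray(series, dtype=float)
    X = np.array([
        np.concatenate([series[i - time_steps:i], series[i + 1:i + 1 + time_steps]])
        for i in range(time_steps, len(series) - time_steps)
    ])
    y = series[time_steps:len(series) - time_steps]
    return X, y

# Synthetic stand-in for one sensor variable (hypothetical values).
values = np.arange(10.0)
Xv, yv = vanilla_windows(values, time_steps=3)
Xb, yb = bidirectional_windows(values, time_steps=3)
```

In this sketch the bidirectional form cannot estimate the first or last `time_steps` points, because those points lack a full window on one side. A multivariate variant would build the same windows from a two-dimensional array with one column per variable.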

The correction approach uses piecewise ARIMA models. Each group of consecutive anomalous points is considered as a unit to be corrected. Separate ARIMA models are developed for valid points preceding and following the anomalous group. Model estimates are blended to achieve a correction.
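
The blending step can be illustrated with a short sketch. The text above does not state how the two estimates are weighted, so the linear cross-fade used here is an assumption, as are the function name and the example values. In practice the forward estimate would come from an ARIMA model fit to the valid points before the anomalous group and the backward estimate from a model fit to the valid points after it; constant arrays stand in for those forecasts here.

```python
import numpy as np

def blend_forecasts(forward, backward):
    """Cross-fade two estimates of the same anomalous group (assumed linear weights)."""
    forward = np.asarray(forward, dtype=float)
    backward = np.asarray(backward, dtype=float)
    if forward.shape != backward.shape:
        raise ValueError("forward and backward estimates must have the same length")
    n = len(forward)
    # Weight on the forward estimate decreases across the group, so points near
    # the start rely on the preceding data and points near the end rely on the
    # following data.
    w = 1.0 - np.arange(1, n + 1) / (n + 1)
    return w * forward + (1.0 - w) * backward

# Hypothetical stand-ins for the two model forecasts over a 3-point group.
forward = np.array([10.0, 10.0, 10.0])
backward = np.array([14.0, 14.0, 14.0])
corrected = blend_forecasts(forward, backward)
```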

The anomaly detection and correction workflow involves the following steps:
1. Retrieving data
2. Applying rules-based detection to screen data and apply initial corrections
3. Identifying and correcting sensor drift and calibration (if applicable)
4. Developing a model (i.e., ARIMA or LSTM)
5. Applying model to make time series predictions
6. Determining a threshold and detecting anomalies by comparing sensor observations to modeled results
7. Widening the window over which an anomaly is identified
8. Aggregating detections resulting from multiple models
9. Making corrections for anomalous events
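
Steps 6 through 8 can be sketched with plain NumPy. This is an illustration under stated assumptions rather than the package's own functions: it uses a constant threshold where the workflow determines thresholds from the model results, it combines models with a logical OR, and all function names and values are hypothetical.

```python
import numpy as np

def detect(observed, predicted, threshold):
    """Step 6: flag points whose absolute residual exceeds the threshold."""
    residual = np.abs(np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float))
    return residual > threshold

def widen(flags, width):
    """Step 7: extend each detection by `width` points on either side."""
    flags = np.asarray(flags, dtype=bool)
    widened = flags.copy()
    for i in np.flatnonzero(flags):
        widened[max(0, i - width):i + width + 1] = True
    return widened

def aggregate(*flag_sets):
    """Step 8: a point is anomalous if any model flagged it (assumed rule)."""
    return np.logical_or.reduce([np.asarray(f, dtype=bool) for f in flag_sets])

# Synthetic observations and two hypothetical sets of model estimates.
observed = np.array([5.0, 5.1, 9.0, 5.2, 5.1, 5.0, 5.1])
model_a = np.array([5.0, 5.0, 5.1, 5.1, 5.1, 5.0, 5.0])
model_b = np.array([5.0, 5.1, 5.1, 5.2, 5.1, 7.0, 5.1])

flags_a = widen(detect(observed, model_a, threshold=1.0), width=1)
flags_b = widen(detect(observed, model_b, threshold=1.0), width=1)
combined = aggregate(flags_a, flags_b)
```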

Instructions to run the notebook through the CUAHSI JupyterHub:
1. Click "Open with..." at the top of the resource and select the CUAHSI JupyterHub. You may need to sign into CUAHSI JupyterHub using your HydroShare credentials.
2. Select 'Python 3.8 - Scientific' as the server and click Start.
3. From your JupyterHub directory, click on the ExampleNotebook.ipynb file.
4. Execute each cell in the code by clicking the Run button.

Coverage

Spatial

Coordinate System/Geographic Projection:

WGS 84 EPSG:4326

Coordinate Units:

Decimal degrees

Place/Area Name:

Logan River at Main Street

Longitude

-111.8352°

Latitude

41.7211°

Content

README.md

pyhydroqc Sensor Data QC: Single Site Example

File Organization and Description

Example Notebook: pyhydroqc_example.ipynb

This notebook contains the code to apply functions from anomaly detection and correction methods contained in the pyhydroqc package to data from four sensors (water temperature, specific conductance, pH, dissolved oxygen) at a single site. The script calls functions to perform the following steps:

Retrieve data.
Perform rules-based anomaly detection and correction.
Attempt to identify calibration events, determine calibration gap values, and perform linear drift correction.
Implement model workflow functions for five models (ARIMA, LSTM univariate, LSTM univariate bidirectional, LSTM multivariate, and LSTM multivariate bidirectional). These functions develop models, generate estimates, determine dynamic thresholds, and compare estimates to raw data to detect anomalies.

This application script refers to parameters stored in the parameters file.

Observed Data: MS2017.csv

This file contains time series of raw sensor measurements. The file includes observations of four aquatic variables for a single year (2017) from a single site (Logan River at Main Street). The data for each variable is in a separate column. Columns are:

datetime: Date and time of each observation.
temp: water temperature, degrees C
cond: specific conductance, μS/cm
ph: pH, standard units
do: dissolved oxygen, mg/L
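
A file with this layout can be read into a time-indexed table with pandas. The sketch below assumes the header row uses the column names listed above. It reads from an in-memory sample so that it runs without the data file; the two rows, their values, and their timestamps are invented for illustration, and `load_observations` is a hypothetical helper rather than part of pyhydroqc.

```python
import io
import pandas as pd

# Synthetic rows that mimic the documented column layout (values are invented).
sample = io.StringIO(
    "datetime,temp,cond,ph,do\n"
    "2017-01-01 00:00:00,1.2,380.0,8.4,12.1\n"
    "2017-01-01 00:15:00,1.1,381.0,8.4,12.2\n"
)

def load_observations(source):
    """Read sensor observations and index them by timestamp."""
    return pd.read_csv(source, parse_dates=["datetime"], index_col="datetime")

df = load_observations(sample)
```

Passing the path to MS2017.csv in place of `sample` should work the same way if the header assumption holds.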

Calibration Dates: MainStreet_cond_calib_dates.csv, MainStreet_ph_calib_dates.csv, MainStreet_do_calib_dates.csv

These files contain a list of dates for sensor calibrations. These dates are used by the script to determine gap values and perform linear drift correction. The dates were determined from technicians' records and field notes. There is a separate file for each calibrated sensor (specific conductance (cond), pH (ph), dissolved oxygen (do)). Each file contains the following columns:

start: Start date for the period of drift. This is the first point to be corrected by linear drift correction.
end: End date for the period of drift corresponding to a calibration event.
gap: When correction was performed by technicians, the data were shifted by this value. This value is not used in the script.
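
Linear drift correction distributes a gap value across the drift period in proportion to elapsed time. The sketch below shows that arithmetic only. How the script determines the gap is not covered here, so `gap` is simply an input, and the sign convention (the full gap is added at `end`), the function name, and the example series are assumptions.

```python
import numpy as np
import pandas as pd

def linear_drift_correction(series, start, end, gap):
    """Add a correction that ramps linearly from 0 at `start` to `gap` at `end`."""
    if end <= start:
        raise ValueError("end must be later than start")
    corrected = series.astype(float).copy()
    mask = (series.index >= start) & (series.index <= end)
    # Fraction of the drift period elapsed at each observation (0 to 1).
    elapsed = np.asarray((series.index[mask] - start) / (end - start), dtype=float)
    corrected[mask] = series[mask] + gap * elapsed
    return corrected

# Hypothetical series that drifts upward by 1 unit per day.
index = pd.date_range("2017-03-01", periods=5, freq="D")
raw = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0], index=index)
fixed = linear_drift_correction(raw, index[0], index[4], gap=-4.0)
```

In this example the correction removes the upward trend, leaving a constant value of 100 while the original series is unchanged.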

Parameters Reference: parameters.py

For reference, the parameters file includes assignments of parameters called by the script for the models. Parameters are defined for each site and sensor. LSTM parameters are consistent across sites and variables. ARIMA hyperparameters are specific to each site/sensor combination. Other parameters are used for rules-based anomaly detection, determining dynamic thresholds, and widening anomalous events.

The code provided in this resource was developed using Python 3.7.0. The following Python packages are required for running the scripts: pandas 1.1.5, matplotlib 3.4.2, pyhydroqc 0.0.4.

Related Resources

The content of this resource references: Jones, A. S., T. Jones, J. S. Horsburgh (2022). Supporting data and tools for "Toward automating post processing of aquatic sensor data", HydroShare, https://doi.org/10.4211/hs.a6ea89ae20354e39b3c9f1228997e27a
This resource is referenced by: Jones, A.S., Jones, T.L., Horsburgh, J.S. (2022). Toward automated post processing of aquatic sensor data, Environmental Modelling and Software, https://doi.org/10.1016/j.envsoft.2022.105364

Credits

Funding Agencies

This resource was created using funding from the following sources:

Agency Name	Award Title	Award Number
National Science Foundation	Collaborative Research: Elements: Advancing Data Science and Analytics for Water (DSAW)	1931297

How to Cite

Jones, A. S. (2022). pyhydroqc Sensor Data QC: Single Site Example, HydroShare, https://doi.org/10.4211/hs.92f393cbd06b47c398bdd2bbb86887ac

This resource is shared under the Creative Commons Attribution CC BY.

http://creativecommons.org/licenses/by/4.0/
