Water Quality Portal (WQX 2.2 Profile Backup) v0.1.0


Authors:
Owners: This resource does not have an owner who is an active HydroShare user. Contact CUAHSI (help@cuahsi.org) for information on this resource.
Type: Resource
Storage: The size of this resource is 39.5 GB
Created: Feb 27, 2025 at 5:09 a.m.
Last updated: Feb 27, 2025 at 5:40 a.m.
Citation: See how to cite this resource
Sharing Status: Public
Views: 112
Downloads: 0

Abstract

This resource is a backup of the Water Quality Portal (WQP), a database of water quality samples from the U.S. Geological Survey and U.S. Environmental Protection Agency. This resource includes:

1. An R script (run.R) that downloads the data from WQP web services. 
2. A hierarchical archive of zipped CSV files, organized by geographic area (in general, Country/State/County or equivalent)
3. A data dictionary for the Water Quality Portal csv exports

Content

README.md

Backup the Water Quality Portal

Description

This is a backup of the Water Quality Portal (Legacy WQX 2.2 profile, including USGS NWIS water quality samples through March 11, 2024, and EPA STORET water quality samples through February 2025). Files are zipped CSVs organized in directories by country (by state for the US). Each Country/State.zip file contains county directories, each holding a separate zipped CSV for each WQP "data profile":

  • Organization Data (organizations.zip)
  • Site Data Only (sites.zip)
  • Project Data (projects.zip)
  • Project Monitoring Location Weighting (weighting.zip)
  • Sample Results (Physical/Chemical) (physChem.zip)
  • Sample Results (Biological) (biological.zip)
  • Sample Results (Narrow) (narrowResult.zip)
  • Sampling Activity (activity.zip)
  • Sampling Activity Metrics (activityMetric.zip)
  • Biological Habitat Metrics (biologicalMetric.zip)
  • Result Detection Quantitation Limit Data (resultDetectionQuantitationLimit.zip)

A data dictionary covering every field in each of these profiles is also provided in WQX_Data_Dictionary.zip.
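
As a rough illustration of working with the archive, the sketch below reads one profile (for example a county's sites.zip) into a data frame using base R only. The helper name `read_profile_zip` is hypothetical, and it assumes each profile zip contains at least one CSV file, which the README does not state explicitly.

```r
# Read the first CSV found inside a zipped WQP profile (e.g. sites.zip).
# Assumption: each profile zip holds at least one .csv file.
read_profile_zip <- function(zip_path) {
  listing <- utils::unzip(zip_path, list = TRUE)
  csv_name <- listing$Name[grepl("\\.csv$", listing$Name, ignore.case = TRUE)][1]
  if (is.na(csv_name)) stop("No CSV found in ", zip_path)

  # Extract into a throwaway directory that is removed when the function exits.
  exdir <- tempfile("wqp_")
  dir.create(exdir)
  on.exit(unlink(exdir, recursive = TRUE), add = TRUE)
  utils::unzip(zip_path, files = csv_name, exdir = exdir)

  # Read every column as character so identifiers keep their leading zeros.
  utils::read.csv(file.path(exdir, csv_name),
                  colClasses = "character", check.names = FALSE)
}
```

Reading all columns as character is a deliberate choice here: state and county codes such as 06 or 001 would otherwise lose their leading zeros.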

The Non-US "Countries" are as follows:

  • FM (Federated States of Micronesia)
  • CA (Canada)
  • GT (Guatemala)
  • IN (India)
  • LE (Lake Erie)
  • LH (Lake Huron)
  • NI (Nicaragua)
  • OA (Atlantic Ocean)
  • OI (Indian Ocean)
  • OP (Pacific Ocean)
  • QO (Lake Ontario)
  • QS (Lake Superior)
  • MX (Mexico)
  • RM (Marshall Islands)
  • PS (Palau)
  • YT (Mayotte)
  • ZC (Caribbean Sea)

Script

This section describes the R script used to archive the Water Quality Portal (WQP) and the files it produces. The script systematically downloads data as zipped CSVs by the lowest administrative unit possible (typically county, but this varies by country) to minimize server timeouts and improve archive indexing, and organizes the downloads into a hierarchical directory structure.

  • Downloads data from each WQP Web Service endpoint
  • Handles both countries with and without county-level administrative divisions
  • Creates organized directory structure based on geographic hierarchy
  • Includes retry logic
  • Comprehensive logging and progress tracking
  • Rate limiting to respect API endpoints
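
To make the per-location requests concrete, here is a minimal sketch of how a request URL for one endpoint and one county could be assembled from the public WQP query parameters (countrycode, statecode, countycode, mimeType, zip). The function name `build_wqp_url` is hypothetical, and run.R may construct its requests differently.

```r
# Build a WQP web-service URL for one endpoint and one location.
# Assumption: this only illustrates the public query parameters;
# it is not taken from run.R.
build_wqp_url <- function(endpoint, country, state = NULL, county = NULL,
                          base_url = "https://www.waterqualitydata.us/data") {
  if (!is.null(county) && is.null(state)) stop("county requires state")

  params <- list(countrycode = country)
  if (!is.null(state))  params$statecode  <- paste(country, state, sep = ":")
  if (!is.null(county)) params$countycode <- paste(country, state, county, sep = ":")
  params$mimeType <- "csv"
  params$zip <- "yes"

  # Percent-encode values so that ":" becomes "%3A".
  encoded <- vapply(params,
                    function(v) utils::URLencode(v, reserved = TRUE),
                    character(1))
  paste0(base_url, "/", endpoint, "/search?",
         paste0(names(params), "=", encoded, collapse = "&"))
}

build_wqp_url("Station", "US", state = "06", county = "001")
# "https://www.waterqualitydata.us/data/Station/search?countrycode=US&statecode=US%3A06&countycode=US%3A06%3A001&mimeType=csv&zip=yes"
```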

Directory Structure

The script creates a hierarchical directory structure based on geographic divisions:

For countries with county systems (US, FM, PS, RM):

    locations/
    ├── US/
    │   ├── 06_California/
    │   │   ├── 001_Alameda/
    │   │   │   ├── sites.zip
    │   │   │   ├── organizations.zip
    │   │   │   └── ...
    │   │   └── 003_Alpine/
    │   └── 36_New_York/
    └── FM/
        └── ...

For countries without county systems:

    locations/
    ├── CA/
    │   ├── 01_Alberta/
    │   │   ├── sites.zip
    │   │   ├── organizations.zip
    │   │   └── ...
    │   └── 02_British_Columbia/
    └── MX/
        └── ...
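
The naming pattern in these trees (code, underscore, name, with spaces replaced by underscores) can be sketched as a small path helper. The function `location_dir` is hypothetical, and the rule for cleaning names is inferred from the examples above rather than taken from run.R.

```r
# Compose the archive directory for one location.
# County arguments are omitted for countries without county systems.
# Assumption: non-alphanumeric runs in names become a single underscore.
location_dir <- function(country, state_code, state_name,
                         county_code = NULL, county_name = NULL,
                         base_dir = "locations") {
  clean <- function(x) gsub("[^A-Za-z0-9]+", "_", trimws(x))
  parts <- c(base_dir, country, paste0(state_code, "_", clean(state_name)))
  if (!is.null(county_code)) {
    parts <- c(parts, paste0(county_code, "_", clean(county_name)))
  }
  do.call(file.path, as.list(parts))
}

location_dir("US", "06", "California", "001", "Alameda")
# "locations/US/06_California/001_Alameda"
location_dir("CA", "02", "British Columbia")
# "locations/CA/02_British_Columbia"
```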

Requirements

  • R >= 4.0.0
  • Required R packages:
      • tidyverse
      • httr
      • fs
      • jsonlite
      • furrr
      • progressr
      • parallelly

Install dependencies:

    install.packages(c("tidyverse", "httr", "fs", "jsonlite", "furrr", "progressr", "parallelly"))

Usage

  1. Clone the repository:

         git clone https://github.com/ksonda/wqp-backup.git
         cd wqp-backup

  2. Run the script:

         source("run.R")

The script will:

  1. Create the complete directory structure
  2. Download data for each endpoint in sequence
  3. Process locations in parallel within each endpoint
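
The control flow described here (endpoints one after another, locations mapped within each endpoint) might look roughly like the sketch below. `run_backup`, `download_one`, and `map_fn` are hypothetical names; `lapply` stands in for the parallel mapper, and a function such as `furrr::future_map` could be passed as `map_fn` instead.

```r
# Hypothetical orchestration sketch: endpoints run in sequence, and the
# locations for each endpoint are handed to `map_fn`.
run_backup <- function(endpoints, locations, download_one, map_fn = lapply) {
  results <- list()
  for (endpoint in endpoints) {
    results[[endpoint]] <- map_fn(locations, function(loc) {
      download_one(endpoint, loc)
    })
  }
  results
}

# Usage with a stand-in downloader that only reports what it would fetch.
res <- run_backup(
  endpoints = c("Station", "Result"),
  locations = c("US:06:001", "US:06:003"),
  download_one = function(endpoint, loc) paste(endpoint, loc)
)
```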

Configuration

The tool's behavior can be customized by modifying the CONFIG list in the script:

    CONFIG <- list(
      base_url = "https://www.waterqualitydata.us/data",
      endpoints = list(...),
      base_dir = "locations",
      location_types = list(
        county_countries = c("US", "FM", "PS", "RM"),
        state_countries = c("CA", "MX")
      ),
      parallel = list(
        workers = parallelly::availableCores() - 1,  # Use all cores except one
        chunk_size = 100                             # Number of locations to process in each chunk
      )
    )

  • base_url: Base URL for the Water Quality Portal API
  • endpoints: List of endpoints and their configurations
  • base_dir: Base directory for downloaded data
  • location_types: Geographic division configurations
  • parallel: Parallel processing settings
      • workers: Number of parallel workers to use
      • chunk_size: Number of locations to process in each chunk
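
To illustrate what the parallel settings control, the sketch below splits a vector of locations into chunks of at most `chunk_size` elements and derives a worker count that never drops below one. Both helper names are hypothetical; the guard is worth noting because "all cores except one" evaluates to zero on a single-core machine.

```r
# Split locations into chunks of at most `chunk_size` elements.
chunk_locations <- function(locations, chunk_size = 100) {
  stopifnot(chunk_size >= 1)
  if (length(locations) == 0) return(list())
  unname(split(locations, ceiling(seq_along(locations) / chunk_size)))
}

# Use all cores except one, but never fewer than one worker.
safe_workers <- function(available_cores) {
  max(1L, as.integer(available_cores) - 1L)
}

chunks <- chunk_locations(1:250, chunk_size = 100)
lengths(chunks)
# 100 100 50
```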

Logging

  • Failed downloads logged to download_errors.log
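
A failure log of this kind can be kept with a few lines of base R. The sketch below appends one tab-separated line per failed download; the helper name `log_download_error` and the line layout (timestamp, endpoint, location, message) are assumptions, not the format run.R actually writes.

```r
# Append one timestamped, tab-separated line per failed download.
# Assumption: the column layout is illustrative only.
log_download_error <- function(endpoint, location, message,
                               log_file = "download_errors.log") {
  line <- sprintf("%s\t%s\t%s\t%s",
                  format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
                  endpoint, location, message)
  cat(line, "\n", sep = "", file = log_file, append = TRUE)
  invisible(line)
}
```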

Rate Limiting

To respect the API's resources:

  • 1-second delay between requests
  • Exponential backoff on failures
  • Maximum of 3 retry attempts per download
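
The retry policy above can be sketched as a small wrapper that retries a download function and doubles the wait after each failure. `with_retry` is a hypothetical helper, and using 1 second as the base backoff delay is an assumption; the README does not give the backoff schedule used by run.R.

```r
# Call `fetch()` up to `max_attempts` times, doubling the wait after each
# failure (1 s, 2 s, ... with the defaults). `fetch` is any zero-argument
# function that signals an error on failure. `sleep` is injectable so the
# policy can be exercised without actually waiting.
with_retry <- function(fetch, max_attempts = 3, base_delay = 1,
                       sleep = Sys.sleep) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < max_attempts) sleep(base_delay * 2^(attempt - 1))
  }
  stop("Failed after ", max_attempts, " attempts: ", conditionMessage(result))
}
```

A caller would wrap the actual download, for example `with_retry(function() download.file(url, destfile, mode = "wb"))`, and pause for one second between successive locations.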

License

This project is licensed under the MIT License - see the LICENSE file for details.

References

  • Water Quality Portal. Washington (DC): National Water Quality Monitoring Council, United States Geological Survey (USGS), Environmental Protection Agency (EPA); 2021. https://doi.org/10.5066/P9QRKUVJ.

Related Resources

The content of this resource is derived from Water Quality Portal. Washington (DC): National Water Quality Monitoring Council, United States Geological Survey (USGS), Environmental Protection Agency (EPA); 2021. https://doi.org/10.5066/P9QRKUVJ.

How to Cite

Onda, K. (2025). Water Quality Portal (WQX 2.2 Profile Backup) v0.1.0, HydroShare, http://www.hydroshare.org/resource/7b4d4e186c6b4e888876bcb713b4dff7

This resource is shared under the Creative Commons Attribution CC BY.

http://creativecommons.org/licenses/by/4.0/