| | |
| --- | --- |
| Authors: | |
| Owners: | This resource does not have an owner who is an active HydroShare user. Contact CUAHSI (help@cuahsi.org) for information on this resource. |
| Type: | Resource |
| Storage: | 39.5 GB |
| Created: | Feb 27, 2025 at 5:09 a.m. |
| Last updated: | Feb 27, 2025 at 5:40 a.m. |
| Sharing Status: | Public |
Abstract
This resource is a backup of the Water Quality Portal (WQP), a database of water quality samples from the U.S. Geological Survey and the U.S. Environmental Protection Agency. This resource includes:
1. An R script (run.R) that downloads the data from WQP web services.
2. A hierarchical archive of zipped CSV files, organized by geographic area (in general, Country/State/County or equivalent)
3. A data dictionary for the Water Quality Portal CSV exports
Subject Keywords
Content
README.md
Backup the Water Quality Portal
Description
This is a backup of the Water Quality Portal (Legacy WQX 2.2 profile, including USGS NWIS water quality samples through March 11, 2024, and EPA STORET water quality samples through February 2025). Files are zipped CSVs organized into directories by Country (or State for the US). Each Country/State.zip file contains county directories, each holding a separate zipped CSV for each WQP "data profile":
- Organization Data (`organizations.zip`)
- Site Data Only (`sites.zip`)
- Project Data (`projects.zip`)
- Project Monitoring Location Weighting (`weighting.zip`)
- Sample Results (Physical/Chemical) (`physChem.zip`)
- Sample Results (Biological) (`biological.zip`)
- Sample Results (Narrow) (`narrowResult.zip`)
- Sampling Activity (`activity.zip`)
- Sampling Activity Metrics (`activityMetric.zip`)
- Biological Habitat Metrics (`biologicalMetric.zip`)
- Result Detection Quantitation Limit Data (`resultDetectionQuantitationLimit.zip`)
A data dictionary for all fields in each of these profiles is additionally provided in `WQX_Data_Dictionary.zip`.
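As an illustration of how the archive can be read back, the following is a minimal sketch that extracts and reads one profile CSV for a single county. The state zip path, county directory name, and the layout of files inside the zips are assumptions, not taken from this resource; only the Country/State/County organization and the profile zip names above come from the description.

```r
# Minimal sketch: read the physical/chemical results for one county back out
# of the archive. The state zip, county directory, and CSV layout are
# assumptions; adjust them to match the actual archive contents.
library(readr)

# 1. Extract one state archive (it contains one directory per county).
state_zip <- "US/06_California.zip"      # hypothetical path within the resource
state_dir <- tempfile("wqp_")
utils::unzip(state_zip, exdir = state_dir)

# 2. Inside a county directory, each data profile is its own zipped CSV.
county_zip <- file.path(state_dir, "001_Alameda", "physChem.zip")
csv_name   <- utils::unzip(county_zip, list = TRUE)$Name[1]
utils::unzip(county_zip, files = csv_name, exdir = state_dir)

phys_chem <- read_csv(file.path(state_dir, csv_name), show_col_types = FALSE)
head(phys_chem)
```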
The Non-US "Countries" are as follows:
FM (Federated States of Micronesia), CA (Canada), GT (Guatemala), IN (India), LE (Lake Erie), LH (Lake Huron), NI (Nicaragua), OA (Atlantic Ocean), OI (Indian Ocean), OP (Pacific Ocean), QO (Lake Ontario), QS (Lake Superior), MX (Mexico), RM (Marshall Islands), PS (Palau), YT (Mayotte), ZC (Caribbean Sea)
Script
An R script for archiving the Water Quality Portal (WQP), together with the resulting files. The script systematically downloads data as zipped CSVs by the lowest administrative unit possible (typically county, but this varies by country) to minimize server timeouts and improve archive indexing, organizing the output into a hierarchical directory structure. It does the following (a sketch of a single county-level request appears after this list):
- Downloads data from each WQP Web Service endpoint
- Handles countries both with and without county-level administrative divisions
- Creates organized directory structure based on geographic hierarchy
- Includes retry logic
- Comprehensive logging and progress tracking
- Rate limiting to respect API endpoints
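A minimal sketch of one such county-level request with httr is shown below. It assumes the standard public WQP web-service query parameters (statecode, countycode, zip, mimeType) and the Station/search endpoint; the endpoint choice and FIPS codes are illustrative and not taken from run.R.

```r
# Minimal sketch of one county-level download, assuming the standard WQP
# web-service parameters (statecode, countycode, zip, mimeType). The endpoint
# and FIPS codes are illustrative and not taken from run.R.
library(httr)
library(fs)

out_dir <- "locations/US/06_California/001_Alameda"
dir_create(out_dir)

resp <- GET(
  "https://www.waterqualitydata.us/data/Station/search",
  query = list(
    statecode  = "US:06",      # California
    countycode = "US:06:001",  # Alameda County
    zip        = "yes",        # request a zipped CSV
    mimeType   = "csv"
  ),
  write_disk(path(out_dir, "sites.zip"), overwrite = TRUE),
  timeout(600)
)
stop_for_status(resp)
```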
Directory Structure
The script creates a hierarchical directory structure based on geographic divisions:
For countries with county systems (US, FM, PS, RM):
```
locations/
├── US/
│   ├── 06_California/
│   │   ├── 001_Alameda/
│   │   │   ├── sites.zip
│   │   │   ├── organizations.zip
│   │   │   └── ...
│   │   └── 003_Alpine/
│   └── 36_New_York/
└── FM/
    └── ...
```
For countries without county systems:
```
locations/
├── CA/
│   ├── 01_Alberta/
│   │   ├── sites.zip
│   │   ├── organizations.zip
│   │   └── ...
│   └── 02_British_Columbia/
└── MX/
    └── ...
```
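For reference, a location directory following this hierarchy could be assembled along the lines of the sketch below; the helper shown here is illustrative and is not part of run.R.

```r
# Illustrative helper (not from run.R): build and create the directory for one
# location, following the geographic hierarchy shown above.
library(fs)

location_dir <- function(base_dir, country, state, county = NULL) {
  p <- path(base_dir, country, state)
  if (!is.null(county)) p <- path(p, county)
  dir_create(p)  # creates intermediate directories as needed
  p
}

location_dir("locations", "US", "06_California", "001_Alameda")
#> locations/US/06_California/001_Alameda
location_dir("locations", "CA", "01_Alberta")
#> locations/CA/01_Alberta
```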
Requirements
- R >= 4.0.0
- Required R packages:
- tidyverse
- httr
- fs
- jsonlite
- furrr
- progressr
- parallelly
Install dependencies:
```r
install.packages(c("tidyverse", "httr", "fs", "jsonlite", "furrr", "progressr", "parallelly"))
```
Usage
- Clone the repository:

  ```bash
  git clone https://github.com/ksonda/wqp-backup.git
  cd wqp-backup
  ```

- Run the script:

  ```r
  source("run.R")
  ```
The script will:
1. Create the complete directory structure
2. Download data for each endpoint in sequence
3. Process locations in parallel within each endpoint, as sketched below
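The sequential-over-endpoints, parallel-over-locations pattern could look roughly like the following. The `download_location()` helper, the `locations` table, and the progress wiring are placeholders, not the actual objects defined in run.R.

```r
# Rough sketch (placeholder names, not the actual run.R objects): endpoints
# are processed one at a time, while the locations for each endpoint are
# downloaded in parallel with furrr, with progress reporting via progressr.
library(future)
library(furrr)
library(progressr)

plan(multisession, workers = parallelly::availableCores() - 1)

download_endpoint <- function(endpoint, locations) {
  with_progress({
    p <- progressor(steps = nrow(locations))
    future_walk(seq_len(nrow(locations)), function(i) {
      download_location(endpoint, locations[i, ])  # hypothetical per-location worker
      p()
    })
  })
}

# Endpoints in sequence; locations within each endpoint in parallel.
for (endpoint in names(CONFIG$endpoints)) {
  download_endpoint(endpoint, locations)
}
```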
Configuration
The tool's behavior can be customized by modifying the `CONFIG` list in the script:
```r
CONFIG <- list(
  base_url = "https://www.waterqualitydata.us/data",
  endpoints = list(...),
  base_dir = "locations",
  location_types = list(
    county_countries = c("US", "FM", "PS", "RM"),
    state_countries = c("CA", "MX")
  ),
  parallel = list(
    workers = parallelly::availableCores() - 1,  # Use all cores except one
    chunk_size = 100  # Number of locations to process in each chunk
  )
)
```
- `base_url`: Base URL for the Water Quality Portal API
- `endpoints`: List of endpoints and their configurations
- `base_dir`: Base directory for downloaded data
- `location_types`: Geographic division configurations
- `parallel`: Parallel processing settings
  - `workers`: Number of parallel workers to use
  - `chunk_size`: Number of locations to process in each chunk
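To make `chunk_size` concrete, locations can be split into fixed-size groups before being handed to the parallel workers; the snippet below is illustrative only and uses made-up location identifiers rather than anything from run.R.

```r
# Illustrative only: split location identifiers into chunks of
# CONFIG$parallel$chunk_size so each pass handles at most 100 locations.
chunk_size   <- 100
location_ids <- sprintf("US:06:%03d", seq(1, 115, by = 2))  # made-up county FIPS codes
chunks       <- split(location_ids, ceiling(seq_along(location_ids) / chunk_size))
length(chunks)  # 1 chunk here; 3,143 US counties would give 32 chunks
```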
Logging
- Failed downloads logged to `download_errors.log`
Rate Limiting
To respect the API's resources:
- 1-second delay between requests
- Exponential backoff on failures
- Maximum of 3 retry attempts per download
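A download wrapper implementing these rules might look like the sketch below; the function and argument names are illustrative and not the actual code in run.R.

```r
# Illustrative sketch (not the actual run.R code): download one file with a
# 1-second pause before each request, exponential backoff between failed
# attempts, at most 3 attempts, and an entry in download_errors.log when
# every attempt fails.
library(httr)

download_with_retry <- function(url, query, dest, max_attempts = 3) {
  for (attempt in seq_len(max_attempts)) {
    Sys.sleep(1)  # base rate limit: 1 second between requests
    resp <- tryCatch(
      GET(url, query = query, write_disk(dest, overwrite = TRUE), timeout(600)),
      error = function(e) NULL
    )
    if (!is.null(resp) && status_code(resp) == 200) return(TRUE)
    Sys.sleep(2 ^ attempt)  # exponential backoff: 2, 4, 8 seconds
  }
  cat(sprintf("%s FAILED %s\n", format(Sys.time()), dest),
      file = "download_errors.log", append = TRUE)
  FALSE
}
```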
License
This project is licensed under the MIT License - see the LICENSE file for details.
References
- Water Quality Portal. Washington (DC): National Water Quality Monitoring Council, United States Geological Survey (USGS), Environmental Protection Agency (EPA); 2021. https://doi.org/10.5066/P9QRKUVJ.
Related Resources
The content of this resource is derived from: Water Quality Portal. Washington (DC): National Water Quality Monitoring Council, United States Geological Survey (USGS), Environmental Protection Agency (EPA); 2021. https://doi.org/10.5066/P9QRKUVJ.
How to Cite
This resource is shared under the Creative Commons Attribution (CC BY 4.0) license.
http://creativecommons.org/licenses/by/4.0/