| Field | Value |
|---|---|
| Owners: | This resource does not have an owner who is an active HydroShare user. Contact CUAHSI (help@cuahsi.org) for information on this resource. |
| Type: | Resource |
| Storage: | The size of this resource is 629.0 MB |
| Created: | Nov 12, 2024 at 6:59 p.m. (UTC) |
| Last updated: | Apr 14, 2026 at 6:30 p.m. (UTC) |
| Content types: | CSV Content |
| Sharing Status: | Public |
Abstract
The DROMEDARY US dataset is a publicly accessible collection that encompasses 3,246 basins throughout the contiguous United States. It combines several sources: the Gages II dataset (Falcone, 2011), the Cropland Data Layer from the CropScape web platform (Han et al., 2012), meteorological data from the Daymet dataset (Thornton et al., 2021), and streamflow data published by the USGS, accessible via their data retrieval tool (De Cicco et al., 2024).
This dataset was created to train Long Short-Term Memory (LSTM) neural networks to predict daily streamflow time series at the outlets of these basins. It aims to provide a comprehensive sample of basins across various hydroclimatological contexts in the US, including those that have undergone significant changes in land use and land cover.
The dataset is divided into static attributes and time series. The time series are stored in .nc files, with the 3,246 .nc files compressed into an archive that is split into three parts to facilitate uploading.
More details are provided in the article "Using Long Short-Term Memory Neural Networks to Assess Streamflow Alteration from Land-Use and Land Cover Changes: Application to Fallowed Land Across the United States" by Baptiste Francois, Samson Zhilyaev and Casey Brown, which has been submitted to Water Resources Research. The full reference will be added if and when the article is accepted for publication.
Falcone, J., 2011. GAGES-II: Geospatial Attributes of Gages for Evaluating Streamflow. Reston, VA. https://doi.org/10.3133/70046617
De Cicco, L.A., Hirsch, R.M., Lorenz, D., Watkins, D., Johnson, M., 2024. dataRetrieval: R packages for discovering and retrieving water data available from Federal hydrologic web services, v.2.7.15. https://doi.org/10.5066/P9X4L3GE
Thornton, P.E., Shrestha, R., Thornton, M., Kao, S.-C., Wei, Y., Wilson, B.E., 2021. Gridded daily weather data for North America with comprehensive uncertainty quantification. Sci. Data 8, 190. https://doi.org/10.1038/s41597-021-00973-0
Han, W., Yang, Z., Di, L., Mueller, R., 2012. CropScape: A Web service based application for exploring and disseminating US conterminous geospatial cropland data products for decision support. Comput. Electron. Agric. 84, 111–123. https://doi.org/10.1016/j.compag.2012.03.005
Content
README.md
LSTM input data: DROMEDARY basin sample with dynamic land use (Francois et al., 2026)
This resource contains NeuralHydrology-style inputs for continental-U.S. streamflow modeling with time-varying cropland/land-cover (CDL) fractions and static basin attributes. It supports the LSTM experiments that pair DayMet meteorology, USGS streamflow, Gages II basin descriptors, and USDA NASS Cropland Data Layer (CDL) categories aggregated to each basin.
Contents overview
| Path | Description |
|---|---|
| `time_series/` | One NetCDF (`.nc`) file per basin: daily forcings, CDL-derived land-use fractions, and observed discharge. |
| `attributes/` | `static_attributes.csv`: one row per basin with Gages II–based static features, climatology derived from DayMet (2008–2023), and a single-year (2015) CDL snapshot used as static covariates. |
The basin list included here has 3246 gauges (see time_series/list_DROMEDARY_basins.txt).
time_series/
Files
- `{USGS_ID}.nc`: NetCDF-4 dataset for basin `USGS_ID` (8-digit USGS streamgage identifier, zero-padded where applicable). There is one file per basin in the sample.
- `list_DROMEDARY_basins.txt`: Text list of all basin IDs (one ID per line), matching the expected `.nc` filenames without the extension.
- `list_all_gages.py`: Optional helper script. If run from inside `time_series/` after all `.nc` files are present, it scans `*.nc` and regenerates `list_DROMEDARY_basins.txt`.
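The relationship between the `.nc` filenames and the basin list can be sketched as follows. This is not the shipped `list_all_gages.py`, whose contents are not shown here; the function names and the sorted output order are assumptions made for illustration.

```python
from pathlib import Path


def write_basin_list(time_series_dir, list_name="list_DROMEDARY_basins.txt"):
    """Scan a folder for *.nc files and write their stems, one ID per line."""
    folder = Path(time_series_dir)
    # Path.stem drops the ".nc" extension; IDs stay strings, so leading zeros survive.
    basin_ids = sorted(p.stem for p in folder.glob("*.nc"))
    (folder / list_name).write_text("\n".join(basin_ids) + "\n")
    return basin_ids


def basin_file(time_series_dir, usgs_id):
    """Build the expected NetCDF path for a gauge, zero-padding the ID to 8 digits."""
    return Path(time_series_dir) / f"{str(usgs_id).zfill(8)}.nc"
```

Keeping IDs as strings throughout avoids losing the leading zero that many USGS identifiers carry.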
Temporal coverage and coordinate
- Time dimension: `date`, with daily timestamps from 2008-01-01 through 2023-12-31 (inclusive), aligned across variables.
- Spatial unit: Each file corresponds to a single gaged catchment; variables are basin averages (meteorology from basin-averaged DayMet; CDL classes as percent of basin area).
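As a quick sanity check, the length of a complete daily calendar over the stated period can be computed as below. This README does not say how leap days are handled in the files, so treat the number as the expected size of a full calendar axis and compare it against the actual length of `date` in a file.

```python
from datetime import date

start, end = date(2008, 1, 1), date(2023, 12, 31)
n_days = (end - start).days + 1  # +1 because both endpoints are included
print(n_days)  # 5844
```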
Variables in each NetCDF
Meteorology (DayMet, basin average) — variable names in the file:
| Variable | Description (summary) |
|---|---|
| `dayl` | Day length |
| `prcp` | Precipitation |
| `srad` | Shortwave radiation |
| `swe` | Snow water equivalent |
| `tmax`, `tmin` | Maximum / minimum air temperature |
| `vp` | Vapor pressure |
| `pet` | Potential evapotranspiration |
Units follow the DayMet product conventions for gridded DayMet variables (see the DayMet documentation for exact units and definitions).
CDL-derived land use / land cover (daily time series) — each field is the percent of basin area in that CDL aggregate class (the classes sum to 100% at each time step). Values come from annual CDL summaries; on the daily time axis they stay fixed for long stretches and change roughly around the turn of the calendar year (end of December into January). Variable names use underscores (e.g. Grassland_Pasture, Developed_Open_Low for “Grassland/Pasture”, “Developed Open/Low”).
Corn, Cotton, Rice, Sorghum, Soybeans, Oilseed, Barley, Spring_Wheat, Winter_Wheat, Other_Cereals, Alfalfa, Other_Hay, Nuts, Peas_Beans, Tree_Crops, Melons, Berries, Herbs, Roots, Vegetables, Double_Crops, Aquaculture, Fallow, Developed_Open_Low, Developed_Med_High, Forest, Wetlands, Shrubland, Grassland_Pasture, Open_Water, Perennial_Ice_Snow, Barren
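The step-like behavior of the CDL fields on the daily axis can be illustrated with a minimal sketch. It assumes the value switches exactly on 1 January, which is a simplification of "roughly around the turn of the calendar year"; the basin, the three classes, and the percentages are hypothetical.

```python
from datetime import date, timedelta


def daily_axis(start, end):
    """Every calendar day from start through end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def annual_to_daily(annual_percent, days):
    """Hold each year's value constant on every day of that year."""
    return [annual_percent[d.year] for d in days]


# Hypothetical basin with only three classes, as percent of basin area per year.
annual = {
    "Corn": {2008: 30.0, 2009: 35.0},
    "Fallow": {2008: 10.0, 2009: 5.0},
    "Forest": {2008: 60.0, 2009: 60.0},
}

days = daily_axis(date(2008, 12, 30), date(2009, 1, 2))
daily = {name: annual_to_daily(values, days) for name, values in annual.items()}

print(daily["Corn"])  # [30.0, 30.0, 35.0, 35.0]

# The classes should total 100 percent at every time step.
totals = [sum(series[i] for series in daily.values()) for i in range(len(days))]
print(totals)  # [100.0, 100.0, 100.0, 100.0]
```

The same sum-to-100 check can be applied to the real files as a basic integrity test, allowing for a small floating-point tolerance.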
Streamflow (target)
| Variable | Description |
|---|---|
| `QObs` | Observed discharge in millimeters per day (mm d⁻¹), computed from USGS cubic feet per second using the basin drainage area from Gages II (`DRAIN_SQKM`) and standard unit conversions used in the project’s NeuralHydrology preprocessing. |
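The conversion from cubic feet per second to an area-normalized depth can be sketched as below. This is the standard unit arithmetic (volume per day divided by basin area), not a copy of the project's preprocessing code, and the function name is hypothetical.

```python
CUBIC_METERS_PER_CUBIC_FOOT = 0.3048 ** 3  # exact, since 1 ft = 0.3048 m
SECONDS_PER_DAY = 86_400


def cfs_to_mm_per_day(q_cfs, drain_sqkm):
    """Convert discharge in ft^3/s to runoff depth in mm/day over the basin."""
    volume_m3_per_day = q_cfs * CUBIC_METERS_PER_CUBIC_FOOT * SECONDS_PER_DAY
    area_m2 = drain_sqkm * 1e6  # 1 km^2 = 1e6 m^2
    depth_m_per_day = volume_m3_per_day / area_m2
    return depth_m_per_day * 1000.0  # m to mm


# Example: 100 cfs from a 100 km^2 basin is about 2.45 mm/day.
print(round(cfs_to_mm_per_day(100.0, 100.0), 4))  # 2.4466
```

Expressing discharge as depth makes basins of very different size comparable, which is why the drainage area enters the conversion.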
attributes/
File: static_attributes.csv
- Rows: One per basin; the first column is the USGS basin ID (same identifier as the `{USGS_ID}.nc` filenames). When reading with pandas, use `index_col=0` and treat IDs as strings (zero-pad to 8 digits if needed).
- Columns: Concatenation of:
  - Gages II basin characteristics (e.g. drainage area, gage coordinates, dam density, withdrawals, hydro modification and morphology indices, elevation, slope, soil hydrologic group fractions `HGA`…`HGVAR`, and encoded basin `CLASS`).
  - Climatology computed from DayMet basin-average daily time series over 2008–2023: long-term mean annual precipitation and PET, aridity index (PET / precipitation), precipitation seasonality (`PREC_SEAS`), snow fraction (`SNOW_FRAC`), and metrics of high/low precipitation frequency and duration (`HPF`, `HPD`, `LPF`, `LPD`).
  - CDL snapshot for a single reference year (2015): the same cropland/land-cover class columns as in the dynamic NetCDF files, representing percent of basin area for that year (used as static inputs in configurations that do not feed the full daily CDL stack).
Column names match the headers in the CSV (e.g. DRAIN_SQKM, LAT_GAGE, LNG_GAGE, DDENS_2009, … through the CDL aggregates). For soil-group codes (HGA, HGB, …), percentages describe the share of each NRCS hydrologic soil group within the basin (see Gages II documentation).
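A minimal sketch of loading the attributes with string IDs is shown below, assuming pandas is installed. The function name is hypothetical; the zero-padding is applied after reading because pandas may parse the ID column as integers and drop leading zeros.

```python
import pandas as pd


def load_static_attributes(csv_path):
    """Read the static attributes with the basin ID as a zero-padded string index."""
    df = pd.read_csv(csv_path, index_col=0)
    # Restore any leading zeros lost if the IDs were parsed as integers.
    df.index = df.index.astype(str).str.zfill(8)
    return df
```

A row can then be looked up with the same string used in the NetCDF filename, for example `attrs.loc["01013500", "DRAIN_SQKM"]` for a basin with that ID.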
Data provenance (short)
- Streamflow: USGS NWIS (processed Gages II–style daily extracts used in the project).
- Meteorology: DayMet, spatially averaged to the basin (basin-mean CSVs / workflow described in the repository).
- Static physiographic attributes: Gages II basin characteristics (USGS Gages II).
- Land cover: USDA NASS Cropland Data Layer (CDL), aggregated to catchments and organized into the crop/land-cover classes above (https://www.sciencedirect.com/science/article/abs/pii/S0168169912000798?via%3Dihub).
Using this dataset with NeuralHydrology
- `dataset: generic`
- `data_dir` pointing to this dataset root
- `dynamic_inputs`, `static_attributes`, and `target_variables` consistent with the variables above (training may use a subset of the columns present in the files).
Point data_dir to the folder that contains time_series/ and attributes/ as siblings.
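Before training, the folder layout can be checked with a short sketch like the one below. The helper name is hypothetical, and the `config` dictionary only mirrors the keys named in this README with an illustrative subset of variables; a real NeuralHydrology run needs further settings (basin lists, date ranges, model options) that are not covered here.

```python
from pathlib import Path


def missing_dataset_parts(data_dir):
    """Return the expected entries that are absent under data_dir."""
    root = Path(data_dir)
    expected = [
        root / "time_series",
        root / "attributes" / "static_attributes.csv",
    ]
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]


# Keys named in this README; the chosen variable subsets are illustrative only.
config = {
    "dataset": "generic",
    "data_dir": "/path/to/dataset_root",  # hypothetical location
    "dynamic_inputs": ["prcp", "tmax", "tmin", "srad", "vp", "Fallow"],
    "static_attributes": ["DRAIN_SQKM", "LAT_GAGE", "LNG_GAGE"],
    "target_variables": ["QObs"],
}
```

An empty list from `missing_dataset_parts(config["data_dir"])` indicates that `time_series/` and `attributes/` sit side by side as expected.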
Citation and contact
When publishing this resource (e.g. on HydroShare), cite the associated paper / DOI and this dataset’s HydroShare identifier once it is assigned. For questions about field definitions or preprocessing, email baptiste@tova.earth
Credits
Contributors
People or Organizations that contributed technically, materially, financially, or provided general support for the creation of the resource's content but are not considered authors.
| Name | Organization | Address | Phone | Author Identifiers |
|---|---|---|---|---|
| Casey Brown | University of Massachusetts, Amherst | | | |
| Samson Zhilyaev | University of Massachusetts, Amherst | | | |
| Baptiste Francois | Tova Earth Inc. | | | |
How to Cite
This resource is shared under the Creative Commons Attribution CC BY.
http://creativecommons.org/licenses/by/4.0/