| Authors: | |
|---|---|
| Owners: | This resource does not have an owner who is an active HydroShare user. Contact CUAHSI (help@cuahsi.org) for information on this resource. |
| Type: | Resource |
| Storage: | The size of this resource is 6.4 GB |
| Created: | Aug 11, 2025 at 10:11 p.m. (UTC) |
| Last updated: | Dec 01, 2025 at 6:51 p.m. (UTC) |
| Citation: | See how to cite this resource |
| Content types: | CSV Content |
| Sharing Status: | Public |
| Views: | 579 |
| Downloads: | 50 |
Abstract
The Regional Groundwater Database for Arequipa and Surroundings is the first comprehensive, integrated groundwater dataset for southern Peru, compiling 1222 records (1969–2024) from 606 documents. It covers wells, springs, and piezometers, together with hydrogeological parameters such as hydraulic conductivity, transmissivity, and storage coefficient. Designed to overcome data fragmentation, it enables trend analysis, groundwater modeling, and sustainable yield assessments. The dataset is publicly available via HydroShare, with code on GitHub.
------------------------------------------
📌 Note on Dataset Versions:
- The file `arequipa_all_final_11_10_25_final_with_snirh.xlsx` includes all the above data PLUS additional records sourced from the public database at [snirh.ana.gob.pe](https://snirh.ana.gob.pe), specifically for the following departments:
Apurímac, Arequipa, Ayacucho, Cusco, Ica, Moquegua, Puno, and Tacna.
➤ This extended version is intended for broader analysis and cross-validation with official national hydrological records.
Content
README.txt
--- START OF FILE README.txt ---
README: Regional Groundwater Database for Arequipa and Surroundings
Dataset Title: Regional Groundwater Database from Unstructured Sources Based on an Integrated Optical Character Recognition and Large Language Model Workflow for Arequipa and Surroundings
Authors: Héctor L. Venegas-Quiñones, Madeleine Guillen, et al.
Repository Platform: HydroShare
Date of Last Update: November 2025
1. OVERVIEW
-----------------
This repository contains the first comprehensive, open-access groundwater database for the Arequipa region and surrounding areas in Southern Peru. The dataset consolidates 10,813 records spanning the period 1969–2024.
The data was derived from two primary streams:
1. "Gray Literature" Mining: 1,624 hydrogeological records (hydraulic conductivity, transmissivity, storage coefficient, lithology, etc.) extracted from 606 technical reports and academic theses using a semi-automated AI workflow (OCR + LLM) with 100% manual verification.
2. National Databases: 9,017 depth-to-water records sourced from Peru's National Water Resources Information System (SNIRH).
2. FILE STRUCTURE AND ACCESS
-----------------
IMPORTANT: Because of its size, the database is compressed into a multi-part 7-Zip archive. The archive contains the full raw data extraction in Excel format.
You must download ALL parts to access the data.
Files in this Repository:
- database.7z.001: Part 1 of the compressed archive.
- database.7z.002: Part 2 of the compressed archive.
- README.txt: This technical documentation file.
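Before extracting, it can help to confirm that every part listed above is present in the download folder. The following is a minimal Python sketch, not part of the dataset's tooling: the function names `find_archive_parts` and `build_extract_command` are hypothetical, the default of two parts is taken from the file list above, and the command assumes the 7-Zip command-line tool is installed and available as `7z`.

```python
import re
from pathlib import Path


def find_archive_parts(folder, base_name="database.7z", expected_count=2):
    """Return the split-archive parts in `folder`, ordered by part number.

    `expected_count` defaults to 2 because this README lists two parts.
    Raises FileNotFoundError naming any part that is missing.
    """
    pattern = re.compile(re.escape(base_name) + r"\.(\d{3})$")
    found = {}
    for path in Path(folder).iterdir():
        match = pattern.match(path.name)
        if match:
            found[int(match.group(1))] = path
    missing = [n for n in range(1, expected_count + 1) if n not in found]
    if missing:
        names = ", ".join(f"{base_name}.{n:03d}" for n in missing)
        raise FileNotFoundError(f"Missing archive part(s): {names}")
    return [found[n] for n in range(1, expected_count + 1)]


def build_extract_command(parts, output_dir="."):
    """Build a 7-Zip command that extracts starting from the first part."""
    # `7z x <first part> -o<dir>` extracts with full paths; 7-Zip reads the
    # remaining volumes (.002, ...) from the same folder automatically.
    return ["7z", "x", str(parts[0]), f"-o{output_dir}"]
```

The returned command list can be passed to `subprocess.run(command, check=True)`; the check above only confirms that the files exist, not that the downloads are complete or uncorrupted.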
Instructions to Open:
1. Download both "database.7z.001" and "database.7z.002" to the same folder on your computer.
2. You need software capable of handling split archives (e.g., 7-Zip for Windows, Keka for macOS).
3. Right-click on the first file ("database.7z.001") and select "Extract Here".
4. The software will automatically combine Part 1 and Part 2.
5. This will extract the main Master File:
"arequipa_all_final_11_10_25_final_with_snirh.xlsx"
3. DATA DICTIONARY (VARIABLE TRANSLATION)
-----------------
The column headers in the Excel file are in Spanish. The tables below provide the English translation and technical description.
A. SOURCE IDENTIFICATION (TRACEABILITY)
Spanish Header | English Translation | Description
------------------- | ---------------------- | --------------------------------------------------------
ID_Registro | Record ID | Unique identifier.
Codigo_Archivo | File Code | Internal code for the source PDF (e.g., ANA0000209).
Fuente_Completa | Full Source | Name of the institution or thesis repository.
Titulo_Documento | Document Title | Exact title of the original report/thesis.
Autor_es | Author(s) | Authors of the study.
Pagina_Referencia | Page Reference | Specific page(s) in the PDF where data was found.
B. SPATIAL & TEMPORAL
Spanish Header | English Translation | Description
------------------- | ---------------------- | --------------------------------------------------------
Fecha_Medicion | Measurement Date | YYYY-MM-DD.
Latitud_WGS84 | Latitude (WGS84) | Standardized decimal latitude.
Longitud_WGS84 | Longitude (WGS84) | Standardized decimal longitude.
Cota_msnm | Elevation | Meters above sea level.
CUENCA | Basin | Hydrographic unit name.
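As an illustration of how the spatial and temporal fields above might be sanity-checked after loading, here is a short Python sketch. It assumes each row is available as a dictionary keyed by the Spanish headers, that Fecha_Medicion is stored as a YYYY-MM-DD string, and that coordinates are numeric; the helper names are hypothetical and the range checks are generic WGS84 bounds, not rules defined by the dataset.

```python
from datetime import datetime


def parse_measurement_date(text):
    """Parse a Fecha_Medicion string in YYYY-MM-DD form into a date."""
    return datetime.strptime(text.strip(), "%Y-%m-%d").date()


def is_valid_wgs84(latitude, longitude):
    """Check that decimal coordinates fall inside the valid WGS84 ranges."""
    return -90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    try:
        parse_measurement_date(record["Fecha_Medicion"])
    except (ValueError, AttributeError):
        # ValueError: wrong format; AttributeError: value is not a string.
        problems.append("Fecha_Medicion is not a YYYY-MM-DD date")
    if not is_valid_wgs84(record["Latitud_WGS84"], record["Longitud_WGS84"]):
        problems.append("coordinates are outside the WGS84 range")
    return problems
```

Records with empty cells would need extra handling before these checks, since missing values are likely in a compilation of this kind.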
C. HYDROGEOLOGICAL PARAMETERS
Spanish Header | English Translation | Unit
---------------------------- | ---------------------------- | --------
Nivel_Freatico | Water Table Depth | Meters (m)
Espesor_Saturado | Saturated Thickness | Meters (m)
Conductividad_Hidraulica_K | Hydraulic Conductivity (K) | See Unidad_K
Transmisividad_T | Transmissivity (T) | See Unidad_T
Coeficiente_Almacenamiento_S | Storage Coefficient (S) | Dimensionless
Litologia | Lithology | Text
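The three tables above can be captured as a lookup so that records can be relabeled with English headers. This is a minimal Python sketch: the mapping is copied from the tables, while the names `HEADER_TRANSLATIONS` and `translate_record` are hypothetical. Columns that the tables mention but do not translate (such as Unidad_K and Unidad_T) are passed through unchanged.

```python
# Spanish header -> English translation, as listed in the data dictionary.
HEADER_TRANSLATIONS = {
    # A. Source identification
    "ID_Registro": "Record ID",
    "Codigo_Archivo": "File Code",
    "Fuente_Completa": "Full Source",
    "Titulo_Documento": "Document Title",
    "Autor_es": "Author(s)",
    "Pagina_Referencia": "Page Reference",
    # B. Spatial and temporal
    "Fecha_Medicion": "Measurement Date",
    "Latitud_WGS84": "Latitude (WGS84)",
    "Longitud_WGS84": "Longitude (WGS84)",
    "Cota_msnm": "Elevation",
    "CUENCA": "Basin",
    # C. Hydrogeological parameters
    "Nivel_Freatico": "Water Table Depth",
    "Espesor_Saturado": "Saturated Thickness",
    "Conductividad_Hidraulica_K": "Hydraulic Conductivity (K)",
    "Transmisividad_T": "Transmissivity (T)",
    "Coeficiente_Almacenamiento_S": "Storage Coefficient (S)",
    "Litologia": "Lithology",
}


def translate_record(record):
    """Return a copy of `record` with known Spanish headers translated."""
    return {HEADER_TRANSLATIONS.get(key, key): value for key, value in record.items()}
```

If the Excel file is loaded with a dataframe library, the same dictionary can typically be passed to that library's column-renaming function.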
4. EXAMPLE: HOW TO TRACE DATA TO THE SOURCE
-----------------
To verify the origin of any data point in the "database.7z" files, use the 'Titulo_Documento' and 'Pagina_Referencia' columns.
Example (Row 2 in Excel):
- ID_Registro: 1
- Parameter: Nivel_Freatico (176.05 masl). Transmisividad_T is not reported in this row; a measured T value would appear in that column.
- Source File: ANA0000209_1_compressed.pdf
- Document Title: "INVENTARIO Y EVALUACIÓN DE LAS FUENTES DE AGUA SUBTERRÁNEA EN EL VALLE DE ACARI"
- Author: MINISTERIO DE AGRICULTURA
- Page Reference: "54, 115"
Interpretation: The user can verify this record by finding the report "INVENTARIO... VALLE DE ACARI" (1981) and looking at pages 54 and 115, where this specific well data (ID 04/03/12-1) appears.
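The lookup described above can be sketched in Python as follows. The sketch assumes that Pagina_Referencia holds comma-separated page numbers, as in the example ("54, 115"); other forms such as page ranges are not handled and would raise a ValueError. The function names are hypothetical.

```python
def parse_page_reference(text):
    """Split a Pagina_Referencia string such as "54, 115" into page numbers."""
    return [int(token) for token in text.split(",") if token.strip()]


def describe_source(record):
    """Format the title, author, and pages needed to locate a record's source."""
    pages = parse_page_reference(record["Pagina_Referencia"])
    page_text = ", ".join(str(page) for page in pages)
    return f'{record["Titulo_Documento"]} ({record["Autor_es"]}), p. {page_text}'


# Values taken from the example row above.
example = {
    "Titulo_Documento": "INVENTARIO Y EVALUACIÓN DE LAS FUENTES DE AGUA SUBTERRÁNEA EN EL VALLE DE ACARI",
    "Autor_es": "MINISTERIO DE AGRICULTURA",
    "Pagina_Referencia": "54, 115",
}
print(parse_page_reference(example["Pagina_Referencia"]))  # [54, 115]
print(describe_source(example))
```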
5. METHODOLOGY
-----------------
Data was extracted using a "Human-in-the-Loop" AI workflow:
1. OCR (Tesseract) converted scanned PDFs to text.
2. LLM (Gemini 2.5 Pro) parsed the text to extract hydrogeological parameters.
3. Validation: 100% of records were manually cross-checked by the authors against the source PDFs to ensure accuracy.
6. CITATION
-----------------
When using this dataset, please cite:
Venegas-Quiñones, H. L., et al. (2025). Regional Groundwater Database for Arequipa and Surroundings. HydroShare. [DOI Link]
--- END OF FILE README.txt ---
Credits
Funding Agencies
This resource was created using funding from the following sources:
| Agency Name | Award Title | Award Number |
|---|---|---|
| The Center for Mining Sustainability | | 470266 |
How to Cite
This resource is shared under the Creative Commons Attribution CC BY.
http://creativecommons.org/licenses/by/4.0/