
This is a regularly updated mirror of all data files linked from data.gov.
The repository is maintained by the Harvard Law School Library Innovation Lab as part of our project to preserve U.S. federal public data.
Collection Format
Each dataset on data.gov has a unique slug known as its name. We store each dataset in this repository as:
collections/data_gov/<name>/<version>.zip
We also store a metadata file for each dataset in the metadata directory:
metadata/data_gov/<name>/<version>.json
<version> is a v followed by the number of times we have downloaded the dataset (v1, v2, etc.).
For example, the data.gov dataset https://catalog.data.gov/dataset/fruit-and-vegetable-prices is stored in this repository as:
- collections/data_gov/fruit-and-vegetable-prices/v1.zip
- metadata/data_gov/fruit-and-vegetable-prices/v1.json
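The naming scheme above can be sketched as a small helper. This is a minimal illustration, not part of the project's tooling; the function names dataset_paths and version_number are hypothetical.

```python
import re


def dataset_paths(name: str, version: int = 1) -> tuple[str, str]:
    """Return (collection_path, metadata_path) for a data.gov dataset name."""
    if version < 1:
        raise ValueError("versions start at v1")
    return (
        f"collections/data_gov/{name}/v{version}.zip",
        f"metadata/data_gov/{name}/v{version}.json",
    )


def version_number(path: str) -> int:
    """Extract the numeric version from a path such as '.../v12.zip'."""
    match = re.search(r"/v(\d+)\.(?:zip|json)$", path)
    if match is None:
        raise ValueError(f"no version found in {path!r}")
    return int(match.group(1))
```

Sorting paths with key=version_number orders v10 after v9, which a plain string sort would not.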
Dataset Format
Each dataset zip file is a BagIt package created by our bag-nabit tool.
BagIt is a simple file format, established by the Library of Congress, consisting of a folder of metadata and text files. Our BagIt files follow this directory structure:
- data/
  - files/
    - ...: these are the actual files you likely want to use as a researcher, downloaded from the data.gov listing.
  - headers.warc: request and response headers from HTTP fetches for files in files/
  - signed-metadata.json: metadata including data.gov's API description of the dataset
The bags also contain these files, which are useful for authenticating the provenance of the data:
- bagit.txt: standard BagIt file
- bag-info.txt: standard BagIt file
- manifest-sha256.txt: standard BagIt file
- tagmanifest-sha256.txt: standard BagIt file
- signatures/: directory of signature files
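As a rough illustration of how the manifest can be used, the sketch below recomputes the SHA-256 of each payload file listed in manifest-sha256.txt and reports any mismatch. It assumes the bag's files sit at the root of the zip, which this document does not state, and the function name verify_manifest is hypothetical. It checks payload checksums only; it does not validate the files under signatures/.

```python
import hashlib
import zipfile


def verify_manifest(zip_path, manifest_name="manifest-sha256.txt"):
    """Return the payload paths whose SHA-256 does not match the manifest."""
    mismatches = []
    with zipfile.ZipFile(zip_path) as bag:
        # Assumption: the manifest is stored at the root of the zip.
        manifest = bag.read(manifest_name).decode("utf-8")
        for line in manifest.splitlines():
            if not line.strip():
                continue
            # Each BagIt manifest line is "<hex digest> <path>".
            expected, path = line.split(maxsplit=1)
            digest = hashlib.sha256()
            with bag.open(path) as member:
                # Hash in 1 MiB chunks so large files are not loaded at once.
                for chunk in iter(lambda: member.read(1 << 20), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                mismatches.append(path)
    return mismatches
```

An empty list means every listed file matched. A file that is named in the manifest but missing from the zip raises KeyError rather than being reported as a mismatch.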
Metadata File Format
Each metadata JSON file contains three main sections:
- bag_info: Contains the BagIt metadata including:
  - Bag-Software-Agent: The version of nabit used to create the archive
  - Bagging-Date: When the archive was created
- signed_metadata: Contains detailed information about the dataset including:
  - id: A UUID for this specific archive
  - url: The data.gov URL for the dataset
  - description: A brief description including the dataset title and creating organization
  - data_gov_metadata: The complete metadata from data.gov's API, including:
    - Dataset details (title, description, etc.)
    - Organization information
    - Resource listings
    - Tags and other metadata
  - collection_tasks: Records of the HTTP requests made to collect the dataset
- zip_entries: Listing of each entry in the collection zip file, which can be used to fetch individual files from the zip file via range request without downloading the entire archive.
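The range-request idea can be sketched as follows. This is an illustration under assumptions, not the project's own client: it assumes that the zip_entries record for a file gives the byte offset of the entry's local header and its compressed size (the exact field names are not documented here, so the function takes them as plain arguments), and the names read_zip_member and http_range_fetcher are hypothetical.

```python
import struct
import urllib.request
import zlib


def read_zip_member(fetch_range, header_offset, compressed_size):
    """Read one zip member using only byte-range reads.

    fetch_range(start, length) must return `length` bytes starting at `start`.
    """
    # A zip local file header is 30 fixed bytes, then the name and extra field.
    header = fetch_range(header_offset, 30)
    (signature, _version, _flags, method, _time, _date, _crc,
     _csize, _usize, name_len, extra_len) = struct.unpack("<IHHHHHIIIHH", header)
    if signature != 0x04034B50:
        raise ValueError("offset does not point at a zip local file header")
    data_start = header_offset + 30 + name_len + extra_len
    raw = fetch_range(data_start, compressed_size)
    if method == 0:  # stored, no compression
        return raw
    if method == 8:  # deflate, a raw stream without a zlib header
        return zlib.decompress(raw, -15)
    raise ValueError(f"unsupported compression method {method}")


def http_range_fetcher(url):
    """Build a fetch_range callable that issues HTTP Range requests."""
    def fetch_range(start, length):
        byte_range = f"bytes={start}-{start + length - 1}"
        request = urllib.request.Request(url, headers={"Range": byte_range})
        with urllib.request.urlopen(request) as response:
            return response.read()
    return fetch_range
```

Passing the fetcher in as a callable keeps the zip parsing independent of the transport, so the same function can be exercised against a local file or an in-memory buffer.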
Rollup files
There are several rollup files at the top level to help with finding datasets of interest:
- metadata.csv.zip: CSV listing the name, organization, title, date, metadata path, and collection path for each dataset
- metadata.jsonl.zip: JSON lines file with complete metadata for each dataset, including the signed_metadata and zip_entries sections (equivalent to downloading the metadata/ directory as a single file)
- file_listing.jsonl.zip: zipped JSON lines file showing the s3 listing of all files in the repository
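A zipped JSON lines rollup can be streamed without unpacking it to disk. The sketch below assumes each rollup zip holds a single JSON lines member, which is not stated here, and makes no assumption about the keys inside each record; the function name iter_jsonl_zip is hypothetical.

```python
import io
import json
import zipfile


def iter_jsonl_zip(zip_path):
    """Yield one parsed JSON object per line from the first member of a zip."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive contains exactly one JSON lines file.
        first_member = archive.namelist()[0]
        with archive.open(first_member) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                if line.strip():
                    yield json.loads(line)
```

Because the function is a generator, records are decoded one at a time, which matters for a rollup that covers every dataset in the repository.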
Downloading data
To download an individual dataset by name you can construct its URL, such as:
https://source.coop/harvard-lil/gov-data/collections/data_gov/fruit-and-vegetable-prices/v1.zip
https://source.coop/harvard-lil/gov-data/metadata/data_gov/fruit-and-vegetable-prices/v1.json
To download large numbers of files, we recommend the aws or rclone command line tools:
aws s3 cp s3://us-west-2.opendata.source.coop/harvard-lil/gov-data/collections/data_gov/<name>/v1.zip . --no-sign-request
Data Limitations
data.gov includes multiple kinds of datasets, including some that link to actual data files, such as CSV files, and some that link to HTML landing pages. Our process runs a "shallow crawl" that collects only the directly linked files. Datasets that link only to a landing page will need to be collected separately.
Source code
The source code used to generate this and other repositories is available at https://github.com/harvard-lil/data-vault. We welcome conversation and collaboration in the issue tracker for that project.
Collection Dates and Update Schedule
Files in this repository were collected intermittently between 2024-11-19 and 2025-02-06.
Beginning on 2025-02-06, we will update the repository daily.