
🚢 cargoship

Enterprise data archiving for research

Visit cargoship.io → | GitHub →

Overview

cargoship provides S3-optimized archiving and long-term storage for research data. It manages the data lifecycle, compression, deduplication, and retrieval so that computational results and datasets can be retained cost-effectively.

Key Features

Lifecycle Management

  • Intelligent Tiering - Automatic S3 tier transitions
  • Archive Policies - Hot → Warm → Cold → Deep Archive
  • Retention Rules - Compliance and regulatory requirements
  • Expiration - Automatic cleanup of temporary data

Data Optimization

  • Compression - Transparent compression for archives
  • Deduplication - Block-level dedup for efficiency
  • Metadata Indexing - Fast search without retrieval
  • Checksums - Data integrity verification

Cost Management

  • Storage Classes - S3 Standard, IA, Glacier, Deep Archive
  • Cost Estimation - Calculate storage costs before archiving
  • Usage Reports - Per-project and per-user breakdowns
  • Budget Alerts - Notifications at spending thresholds

Data Discovery

  • Metadata Search - Find archives without downloading
  • Tagging - Organize by project, user, date
  • Cataloging - Browse archives via web interface
  • Full-Text Search - Search within archive metadata

Use Cases

Research Data Archiving

Archive computational results, simulation outputs, and processed datasets for long-term retention.

Compliance and Retention

Meet data retention requirements for grants, regulations, and institutional policies.

Cold Storage

Store infrequently accessed data cost-effectively in Glacier or Deep Archive.

Dataset Preservation

Preserve published datasets associated with papers and ensure long-term accessibility.

Architecture

graph TB
    User[Researcher] -->|Archive| CS[cargoship CLI]
    CS -->|Upload| S3Hot[S3 Standard]

    S3Hot -->|30 days| S3IA[S3 Infrequent Access]
    S3IA -->|90 days| Glacier[S3 Glacier Instant Retrieval]
    Glacier -->|180 days| DeepArchive[S3 Glacier Deep Archive]

    CS -->|Catalog| DynamoDB[Metadata Index]
    CS -->|Search| DynamoDB

    User -->|Retrieve| CS
    CS -->|Restore| Glacier

    style CS fill:#f5e1ff
    style S3Hot fill:#fff4e1
    style Glacier fill:#e1f5ff

Storage Tiers

Archives move through four S3 storage classes on the schedule below. Prices are representative us-east-1 list rates per GB-month; the equivalent S3 lifecycle policy is sketched after the tier descriptions.

Hot (S3 Standard)

  • Use Case: Frequently accessed data
  • Retrieval: Instant
  • Cost: $0.023/GB/month
  • Transition: After 30 days → Warm

Warm (S3 IA)

  • Use Case: Occasionally accessed data
  • Retrieval: Instant
  • Cost: $0.0125/GB/month
  • Transition: After 90 days → Cold

Cold (S3 Glacier Instant Retrieval)

  • Use Case: Rarely accessed, instant when needed
  • Retrieval: Instant
  • Cost: $0.004/GB/month
  • Transition: After 180 days → Deep Archive

Deep Archive (S3 Glacier Deep Archive)

  • Use Case: Long-term archival, rarely retrieved
  • Retrieval: 12 hours (standard) or up to 48 hours (bulk)
  • Cost: $0.00099/GB/month
  • Transition: None (final tier)
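
Under the hood, this schedule maps onto a standard S3 lifecycle configuration. Below is a minimal sketch using the AWS CLI; the bucket name and prefix are placeholders, and cargoship may manage these rules differently internally.

# Apply the Hot → Warm → Cold → Deep Archive schedule to a bucket
# (bucket name and prefix are placeholders)
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-archive-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "cargoship-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": "archives/"},
      "Transitions": [
        {"Days": 30,  "StorageClass": "STANDARD_IA"},
        {"Days": 90,  "StorageClass": "GLACIER_IR"},
        {"Days": 180, "StorageClass": "DEEP_ARCHIVE"}
      ]
    }]
  }'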

Getting Started

Install CLI

# Install cargoship CLI
brew install cargoship

# Or download from releases
curl -LO https://github.com/scttfrdmn/cargoship/releases/latest/download/cargoship-darwin-amd64
chmod +x cargoship-darwin-amd64
sudo mv cargoship-darwin-amd64 /usr/local/bin/cargoship

Archive Data

# Archive a directory
cargoship archive \
  --path /results/experiment-1 \
  --name "Experiment 1 Results" \
  --project "Atmospheric Chemistry" \
  --tier standard

# Archive with custom lifecycle
cargoship archive \
  --path /results/simulation \
  --name "Q1 2025 Simulation Data" \
  --lifecycle deep-archive \
  --compression gzip

# Archive from S3
cargoship archive \
  --source s3://my-bucket/results/ \
  --name "Batch Job Results" \
  --tier glacier

Search and Retrieve

# List archives
cargoship list

# Search by metadata
cargoship search \
  --project "Atmospheric Chemistry" \
  --date-after 2025-01-01

# Search by tags
cargoship search \
  --tag experiment-type:simulation \
  --tag model:geos-chem

# Retrieve archive
cargoship retrieve \
  --archive-id arch-abc123 \
  --output /restore/experiment-1

# Restore from Glacier (initiate retrieval)
cargoship restore \
  --archive-id arch-def456 \
  --tier expedited  # or standard, bulk
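# Note: expedited/standard/bulk map to the S3 restore tiers. Expedited
# typically completes in minutes and standard in hours; Deep Archive
# supports only standard (~12 hours) and bulk (~48 hours).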

Manage Archives

# View archive details
cargoship describe arch-abc123

# Add tags
cargoship tag arch-abc123 \
  --tag publication:doi-10.1234/example

# Update metadata
cargoship update arch-abc123 \
  --description "Updated description"

# Set retention policy
cargoship retention arch-abc123 \
  --keep-until 2030-01-01

# Delete archive (if allowed)
cargoship delete arch-abc123

Integration with ResearchComputing

Workflow Integration

# Archive atom job results
atom job submit geos-chem ... | tee job-id.txt
atom job wait $(cat job-id.txt)
cargoship archive \
  --source s3://atom-results/$(cat job-id.txt)/ \
  --name "GEOS-Chem Run $(cat job-id.txt)"

# Archive cloudworkstation data
cloudworkstation export my-workspace --to /tmp/workspace-data
cargoship archive \
  --path /tmp/workspace-data \
  --name "ML Experiment Workspace"

# Archive lens notebooks
lens export my-notebook --with-outputs --to /tmp/notebooks
cargoship archive \
  --path /tmp/notebooks \
  --name "Research Notebooks Q1 2025"

Automated Archiving

cargoship can automatically archive results from other ResearchComputing tools; a minimal scheduled wrapper is sketched after this list:

  • atom - Archive job results after completion
  • cloudworkstation - Archive workspace snapshots
  • lens - Archive notebooks and environments
  • S3 buckets - Automatic lifecycle policies
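
The wrapper below is a sketch built from the commands shown earlier. It assumes `atom job submit` prints a job ID and that results land in s3://atom-results/<job-id>/, as in the workflow example above; it is not a built-in cargoship feature.

#!/usr/bin/env bash
# Hypothetical scheduled job: run an atom job, wait for completion,
# then archive the results to Glacier. Extra arguments pass through
# to atom unchanged.
set -euo pipefail

job_id=$(atom job submit geos-chem "$@")
atom job wait "$job_id"

cargoship archive \
  --source "s3://atom-results/${job_id}/" \
  --name "GEOS-Chem Run ${job_id}" \
  --tier glacier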

Features

Compression

  • Transparent - Compress on archive, decompress on retrieve
  • Formats - gzip, bzip2, xz, zstd
  • Automatic - Select best compression based on data type (a selection sketch follows this list)
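
One way to approximate automatic selection is to compress a small sample with each candidate codec and keep the smallest result. The sketch below uses standard Unix tools; the sample path and 64 MB sample size are illustrative, not cargoship's actual heuristic.

# Compare candidate codecs on a 64 MB sample of the data
# (sample path is a placeholder)
head -c 64M /results/experiment-1/output.nc > /tmp/sample
for codec in gzip bzip2 xz zstd; do
  echo "$codec: $("$codec" -c /tmp/sample | wc -c) bytes"
done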

Deduplication

  • Block-Level - Deduplicate at block level (4MB blocks)
  • Cross-Archive - Dedup across all archives
  • Storage Savings - Typically 30-60% savings for similar datasets (see the estimate sketch below)
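
You can roughly estimate how much a given file would benefit by hashing its 4 MB blocks and counting duplicates. This sketch uses fixed-offset blocks and standard tools; cargoship's actual chunking may differ.

# Estimate block-level dedup potential: unique vs. total 4 MB blocks
# (input path is a placeholder)
mkdir -p /tmp/blocks
split -b 4M /results/experiment-1/output.nc /tmp/blocks/blk.
total=$(ls /tmp/blocks | wc -l)
unique=$(sha256sum /tmp/blocks/blk.* | awk '{print $1}' | sort -u | wc -l)
echo "blocks: $total, unique: $unique"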

Metadata

{
  "archive_id": "arch-abc123",
  "name": "Experiment 1 Results",
  "project": "Atmospheric Chemistry",
  "created": "2025-01-15T10:30:00Z",
  "size_bytes": 107374182400,
  "compressed_size": 53687091200,
  "files": 1523,
  "storage_class": "GLACIER",
  "tags": {
    "experiment-type": "simulation",
    "model": "geos-chem",
    "version": "14.4.3"
  },
  "checksum": "sha256:abcd1234..."
}
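
Assuming `cargoship describe` emits this record as JSON (an assumption about its output format), the index can be queried locally with jq without touching the archived bytes:

# Show selected fields, then the compression ratio (2.0 for this record)
cargoship describe arch-abc123 | jq '{name, storage_class, tags}'
cargoship describe arch-abc123 | jq '.size_bytes / .compressed_size'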

Cost Estimation

# Estimate storage costs
cargoship estimate \
  --size 100GB \
  --tier glacier \
  --duration 5years

# Output:
# Tier: Glacier
# Size: 100 GB
# Duration: 5 years
# Estimated cost: $24.00
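
The estimate is simple arithmetic: 100 GB × $0.004/GB/month (the Glacier Instant Retrieval rate above) × 60 months = $24.00. Retrieval and per-request fees are billed separately.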


Technology Stack

  • Language: Go
  • Storage: S3, S3 Glacier, S3 Glacier Deep Archive
  • Metadata: DynamoDB
  • Compression: zlib, bzip2, xz, zstd
  • Infrastructure: CloudFormation, Lambda

Project Status

  • Current Version: v0.4.0
  • Status: Active Development
  • License: Apache 2.0

Contributing

Contributions welcome! See the contribution guide.
