Articles Archives - Versity
https://www.versity.com/category/articles/ | Tue, 18 Nov 2025

New Customer Spotlight: TACC Embraces Versity’s Open Flash-to-Tape Architecture for Exascale Data Archive
https://www.versity.com/new-customer-spotlight-tacc/ | Mon, 17 Nov 2025


At the Texas Advanced Computing Center (TACC), exascale isn’t a goal; it’s a reality. With the launch of Horizon, TACC’s newest AI supercomputer and NSF Leadership-Class Computing Facility system, researchers gain access to 400 petaFLOPS of computing power and a 100x leap in AI capabilities compared to its predecessor, Frontera. Horizon is set to become the largest academic supercomputer dedicated to open scientific research in the world. But this computational leap introduces a challenge: how to store and access exabytes of data with speed, efficiency, and affordability.

Exascale Data Demands a New Approach

To meet this challenge, TACC selected Versity’s ScoutAM software to deliver the performance, scalability, and reliability needed for Horizon’s exascale data workloads. Versity’s modern, vendor-agnostic approach stood out by giving TACC full control over its archive environment. 

“TACC has always been a big fan of open-source software. Versity’s software is transparent and simple,” says Junseong Heo, Storage Manager of Advanced Computing Systems at TACC. “Unlike other vendors that lock you into specific filesystems or formats, Versity’s platform isn’t locked down. It gives us more flexibility.”

ScoutAM also integrates seamlessly with TACC’s existing environment. Researchers and administrators interact with archived files through familiar protocols (POSIX and S3) without needing to know which tier the data resides on. This means TACC’s users can continue to access archived data through the same hierarchical filesystem structure they’re used to, while Versity’s ScoutAM works behind the scenes to automatically migrate data between a front-end cache and the tape archive. 

“Compatibility was also important because we didn’t want to deal with vendor lock-in,” says David Cooper, Senior Systems Administrator at TACC. With Versity, TACC retains flexibility while ensuring data remains accessible for decades to come.

Embracing a Two-Tier Flash-to-Tape Architecture

Historically, HPC sites relied on multi-tier disk architectures. But with Horizon’s I/O demands, TACC saw an opportunity to modernize and optimize: eliminate the mid-tier disk and adopt a higher-performance, more cost-efficient two-tier flash-to-tape architecture.

The result is Ranch, TACC’s new exascale archive, where Versity ScoutAM intelligently manages data movement between high-speed flash and high-capacity tape. Frequently accessed data stays on flash for rapid retrieval, while less-active data is automatically archived to tape. 

“We needed a system that could keep up with the demands of our users and compute clusters,” Cooper remarks. “The Spectra TFinity and Versity ScoutAM solution allows us to continue the high level of service our users expect.”

VAST Data’s all-flash platform provides Horizon’s high-speed tier for active workloads. As the Ranch archive ingests data from VAST, Versity’s ScoutAM seamlessly handles the automated placement of that data into TACC’s long-term archive according to policy. Running on 13 Dell PowerEdge R760 servers, ScoutAM orchestrates data across five Dell ME5 arrays, 16 PB of Dell ECS object storage, and Spectra TFinity libraries, delivering the automation and scale expected of a world-class research center. The Ranch system is engineered to manage up to one exabyte of data at a single site.

“VAST and ScoutAM give TACC a truly modern storage architecture built for today’s AI-driven science,” shares Kyle Lamb, Field CTO from VAST. “Both platforms were architected within the past decade and take advantage of state-of-the-art algorithms that unlock the full performance potential of modern hardware. By pairing VAST’s ultra-fast flash tier with ScoutAM’s exascale-ready archive management, TACC gets a clean, efficient architecture that delivers performance, scale, and simplicity without the baggage of legacy designs.”

This two-tier flash-to-tape model is rapidly becoming the standard for large-scale archives. Flash provides the I/O performance needed for AI and big data workflows, while tape provides virtually unlimited scale and the lowest cost per petabyte. Combined, they provide near-instant access with long-term, energy-efficient retention, eliminating the need for massive disk-based infrastructure and reducing operational overhead.

“Scalability was the most crucial factor for us, followed by price — we usually try to get as much storage ‘bang for the buck’ as we can,” Cooper comments. 

Simplified Management and User Experience

Despite involving multiple components from different vendors, Versity’s solution delivers a seamless and intuitive experience for TACC’s IT team. ScoutAM sits cleanly atop the multi-vendor infrastructure, bridging front-end systems with the tape archive through a single, modular software layer. Administrators use Versity’s management interface as their central hub for monitoring the Ranch system, scheduling tasks, and handling errors, with all events preserved in context for fast troubleshooting.

The system also integrates with TACC’s broader data center monitoring tools, ensuring archive status and events are easily visible within unified operations. Although initially cautious about deploying a three-vendor solution, TACC’s team quickly found that ScoutAM simplified management by consolidating control into one cohesive, easy-to-use platform. 

ScoutAM’s file system is mounted on the flash tier and presents archived data through a familiar directory interface. Files migrated to tape remain fully visible and are automatically recalled when accessed, making long-term storage feel as fast and accessible as local disk.

“We like the simplicity of Versity’s solution,” Cooper said. “Our users are already comfortable with a filesystem interface, and ScoutAM makes it easy to browse and manage data while scaling in the background.”

Built for Resilience and Future Growth

With ScoutAM, TACC’s data isn’t just accessible and easy to manage; it’s also secure for the long haul. All metadata required for restoration is stored directly with the archived data, eliminating the need for proprietary recovery software. Unlike other solutions that might use open formats but rely on proprietary containers, Versity stores files in open formats that can be read with standard tools such as tar. This ensures that even decades from now, TACC will be able to restore its archive independently and with confidence.

Versity further enhances resilience through flexible replication and copy policies. For critical datasets, ScoutAM can automatically create multiple tape copies (either locally or at remote sites) to guard against hardware failures. The platform also supports asynchronous replication of metadata and cached data to secondary locations, forming a strong backbone for disaster recovery.

“Versity has a very elegant and simple disaster recovery solution,” says Cooper.

Of course, resilience isn’t just about architecture; it’s also about people. For a system of this scale, responsive and technically capable support was essential. TACC was impressed by Versity’s collaborative approach and its willingness to incorporate customer feedback directly into the product.

“Versity is evolving quickly, and their team has been very responsive. We’ve been able to give feedback directly to their engineers and see it reflected in the product,” noted Heo.

That responsiveness ensures the archive remains a dynamic system, one that can evolve with TACC’s needs, whether that means upgrading to new generations of tape technology, expanding flash capacity, or scaling out to additional sites across a growing data ecosystem.

Leading a New Standard for Exabyte Archives

By deploying an open two-tier flash-to-tape archive with Versity, TACC has set a powerful example for how academic and research institutions can handle exabyte-scale data without sacrificing performance or control. This approach reflects a broader shift: flash and tape are no longer separate tiers but integrated through intelligent software into a unified architecture built for the demands of AI and HPC.

“By uniting ultra-fast flash capable of feeding compute at AI speed with high-performance, cost-effective, scalable tape, and intelligently managing it through software, TACC is embracing the model that will define the next decade of scientific storage,” notes Meghan McClelland, VP of Product at Versity. “It’s a blueprint for how HPC and AI facilities can keep pace with data growth without compromising performance or flexibility.”

The Ranch archive is now fully online and ready to support Horizon’s groundbreaking AI and HPC workloads, ensuring that the massive outputs of this supercomputer are safely preserved and readily accessible for scientific discovery. TACC achieved this with a best-of-breed solution that avoids vendor lock-in, proving that you don’t have to be tied to a single mega-vendor to get world-class results. In fact, flexibility and openness were key to building an archive that meets TACC’s aggressive requirements.

“This deployment showcases the power of an open, vendor-agnostic approach to archival storage,” says Bruce Gilpin, Co-founder and CEO of Versity. “By choosing Versity’s software, TACC has complete control over their data with no proprietary lock-in, and they’ve implemented a modern flash-to-tape architecture that will serve as a model for other exascale sites.”

At the intersection of big science and big data, TACC’s Ranch archive showcases the future of exabyte-scale research infrastructure. Versity is honored to support TACC by delivering a solution that meets today’s needs and adapts to tomorrow’s challenges. With openness, speed, and exascale ambition, Ranch is not just an upgrade – it’s the blueprint for data-driven research in the coming decades.

Extending Versity S3 Gateway with a Shared Library Plugin Framework
https://www.versity.com/extending-versity-s3-gateway-with-a-shared-library-plugin-framework/ | Tue, 09 Sep 2025


The Versity S3 Gateway (VersityGW) is an open-source, high-performance S3 translation server written in Go. While Versity leads development, the project is intentionally open to community innovation. Anyone can contribute, and contributions don’t just solve local challenges—they strengthen the gateway for everyone. One notable contribution has come from engineers at CERN, who developed and contributed a shared library plugin framework for the Versity S3 Gateway.

This new capability makes it easier to add support for different storage backends by writing them as standalone plugins that can be dynamically loaded at runtime. CERN’s work is a great example of how organizations can extend the gateway to fit their needs while giving back to the community.

What the Plugin Framework Enables

The shared-library plugin system adds flexibility without complicating the core gateway:

  • Dynamic loading – Backends are written and maintained independently, then loaded into the Versity S3 Gateway at runtime.
  • Focused development – Developers only implement storage operations like putObject() and getObject(). The Versity S3 Gateway handles the rest—protocol parsing, authentication, routing, multipart handling, logging, error responses, and metrics.
  • Direct integration – Plugins can talk directly to backend APIs without going through POSIX, improving efficiency.
  • Simpler adoption – New backends can be integrated without forking or patching core code.

This approach gives storage engineers a clean, practical way to add S3 access to their systems. Because plugins live outside the main repo, they can be shared, adapted, or customized by different groups. That makes this model not only practical for developers but also collaborative—an ecosystem where one contribution can be useful to many.
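In Go, a shared-library plugin is typically built with `-buildmode=plugin` and resolved at runtime via the standard `plugin` package’s `Open` and `Lookup`. The sketch below skips that loading step so it stays self-contained, and its interface and method names are illustrative rather than VersityGW’s actual backend API; it shows only the shape of the idea, where the gateway core programs against a narrow interface and a backend fills it in:

```go
package main

import (
	"errors"
	"fmt"
)

// Backend is the kind of narrow interface a storage plugin implements.
// The gateway core handles protocol parsing, auth, routing, and errors;
// the plugin only maps object operations onto its storage system.
// (Names here are illustrative, not VersityGW's exact API.)
type Backend interface {
	PutObject(bucket, key string, data []byte) error
	GetObject(bucket, key string) ([]byte, error)
}

// memBackend is a stand-in plugin backed by a map. A real plugin would
// be compiled as a shared library and resolved at runtime with
// plugin.Open + Lookup, then asserted to the Backend interface.
type memBackend struct {
	objects map[string][]byte
}

func newMemBackend() *memBackend {
	return &memBackend{objects: make(map[string][]byte)}
}

func (m *memBackend) PutObject(bucket, key string, data []byte) error {
	m.objects[bucket+"/"+key] = data
	return nil
}

func (m *memBackend) GetObject(bucket, key string) ([]byte, error) {
	data, ok := m.objects[bucket+"/"+key]
	if !ok {
		return nil, errors.New("no such key")
	}
	return data, nil
}

func main() {
	var b Backend = newMemBackend() // the gateway sees only the interface
	b.PutObject("science", "run42.dat", []byte("payload"))
	data, err := b.GetObject("science", "run42.dat")
	fmt.Println(string(data), err)
}
```

Because the gateway depends only on the interface, any backend satisfying it, in-memory, POSIX, or a direct API client, can be swapped in without touching core code.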

CERN’s EOS Plugin

Alongside the shared-library framework itself, CERN developed the first reference implementation: the EOS S3 plugin (eoss3). This plugin provides a concrete example of how to extend the Versity S3 Gateway to integrate a large-scale production storage system.

At its core, the plugin implements a direct mapping between S3 operations and EOS APIs:

  • Metadata handling via gRPC – S3 metadata operations (such as object creation, deletion, or listing) are translated into EOS’s native gRPC calls, ensuring fast and efficient communication.
  • Data transfers via HTTP – Object data flows directly over HTTP to EOS, separating the control plane (metadata) from the data plane for performance and scalability.

With this architecture, standard S3 clients can seamlessly interact with EOS, CERN’s distributed storage platform that underpins much of the data analysis for high-energy physics. EOS manages hundreds of petabytes of scientific data produced by experiments at the Large Hadron Collider, making S3 access a critical step in enabling broader interoperability with external tools and workflows.
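The control-plane/data-plane split described above can be modeled in a few lines. In this hedged sketch (the struct and function names are invented), injected function values stand in for the real gRPC and HTTP transports, and a put registers metadata before streaming bytes:

```go
package main

import "fmt"

// eosClient separates the control plane (metadata) from the data plane
// (object bytes), mirroring the gRPC-for-metadata / HTTP-for-data split.
// The two function fields stand in for real transports.
type eosClient struct {
	createMeta func(key string) error           // control plane (gRPC in the real plugin)
	sendData   func(key string, b []byte) error // data plane (HTTP in the real plugin)
}

// Put first registers the object in the namespace, then streams its
// bytes; keeping the planes separate lets each scale independently.
func (c *eosClient) Put(key string, b []byte) error {
	if err := c.createMeta(key); err != nil {
		return err
	}
	return c.sendData(key, b)
}

func main() {
	var calls []string
	c := &eosClient{
		createMeta: func(key string) error { calls = append(calls, "meta:"+key); return nil },
		sendData:   func(key string, b []byte) error { calls = append(calls, "data:"+key); return nil },
	}
	c.Put("lhc/run1.root", []byte("events"))
	fmt.Println(calls) // metadata registration precedes the data transfer
}
```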

CERN also emphasized in their presentation on the S3 Gateway for EOS that this plugin approach brings several benefits:

  • Familiar interfaces for researchers – Scientists and engineers can use the S3-compatible tools they already know, while EOS continues to manage data at massive scale behind the scenes.
  • Fine-grained access control – EOS’s existing authentication and namespace controls remain in place, extended through the Versity S3 Gateway.
  • Security and IAM integration – The gateway model supports encryption and AWS IAM-style policies, aligning EOS more closely with modern cloud storage expectations.
  • Modularity and reusability – By building the EOS plugin as an independent shared library, CERN demonstrated how institutions can extend the Versity S3 Gateway without needing to fork or modify the gateway itself.

For other organizations, the EOS plugin serves as a reference design: a working example of how to connect a complex storage system to the Versity S3 Gateway. The source code and documentation in the EOS S3 plugin GitHub repository provide a valuable starting point for developers planning to integrate their own backends.

CERN’s EOS plugin is both a practical solution for their research community and a blueprint for collaboration, showing how specialized storage platforms can be exposed as fully S3-compatible services using the Versity S3 Gateway.

Explore & Contribute: Key Versity S3 Gateway Resources

Below is a curated list of key resources—code, documentation, and presentations—to explore, test, or contribute to the project:

  • Versity S3 Gateway Repository – Core Versity S3 Gateway source code with plugin framework, examples, and documentation.
  • Developer Guide: Enabling S3 Access – Developer-friendly article explaining the plugin model and its benefits.
  • CERN EOS S3 Plugin – CERN’s reference implementation showing how to map S3 operations to EOS using the plugin model.
  • CERN EOS Project – Open-source repository for EOS, CERN’s large-scale distributed storage system.
  • “S3 Gateway for EOS Presentation” (PDF) – CERN talk explaining how the gateway integrates with EOS, covering APIs, access control, and IAM.

Open Source + Open Collaboration

The Versity S3 Gateway is licensed under Apache 2.0, giving users and organizations the freedom to adapt, extend, and deploy the gateway without restrictions. CERN’s work is a clear example of the power of open source: a feature created to solve one institution’s challenge became a capability that anyone can use and build upon.

We’re excited to see what the community contributes next. Whether it’s new plugins, performance enhancements, expanded protocol support, improved monitoring, or better documentation—every contribution makes the gateway stronger. The Versity S3 Gateway’s open design creates space for innovation across the stack, and we look forward to seeing how the community continues to evolve it.

Empowering AI Science to Save Lives at NCSA
https://www.versity.com/empowering-ai-science-to-save-lives-at-ncsa/ | Tue, 29 Jul 2025


Artificial intelligence is transforming the way we respond to emergencies, enabling scientists to solve complex, high-impact problems with unprecedented speed and precision. At Cal Poly San Luis Obispo, researchers are using high-performance computing and advanced AI models to reimagine what’s possible in search and rescue. Leveraging the computational power of NCSA’s Delta Supercomputer and the scalable storage of the Granite archive powered by Versity’s ScoutAM, this work pushes the boundaries of what data-driven science can achieve. From real-time analysis to historical insight, this is a glimpse into how AI is enabling a new kind of research—one that doesn’t just predict outcomes, but helps save lives. 

Accelerating Search and Rescue with AI

Search and rescue operations have traditionally relied on paper forms to track information and coordinate efforts between command posts. To modernize this process, Cal Poly researchers began by digitizing decades of historical search-and-rescue records. Building on this data, they developed IntelliSAR, an AI-powered system that helps first responders make quicker and more accurate decisions.

These models analyze real-time and historical data to detect movement patterns, highlight high-probability search zones, and prioritize the most relevant clues. By learning from both successful and unsuccessful missions, the models can surface overlooked clues, avoid repeating inefficient search patterns, and recommend optimized strategies in real time. Some projects focus on simulating past rescue missions to improve future response strategies, while others analyze social media activity – such as a missing person’s mental state before disappearance – to add context to location predictions. By dramatically improving both the speed and precision of search efforts, this AI work increases the likelihood of locating missing persons and reducing harm, especially in remote or high-risk environments.

This is a sample heatmap generated by the AI tool. The dark purple area represents the highest likelihood of locating the missing person, with the surrounding colors indicating progressively lower probabilities. Credit: NCSA.

Keeping Pace with AI: Storage and Compute at Scale

This AI-driven science runs on DeltaAI, a powerful system at NCSA that combines NVIDIA Grace Hopper Superchips with HPE’s Slingshot interconnect and Cray programming environment to deliver high-performance computing at scale. But with the vast volumes of data generated by training and running AI models, the challenge becomes clear: how do you store it all and keep it accessible for retraining, reproducibility, and future discovery?

That’s where Versity comes in.

Introducing Granite: Scalable Archival Storage with Versity ScoutAM

Versity’s ScoutAM software powers Granite, NCSA’s massive tape-based archive system. Built on Spectra Logic’s TFinity tape libraries, Granite offers more than 60 petabytes of storage, delivering the scalability, performance, and efficiency needed to support NCSA’s data-intensive research workloads, like those powering real-time search and rescue AI at Cal Poly. 

ScoutAM is the brain behind the archive, automatically transferring AI model outputs, training data, and observational inputs from compute to archive as soon as they are ready to be offloaded. This is essential in environments like NCSA, where high-performance parallel file systems must be kept available for incoming workloads while guaranteeing no valuable data is lost. For projects like AI-assisted emergency response, where every piece of training data and simulation matters, this uninterrupted lifecycle is critical.

A Seamless, Efficient Data Lifecycle

ScoutAM integrates directly into the research workflow, providing a transparent and automated path from hot storage to archive. Researchers don’t need to change their workflows or request manual data movement; ScoutAM handles the transition intelligently, based on policy and system activity.

This streamlined lifecycle means that:

  • Valuable AI-generated data is never lost or discarded
  • Storage performance is maintained without manual intervention
  • Archived data remains accessible for model retraining and analysis
  • Scientific reproducibility is supported without burdening compute systems

For scientists building life-saving AI models, this system guarantees that past data can continue to inform future decisions, whether it’s identifying new search patterns or refining terrain analysis algorithms. It allows NCSA to scale its infrastructure while preserving the flexibility and accessibility researchers rely on.
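That transparency can be pictured with a minimal sketch (the names below are invented and this is not ScoutAM’s implementation): a read path checks whether a file’s data is resident in the cache and triggers a recall from tape before returning bytes, so callers never branch on tier:

```go
package main

import "fmt"

// file models an archived entry: metadata is always present, while the
// data may have been released to tape. Illustrative only.
type file struct {
	resident bool
	data     []byte
}

type archive struct {
	files    map[string]*file
	recalled []string // record of recalls, for visibility in the demo
}

// recall simulates staging a file's data back from tape to the cache.
func (a *archive) recall(name string) {
	a.files[name].resident = true
	a.recalled = append(a.recalled, name)
}

// Read transparently recalls non-resident files, so callers use one
// code path whether the data is on flash or on tape.
func (a *archive) Read(name string) ([]byte, error) {
	f, ok := a.files[name]
	if !ok {
		return nil, fmt.Errorf("no such file: %s", name)
	}
	if !f.resident {
		a.recall(name)
	}
	return f.data, nil
}

func main() {
	a := &archive{files: map[string]*file{
		"model-v1.ckpt": {resident: false, data: []byte("weights")},
	}}
	data, _ := a.Read("model-v1.ckpt")
	fmt.Println(string(data), a.recalled)
}
```

A second read of the same file finds it already resident and skips the recall, which is why archived data can feel as accessible as local disk.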

Unlocking Long-Term Insight with Historical Metrics

Historical data is vital for maintaining optimal IT system performance and effectively planning for future capacity needs. By continuously monitoring key metrics such as bandwidth usage, latency, and storage activity over time, organizations gain a clear understanding of their system behavior and can identify emerging issues early. This long-term insight enables teams to detect trends, address potential bottlenecks before they escalate, and make informed decisions about resource allocation. Without access to comprehensive historical data, organizations risk reacting too late or over-provisioning their infrastructure.

Versity’s ScoutAM addresses this challenge by automatically capturing and storing detailed records of file system activity and metadata. By providing a scalable, searchable archive of historical performance and usage data, ScoutAM empowers IT teams to analyze trends, optimize system configurations, and confidently plan for growth, helping to ensure reliable, efficient, and scalable storage environments.
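For example (a generic capacity-planning sketch, not a ScoutAM feature), fitting a least-squares line to equally spaced historical capacity samples yields a rough growth rate for forecasting:

```go
package main

import "fmt"

// growthPerStep fits a least-squares line to equally spaced capacity
// samples (e.g. daily used terabytes) and returns its slope: the
// average growth per sample interval.
func growthPerStep(samples []float64) float64 {
	n := float64(len(samples))
	if n < 2 {
		return 0
	}
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range samples {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	return (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
}

func main() {
	// Hypothetical daily used-capacity readings in TB.
	used := []float64{100, 102, 104, 106, 108}
	fmt.Printf("%.1f TB/day\n", growthPerStep(used)) // 2.0 TB/day
}
```

Dividing remaining free capacity by this slope gives a crude time-to-full estimate, which is the kind of question long-term metrics exist to answer.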

Supporting Research that Saves Lives

As AI becomes more deeply embedded in critical response workflows, the need for reliable, high-capacity storage infrastructure is only growing. These workloads generate vast amounts of data that must be retained, retrievable, and reusable to support evolving models and infrastructures. Versity is proud to support that mission with archival technology built for performance, scale, and long-term value.

Whether it’s accelerating life-saving search efforts or enabling reproducible AI research, our work with NCSA highlights what’s possible when innovative computing and smart data management come together.

Implicit vs. Explicit Archiving: A Deep Dive into Storage Management
https://www.versity.com/implicit-vs-explicit-archiving/ | Tue, 04 Mar 2025


As data volumes continue to grow exponentially, organizations must carefully consider how they manage storage resources. Many rely on various archiving systems to balance performance, cost, and capacity by automatically moving data between high-performance primary storage and lower-cost archival storage.

Within this framework, two distinct archiving models emerge: implicit archiving and explicit archiving. The implicit model transfers data to archival storage automatically, often without user awareness, while the explicit model requires users to make deliberate decisions about which data should be archived. These approaches differ significantly in their impact on system performance, data accessibility, and long-term storage efficiency.

This article explores the key differences between these two models, highlighting the challenges of implicit archiving and the benefits of adopting an explicit approach.

The Implicit Archiving Model

The implicit archiving model operates in the background, moving data to an archive based on predefined rules. While this approach may seem convenient, it introduces several challenges that can complicate long-term data management.

Tight Integration of Archive and Primary Filesystems

One of the defining characteristics of implicit archiving is that the archive cannot function as a standalone resource. Although data is moved to archival storage, its metadata remains in the primary filesystem’s scratch namespace. This means that users must interact with the primary filesystem to retrieve archived data, preventing the archive from being used independently.

This tight integration creates dependencies between the two storage systems. If the primary filesystem fails or becomes unavailable, access to the archive is also disrupted. Organizations that need to maintain long-term archives—especially for compliance, research, or legal purposes—may find this restriction problematic, as archived data should ideally be retrievable even if the primary filesystem is decommissioned or replaced.

Unbounded Growth of the Scratch Namespace

Since metadata remains in the scratch namespace even after data is archived, the primary filesystem continues to grow indefinitely. This unbounded accumulation of metadata can lead to severe performance degradation over time.

  • Performance Slowdowns: Each time the system processes a request, it must sift through an ever-growing namespace. As the metadata pool expands, file lookup times increase, leading to sluggish performance for users and applications.
  • System Outages: In extreme cases, an overloaded scratch namespace can contribute to system crashes or require frequent maintenance downtimes to prevent failures.
  • Higher Operational Costs: IT teams must invest more resources into monitoring, optimizing, and expanding primary storage infrastructure to accommodate this unchecked growth.

Vendor Lock-In and Inflexibility

Because the archive and primary filesystem are deeply interconnected, organizations are often locked into using a single vendor for both solutions. This dependency limits flexibility in several ways:

  • Inability to Separate Bids for Archive and Primary Storage: When the archive is tied to a specific primary filesystem, it becomes impossible for organizations to source competitive bids for different storage components. This lack of flexibility can drive up costs and reduce access to innovative solutions.
  • Difficulty Upgrading or Replacing the Primary Filesystem: Since archived data remains dependent on the primary filesystem, migrating to a new storage system becomes a complex and costly endeavor. In many cases, organizations are forced to maintain outdated primary storage infrastructure simply because it is required to access archived data.

A recent example of vendor lock-in’s impact is the UK government’s experience, where reliance on major cloud providers like AWS and Azure has inhibited its negotiating power over cloud services. The Cabinet Office’s Central Digital & Data Office acknowledged that this dependency could lead to minimal leverage over pricing and product options, potentially resulting in entrenched vendor lock-in and regulatory scrutiny.

User Transparency and Access Issues

Implicit archiving often results in a lack of transparency for users, who may not be aware of whether a given file is stored on primary storage (online) or archived (offline). This can lead to several unintended consequences:

  • Accidental “Stage Storms”: A stage storm occurs when multiple users unknowingly request offline data at the same time, causing a surge in retrieval operations. Since the system must transfer data back from the archive to primary storage, these concurrent requests can overwhelm resources and create bottlenecks.
  • Unpredictable Data Retrieval Times: Users may become frustrated when some files take longer to access than others, without understanding why. If they are unaware that certain files are stored offline, they may assume the system is malfunctioning.

Inefficient Use of Storage

Because implicit archiving operates automatically, data that may not be needed or useful can end up in archival storage. When data is automatically archived without careful selection, unnecessary files accumulate in the archive, consuming valuable space and driving up costs over time. 

Furthermore, organizations often struggle to clearly determine what is stored in their archive and whether the archived data remains relevant. This lack of clarity can result in inefficiencies during audits, compliance checks, and long-term data retrieval efforts, ultimately complicating overall data management and potentially impacting operational effectiveness.

The Explicit Archiving Model

The explicit archiving model offers a more structured and user-driven approach to storage management. Unlike the implicit model, where data is moved automatically, the explicit model requires users or administrators to make deliberate decisions about what to archive and when. This ensures that only relevant data is preserved while reducing unnecessary storage consumption.

Improved Organization and Usability

By involving users directly in the archiving process, the explicit model fosters better organization and awareness. Users gain a clear understanding of access requirements and retrieval expectations because they intentionally choose what to store. This results in:

  • More predictable retrieval times, as users understand which files have been archived.
  • A cleaner and more structured storage environment, reducing unnecessary clutter.
  • Better data categorization, improving long-term storage efficiency.

Clear Separation Between Primary Filesystem and Archive

A key advantage of explicit archiving is that the archive operates as an independent resource rather than being tied to the primary filesystem. This separation provides several benefits:

  • Reduced Load on Primary Storage: By moving both data and metadata to the archive, the primary filesystem remains optimized for active operations.
  • Independent Data Retrieval: Users can access historical data without relying on the primary system, streamlining workflows and improving efficiency.
  • Better System Performance: Removing archived metadata from the scratch namespace prevents unnecessary slowdowns and outages.

Avoidance of Vendor Lock-In

One of the most significant drawbacks of implicit archiving is the risk of vendor lock-in. Since implicit models tightly integrate the archive with the primary storage, organizations often find themselves stuck with a single vendor’s ecosystem. Explicit archiving eliminates this issue by keeping the two systems separate.

When the primary storage and archive operate independently, organizations have the flexibility to upgrade or replace storage systems without disrupting access to archived data. If the primary filesystem reaches the end of its lifecycle, organizations can transition to a new system without worrying about losing access to archived data.

This decoupling also opens up competitive bidding opportunities, as archive solutions can be evaluated separately from primary storage. The result is a more adaptable storage infrastructure where organizations can select best-in-class solutions that meet their evolving needs.

Simplified Data Retrieval and Management

With explicit archiving, files are stored in a clearly defined archival system, making data retrieval more transparent and efficient. Organizations can use various methods to move and access files, including:

  • Site-specific scripts or tools tailored to their infrastructure.
  • NFS or SAMBA transfers.
  • Local moves via dual-mount point server nodes.
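
As a concrete illustration of the last option, a deliberate archive move between two mount points visible on the same node might look like the following sketch. The paths are illustrative (temporary directories stand in for the real primary and archive mounts), and the verification step reflects the explicit model's emphasis on intentional, user-driven data movement:

```shell
# Sketch of an explicit, user-driven archive move via dual mount points.
# Temporary directories stand in for the primary and archive filesystems.
PRIMARY=$(mktemp -d)
ARCHIVE=$(mktemp -d)
printf 'simulation output\n' > "$PRIMARY/results.dat"

# Copy to the archive, then verify the copy before releasing primary capacity
cp "$PRIMARY/results.dat" "$ARCHIVE/results.dat"
if cmp -s "$PRIMARY/results.dat" "$ARCHIVE/results.dat"; then
    rm "$PRIMARY/results.dat"
    echo "archived and verified"
fi
```

Because the user initiates the move, there is no ambiguity about where the file lives afterward: the archive copy is authoritative, and the primary namespace shrinks accordingly.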

Conclusion

While implicit archiving may seem convenient due to its automation, its hidden nature and tight coupling with the primary filesystem create numerous challenges, including performance degradation, vendor lock-in, and a lack of transparency for users. In contrast, the explicit archiving model offers a more deliberate and organized approach, promoting better storage efficiency, system reliability, and flexibility in vendor selection.

Organizations seeking to optimize their long-term data management strategy should carefully evaluate these models to ensure their storage infrastructure remains scalable, efficient, and adaptable to future needs.

The post Implicit vs. Explicit Archiving: A Deep Dive into Storage Management appeared first on Versity.
Enhancing End-to-End Data Integrity in ScoutAM with User-Supplied Checksums
https://www.versity.com/end-to-end-data-integrity-scoutam-user-checksums/ | Tue, 04 Feb 2025


Ensuring data integrity is at the heart of modern archival systems, especially for organizations managing critical or large-scale data workflows. We’re excited to announce a new feature in ScoutAM that strengthens its already robust data integrity capabilities: support for user-supplied checksums. This enhancement adds another layer of assurance to your workflows, ensuring data accuracy and reliability at every step.

Why Data Integrity Matters

In large-scale storage environments, data integrity is critical. Whether you’re managing petabytes of data for scientific research, archiving enterprise information, or supporting high-performance computing workloads, ensuring that file contents remain unchanged is non-negotiable. Any corruption or unintended modification of files can disrupt operations, introduce errors, or even lead to data loss.

ScoutAM has always prioritized data integrity with built-in checksum verification. For more background information on what a checksum is or how it works, check out our detailed explanation here.
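
In short, a checksum is a fixed-length digest of a file's contents: change even one byte and the digest changes. A quick shell illustration of why this catches silent corruption:

```shell
# A checksum is a fixed-size fingerprint of file contents:
# any change to the data produces a different digest.
f=$(mktemp)
printf 'important data\n' > "$f"
before=$(sha256sum "$f" | awk '{print $1}')

printf 'important data!\n' > "$f"   # simulate a silent modification
after=$(sha256sum "$f" | awk '{print $1}')

if [ "$before" != "$after" ]; then
    echo "checksum mismatch: contents changed"
fi
rm -f "$f"
```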

ScoutAM’s Comprehensive Checksum Support

ScoutAM has long supported popular cryptographic hash algorithms, including:

  • MD5
  • SHA1
  • SHA256
  • SHA384
  • SHA512
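
Digests in any of these formats can be produced ahead of time with standard coreutils before handing them to ScoutAM; for example:

```shell
# Generate each digest type listed above using standard coreutils.
f=$(mktemp)
printf 'versity\n' > "$f"
for tool in md5sum sha1sum sha256sum sha384sum sha512sum; do
    printf '%s: %s\n' "$tool" "$("$tool" "$f" | awk '{print $1}')"
done
rm -f "$f"
```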

These built-in checksums allow users to verify file integrity at various stages of the workflow. However, we recognize that many organizations rely on externally generated checksums to meet internal policies, regulatory requirements, or legacy system compatibility. This is where our new feature comes in.

Introducing User-Supplied Checksums

With this new capability, ScoutAM now allows users to supply their own independently generated file-level checksums during file ingestion or migration. This provides an additional layer of validation and seamlessly integrates with existing workflows.

How It Works:

  1. During File Ingestion: Users can supply pre-generated checksums (e.g., from an external system or prior validation process) alongside the files being ingested into ScoutAM.
  2. Checksum Verification: ScoutAM immediately verifies the supplied checksum to confirm file integrity during ingestion.
  3. On-Demand Validation: At any point, users can initiate checksum verification via the CLI or API to ensure that file contents remain intact.
  4. Automated Verification During Data Movement: Checksums can be automatically re-validated during data transfers within ScoutAM, such as during replication, migration, or retrieval processes.

By allowing users to import their own checksums, ScoutAM enhances compatibility with other systems and strengthens its end-to-end data integrity features.

CLI Example:

# md5sum /mnt/scoutfs/testfile
29ddb9ac92635ced72af4cb9c66c6803  /mnt/scoutfs/testfile
# samcli file checksum --type MD5 --set 29ddb9ac92635ced72af4cb9c66c6803 /mnt/scoutfs/testfile
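
For bulk ingestion, externally generated checksums can be collected into a manifest with standard tools and then replayed through the `samcli` command shown above (a sketch; the tree and manifest paths are illustrative):

```shell
# Build an MD5 manifest for a tree; each "digest  path" line can then be
# passed to `samcli file checksum --type MD5 --set <digest> <path>`.
tree=$(mktemp -d)
printf 'alpha\n' > "$tree/one.dat"
printf 'beta\n'  > "$tree/two.dat"

manifest=$(mktemp)
find "$tree" -type f -exec md5sum {} + > "$manifest"
cat "$manifest"

# Later, the same manifest re-verifies the tree end to end:
md5sum -c --quiet "$manifest" && echo "all files verified"
```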

API Example:

# TOKEN=$(curl -s -k -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' --data '{"acct":"admin","pass":"versity"}' https://scoutam.domain:8080/v1/security/login | jq -r '.response')

# curl -s -k -X POST -H "Authorization: Bearer $TOKEN" -H "Accept: application/json" -H "Content-Type: application/json" --data '{"path":"testfile","type":"MD5","value":"29ddb9ac92635ced72af4cb9c66c6803"}' https://scoutam.domain:8080/v1/file/checksum

Use Cases for User-Supplied Checksums

  • Legacy System Migrations: When migrating from existing systems (like Versity Storage Manager or other platforms), organizations can bring their pre-existing checksums to verify data consistency.
  • Regulatory Compliance: Industries with strict compliance standards often require checksum verification at multiple stages. User-supplied checksums provide an extra layer of assurance.
  • Data Integrity Auditing: Users can cross-verify file contents against externally generated checksums for complete confidence in data accuracy.
  • Custom Validation Workflows: Organizations that generate checksums as part of their internal validation or ingest pipeline can now seamlessly integrate this process into ScoutAM.

Why This Matters

End-to-end data integrity is one of the core pillars of reliable storage management. ScoutAM has long upheld this standard with robust built-in checksum verification. Now, by enabling user-supplied checksums, ScoutAM addresses critical vulnerabilities that can emerge during data ingestion and retrieval.

These vulnerabilities arise when checksums are generated solely within the archival system, leaving potential gaps in protection at key stages, such as:

  1. Data being copied to the archive filesystem cache.
  2. Data being archived to media, where the checksum is generated.
  3. Data being released from the filesystem cache.
  4. Data being staged from archival media with checksum verification.
  5. Data being retrieved from the archive filesystem cache to primary storage.

When checksums are generated in step 2, the interactions in steps 1 and 5 remain unprotected. This leaves the data source, network transmission, and storage hardware susceptible to corruption that could go undetected.

To close these gaps, ScoutAM empowers users or applications to generate and supply their own checksums at the point of ingestion. By validating these checksums immediately and throughout the data lifecycle, ScoutAM ensures unmatched end-to-end data integrity, giving you full confidence that your data remains accurate and reliable from ingestion to long-term storage.

By supporting user-supplied checksums, ScoutAM offers:

  • Greater Flexibility: Users can leverage their existing checksum workflows without any disruption.
  • Improved Assurance: Independent verification adds another layer of confidence in your data.
  • Seamless Integration: Checksums can be supplied, verified, and revalidated effortlessly through ScoutAM’s CLI and API.

Whether you’re moving data between systems, monitoring files over time, or ensuring compliance with industry standards, this new feature makes ScoutAM an even more powerful solution for managing data integrity.

Conclusion

With the introduction of user-supplied checksums, ScoutAM continues to set the standard for reliable, high-performance archive management. This new feature complements its existing suite of tools for data verification, providing users with maximum confidence that their data remains intact.

Learn more about ScoutAM here and how it meets the evolving needs of modern storage management.

The Benefits of Stateless Architecture in Versity S3 Gateway
https://www.versity.com/versity-gateway-stateless-architecture/ | Wed, 18 Dec 2024

The Versity S3 Gateway (versitygw) operates as a stateless service for handling S3-compatible requests. In simple terms, this means that each request is independent of the others, and no persistent session or state is maintained between requests. This stateless design brings a variety of benefits, particularly when it comes to scaling, load balancing, and ensuring high availability (HA) in your system architecture.

What Is Stateless Architecture?

In a stateless system, every request to the Versity S3 Gateway includes all the necessary information needed for the Gateway to process it. The system doesn’t rely on data from previous requests, making it highly efficient and scalable. This also applies to multipart uploads, which are often used for large object storage in S3-compatible environments. The initiate, upload parts, and complete stages can all be handled by any Versity S3 Gateway instance as long as they are connected to the same backend storage system.

In a stateless setup, large multipart uploads can be processed across multiple gateway instances simultaneously. This allows for greater flexibility and scalability, as the workload can be distributed across all available resources, limited only by the capacity of the backend storage system.

Horizontal Scalability Made Simple

One of the key advantages of the stateless architecture is its ability to scale horizontally. As your storage needs grow, you can easily add more Versity S3 Gateway instances to handle increased traffic without having to worry about complex state or session synchronization between those instances. Each new instance can immediately begin processing requests independently, allowing your system to handle growing demand seamlessly.

Efficient Load Balancing

The stateless nature of Versity S3 Gateway also makes load balancing straightforward and highly effective. Whether you’re using a software load balancer like HAProxy or a hardware load balancer like F5, statelessness allows for the smooth distribution of requests across all available instances.

Because no session affinity (or “sticky sessions”) is required, load balancers can:

  • Distribute requests evenly across all Versity S3 Gateway instances.
  • Route traffic to the least-loaded instance, optimizing system performance.
  • Add or remove instances dynamically without disrupting ongoing requests.

This creates a system that maximizes resource usage and prevents any one gateway from becoming a bottleneck.

Resilience and High Availability

A major benefit of stateless design is the inherent resilience it provides. If a Versity S3 Gateway instance were to fail, the load balancer can immediately redirect requests to another instance without any data loss or service disruption. This ensures high availability (HA) and fault tolerance, even in the face of unexpected failures.

In contrast, a stateful system would require additional mechanisms to synchronize states across multiple instances, increasing complexity and the risk of delays or data loss.

Simplified Maintenance and Upgrades

A stateless system also simplifies system maintenance. Individual Versity S3 Gateway instances can be taken offline for upgrades or repairs without affecting the overall system. You can perform rolling updates, upgrading, or replacing instances one at a time while others continue to handle requests, minimizing downtime and ensuring continuous service.

Infrastructure Flexibility

The flexibility of a stateless architecture means that the Versity S3 Gateway can be deployed in a variety of environments, whether on-premises, in the cloud, or as part of a hybrid setup. It works seamlessly with modern containerized environments like Kubernetes, where instances can be dynamically scaled based on demand without the need for session persistence or state synchronization.

Load Balancing Options

While Versity S3 Gateway’s stateless architecture simplifies scaling and high availability, load balancing is essential for distributing traffic efficiently. Here are the pros and cons of the most common load-balancing approaches:

  1. HAProxy (Software Load Balancer)
    HAProxy is an open-source load balancer known for its flexibility, performance, and configurability. It’s widely used for balancing TCP and HTTP traffic across multiple backend servers.
    • Pros: Cost-effective, highly configurable, high performance.
    • Cons: Complexity, often needs to be installed on every S3 client system.
  2. DNS-Based Load Balancing
    DNS-based load balancing uses multiple IP addresses for a single domain name, distributing traffic across various servers. However, it lacks advanced traffic management and service health checks, making it less suitable for services requiring continuous uptime.
    • Pros: Global distribution, simple setup, no single point of failure.
    • Cons: DNS TTL delays, limited traffic management, no health checks.
  3. Hardware Load Balancer (e.g., F5, Cisco)
    Hardware load balancers are dedicated physical devices that manage all traffic between clients and servers. While they deliver exceptional performance, they come with increased costs and operational complexity.
    • Pros: High performance, advanced traffic management, built-in redundancy.
    • Cons: Expensive, limited scalability, vendor lock-in.
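
As a sketch of the first option, a minimal HAProxy backend spreading traffic across several gateway instances might look like the following. The hostnames, ports, and listener address are illustrative, not versitygw defaults; note that no `cookie` or stickiness directives are needed precisely because the gateway is stateless:

```text
frontend s3_front
    bind *:8443
    mode http
    default_backend s3_gateways

backend s3_gateways
    mode http
    balance leastconn              # route each request to the least-loaded instance
    server gw1 10.0.0.11:7070 check
    server gw2 10.0.0.12:7070 check
    server gw3 10.0.0.13:7070 check
```

With `check` enabled, a failed instance is removed from rotation automatically, and new `server` lines can be added as gateway instances are scaled out.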

Conclusion

The Versity S3 Gateway’s stateless architecture offers a scalable, resilient, and highly available solution for S3-compatible storage environments. By eliminating the need for session persistence, it simplifies both scaling and load balancing, allowing for efficient horizontal scalability, easier maintenance, and improved fault tolerance. Additionally, a variety of load balancing options, from software-based to hardware solutions, can be deployed to ensure the best fit for your specific infrastructure.

This flexibility makes Versity S3 Gateway an excellent choice for organizations looking to build or expand their data storage capabilities, ensuring both performance and reliability as their needs grow.

New Customer Spotlight: Earth Sciences New Zealand Chooses Versity to Modernize Massive Weather Archive
https://www.versity.com/new-customer-spotlight-esnz/ | Thu, 29 Aug 2024

In an era where unstructured datasets are growing at an exponential pace, efficiently managing these vast and complex collections is more crucial than ever. Earth Sciences New Zealand is at the forefront of this challenge. Their mission to understand and protect the nation’s environment relies on a massive, ever-growing collection of unstructured data. This data is the lifeblood of their research, informing critical decisions about climate change, water resources, and marine ecosystems.

To meet this challenge, ESNZ required a storage solution that could not only accommodate their massive dataset but also ensure its long-term preservation, accessibility, and security. Versity’s innovative approach to data management addressed their specific needs and delivered a solution that safeguards New Zealand’s environmental legacy for generations to come.​​

Leaders in Environmental Science

Earth Sciences New Zealand is a cornerstone in understanding and safeguarding the nation’s environment. Their comprehensive data collection, from the ocean floor to the upper atmosphere, provides invaluable insights into climate patterns, freshwater systems, and marine ecosystems.

At the heart of their research capabilities is the new Cascade supercomputer, which triples the institute’s previous compute capacity, delivering 2.4 petaflops of performance across 61,440 CPU cores, supported by 240 TB of RAM and 7 PB of high-speed storage. This power enables researchers to run more complex simulations, produce higher-resolution forecasts, and incorporate AI-driven analyses to tackle pressing environmental questions. Designed for scalability and sustainability, Cascade is expected to save millions in energy costs over its lifetime.

Complementing this computational power is the Rapids archive: a robust, long-term data repository that begins with 22 PB of capacity and is designed to scale to 100 PB over the next decade. Rapids safeguards decades of critical environmental data — from atmospheric measurements to oceanographic surveys — ensuring that New Zealand’s scientific heritage is preserved for generations and readily available to researchers.

This vast and well-curated dataset is central to informed decision-making. By translating complex scientific findings into actionable insights, ESNZ equips policymakers and stakeholders to make data-driven choices that strengthen resilience to environmental change and guide the stewardship of New Zealand’s aquatic and terrestrial ecosystems. By harnessing the full potential of their data, ESNZ continues to pioneer innovative solutions that address environmental challenges and foster a sustainable future for the country.

Why ESNZ Chose Versity

ESNZ selected Versity to manage their Rapids archive due to our innovative approach to data management and our commitment to addressing their unique needs. At the heart of our solution is Versity’s ScoutAM, a cutting-edge mass storage platform designed to provide exceptional control and flexibility in data archiving. ESNZ needed a solution that gave them more control over their archive, and they favored ScoutAM’s explicit archiving model, which effectively decouples filesystems from tape storage. This allows for intentional data management while avoiding vendor lock-in, ensuring that ESNZ can adapt and expand their solution as their needs evolve without being tied to a single vendor.

Furthermore, our commitment to open formats guarantees ESNZ data ownership and independent access. This transparency enables ESNZ to maintain control of their data even if they choose to use another system in the future. 

To complement this foundation, our modern, high-performance tape management system offers enhanced recall times and superior data management capabilities. Versity’s ScoutAM stands out as the only mass storage platform that supports existing data formats, providing ESNZ with a seamless and efficient archiving experience. This eliminates the time-consuming and costly challenges of data migration, ensuring that ESNZ’s critical data remains secure and intact.

Data protection and accessibility are paramount for organizations like ESNZ. Our solution features automatic data replication between primary and secondary sites, which is essential for maintaining data integrity across locations. This capability ensures that ESNZ’s data remains consistently protected and readily available, meeting their high standards for reliable and efficient management.

Finally, ESNZ values our customer-centric approach and the open, influential support we provide. Our commitment to understanding and addressing their specific needs underscores our dedication to delivering a solution that supports their mission and enhances their data management capabilities. 

“We are pleased to have prevailed in the competitive bidding process to provide the high capacity data management technology for ESNZ in support of this crucial modernization effort,” said Bruce Gilpin, CEO of Versity Software. “ScoutAM’s advanced capabilities ensure that ESNZ can access its entire historical data collection while transitioning from its legacy storage systems. This enables ESNZ to convert to a modern platform without the pain of a traditional data migration, which is a key element of the Versity value proposition.”

The post New Customer Spotlight: Earth Sciences New Zealand Chooses Versity to Modernize Massive Weather Archive appeared first on Versity.

Beyond Backup: How An Integrated Archive Solution Can Tackle Backup Data Challenges https://www.versity.com/integrated-backup-solution-article/ Mon, 26 Aug 2024 19:13:41 +0000 https://www.versity.com/?p=2353 Discover how traditional backup systems, though vital, often fall short when dealing with massive datasets. By directing backup data to an archiving platform, organizations can overcome inefficiencies, reduce storage costs, and enhance data scalability. Learn how this innovative approach can optimize your data management strategy, ensuring both long-term preservation and swift recovery.

The post Beyond Backup: How An Integrated Archive Solution Can Tackle Backup Data Challenges appeared first on Versity.

]]>

As organizations manage ever-growing datasets reaching petabyte or exabyte levels, traditional backup methods are increasingly strained to keep up. The sheer scale of these datasets often reveals the limitations of standalone backup systems, which struggle with extended backup windows, performance inefficiencies, and higher storage expenses. Additionally, many backup vendors don’t specialize in tape technology, which often leads to systems delivering only a fraction of their potential performance.

To address these challenges, a new approach is gaining traction: directing backup data to an archiving platform rather than directly to storage. This stacked approach leverages the strengths of both backup and archiving systems, enabling more efficient data management and storage. By integrating backup solutions with advanced archival platforms, organizations can optimize data handling, improve scalability, and reduce costs.

In this article, we explore the evolving landscape of data management, focusing on the benefits of integrating backup and archiving solutions. We will examine how traditional backup systems manage data and the advantages of archiving for long-term data preservation and regulatory compliance. By providing a comprehensive overview of these strategies, we highlight how an integrated solution combining backup and archiving can create a more efficient, scalable, and cost-effective data management solution.

Backup: Ensuring Data Recovery

The primary goal of backup systems is to enable the recovery of primary data at specific points in time. This process involves creating multiple versions of a given data set, which allows for restoration in cases of hardware failure, data corruption, malware attacks, or accidental deletions. Backup data serves as a secondary copy and never replaces primary data.

A significant challenge of backup systems lies in managing extensive metadata. These systems must optimize for fine-grained versioning to achieve point-in-time recovery, allowing the system to revert to its precise state at the time the backup was captured. Unlike systems that prioritize high aggregate throughput or extreme scalability, backup systems focus on space efficiency through compression and deduplication, which eliminates redundant copies in data storage.

Because backup systems don’t prioritize efficient data streaming for these copies, an issue arises with larger data collections. As the volume of data increases, the time required to complete incremental and full backups—known as the “backup window”—extends, potentially leaving data vulnerable between cycles. Consequently, backups can become less efficient and harder to scale over time.
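The deduplication mentioned above can be illustrated with a minimal sketch: each chunk is identified by a content hash, and a chunk whose hash has already been seen is never stored twice. The class and names here are hypothetical, for illustration only, and do not represent any vendor's implementation:

```python
import hashlib

class DedupStore:
    """Toy chunk store that keeps one copy of each unique chunk."""

    def __init__(self):
        self.chunks = {}  # content hash -> chunk bytes

    def put(self, chunk: bytes) -> bool:
        """Store a chunk only if its content is unseen; return True if new."""
        digest = hashlib.sha256(chunk).digest()
        if digest in self.chunks:
            return False  # duplicate: point at the existing copy instead
        self.chunks[digest] = chunk
        return True

store = DedupStore()
incoming = [b"block-A", b"block-B", b"block-A"]  # one redundant chunk
new_flags = [store.put(c) for c in incoming]
print(f"received {len(incoming)} chunks, stored {sum(new_flags)} unique")
# prints "received 3 chunks, stored 2 unique"
```

Real deduplicating backup targets work on fixed or content-defined chunks of file data rather than whole logical blocks, but the space savings come from the same hash-and-skip principle.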

Archiving: Long-Term Data Preservation

Archiving, in contrast to backup systems, involves relocating valuable primary data from expensive, high-performance storage solutions to more cost-effective, long-term mass storage. This shift not only optimizes storage costs but also ensures that data is preserved for extended periods. Once data is archived, the archival copy becomes the master copy, meaning that the original data on the high-performance storage can be safely deleted. Essentially, while backup involves creating duplicate copies of data for recovery purposes, archiving is about moving data to a different location for long-term storage.

Data archiving is essential for several compelling reasons, the first of which is cost efficiency. Mass storage solutions used for archiving are generally less expensive and more scalable than high-performance storage, making it feasible to store large volumes of data without incurring prohibitive costs. This efficiency is vital for organizations as it allows them to manage storage expenses while still preserving valuable information for the long term. 

Furthermore, by segregating active data from inactive data, archiving improves data management by making it easier to manage and access current operational data. This optimizes space on primary storage systems, improving their performance and extending their lifespan by reducing the load and wear on these systems.

Beyond cost savings, archiving plays a crucial role in regulatory compliance. Many industries are subject to regulations that mandate data preservation for specific periods. Archived data is protected from loss, corruption, or unauthorized access, ensuring compliance and helping organizations avoid fines and legal issues. It also offers valuable historical insights, enabling trend analysis, informed decision-making, and AI model training.

Beyond its role in regulatory compliance and data analysis, archiving offers significant advantages in terms of data accessibility. Unlike backup systems, which involve a time-consuming restore cycle to recover data, archiving provides a more streamlined approach. Once data is archived, it is stored in a way that allows for faster retrieval and streaming, bypassing the lengthy restore processes associated with backups.

This makes archiving essential for disaster recovery and business continuity plans. Having an archived copy of critical data ensures an organization can quickly restore operations and access essential information after a disaster or system failure. Archiving also contributes to the preservation of knowledge and intellectual property. By addressing these diverse needs, data archiving supports a comprehensive data management strategy, ensuring information remains preserved, accessible, and cost-effectively managed.

Conclusion

As datasets continue to grow exponentially, traditional standalone backup solutions are increasingly insufficient. For organizations managing data at this scale, an integrated solution that combines the strengths of backup software and archival platforms becomes essential.

In such integrated solutions, backup software channels data to an archival system, which then applies advanced policies to efficiently organize and stream data to mass storage devices. By using the archival platform for advanced data handling, the backup system can scale to much greater capacities without hitting performance bottlenecks.  

Versity champions this integrated approach to data management, offering cutting-edge archiving functionalities that seamlessly integrate with leading backup solutions from partners such as Rubrik, NetBackup, Veeam, and Commvault. For instance, Versity’s ScoutAM integrates with Rubrik’s NAS Cloud Direct to archive petabyte-scale backups at GB/s throughput to cost-efficient media. This integration allows users to benefit from the economic advantages of cloud-scale storage while maintaining a secure, air-gapped on-premises solution.

Versity’s data management platform enhances this integration with advanced data lifecycle management policies that efficiently migrate data to lower-cost, high-capacity storage. By directing backup writes to a Versity mount point, our solution strategically sorts incoming data so that smaller files are stored on an object storage system while larger files are archived on tape, providing faster access to the object copies crucial for file retention checks. Versity supports both object and tape storage through the same file interface, eliminating the need for special gateways or object protocols to support cloud copies. This optimization of random incoming data streams into efficient streaming data to the archive maximizes storage resource utilization, reduces costs, and ensures long-term data preservation. Importantly, this method requires no changes to existing backup workflows; it simply works faster and better, delivering significant improvements in data management without disrupting operations. This Versity banking case study is a great example of how this approach delivers tangible benefits in real-world scenarios.
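The small-file/large-file split described above amounts to a size-based placement rule. The threshold, tier names, and function below are hypothetical illustrations of the idea, not ScoutAM's actual policy language or defaults:

```python
# Hypothetical size threshold and tier names, for illustration only.
SMALL_FILE_LIMIT = 64 * 1024 * 1024  # 64 MiB

def choose_tier(size_bytes: int) -> str:
    """Route small files (e.g., backup index/metadata) to object storage
    for fast retention checks; stream large files to tape."""
    return "object" if size_bytes < SMALL_FILE_LIMIT else "tape"

batch = [
    ("catalog.idx", 2_000_000),           # small metadata file
    ("full-backup-001.img", 9 * 10**11),  # large backup image
]
placements = {name: choose_tier(size) for name, size in batch}
print(placements)
# prints {'catalog.idx': 'object', 'full-backup-001.img': 'tape'}
```

The design intuition: small, hot files (indexes, catalogs) are read often and randomly, which suits object storage, while large, cold backup images stream sequentially, which is exactly what tape does best.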

Furthermore, the integration delivers a significant improvement in data restore times. Versity’s solution achieves this by meticulously differentiating between frequently accessed backup index/metadata files and less frequently accessed data sets. This critical distinction empowers the system to prioritize restoring essential index/metadata files, enabling significantly faster retrieval of the underlying data. This translates to minimized downtime and expedited recovery in the event of a disaster.

By combining Versity’s archiving solution with our partners’ backup solutions, we create a streamlined and effective data protection strategy that enhances both efficiency and recovery times. Our unwavering commitment lies in providing innovative solutions that empower organizations to safeguard their critical data while ensuring efficient management. We invite you to explore Versity’s comprehensive data management portfolio and discover how we can tailor a solution to achieve optimal data protection and preservation for your organization.

Looking Back, Reaching Forward: The Journey Behind the Versity S3 Gateway https://www.versity.com/looking-back-reaching-forward-the-journey-behind-the-versity-s3-gateway/ Mon, 24 Jun 2024 16:20:30 +0000 https://www.versity.com/?p=2183 Born from the need for seamless integration across diverse storage systems, the Versity S3 Gateway ensures high performance and scalability for large-scale data operations. Dive into the journey behind its development, from overcoming compatibility challenges to leveraging high-performance frameworks like Fiber. Explore the Versity S3 Gateway’s innovative features and real-world impact in our comprehensive article.

The post Looking Back, Reaching Forward: The Journey Behind the Versity S3 Gateway appeared first on Versity.


From the outset, the storage industry has wrestled with a pivotal challenge: achieving seamless data access across an ever-evolving and complex landscape of workflows and storage systems. As organizations embraced modern workflows and protocols, efficiently accessing data on a mix of on-premises, cloud-based, and computational storage devices became a daunting challenge. 

Initially, a solution was available through MinIO. However, changes to MinIO’s licensing made it nearly impossible to support a forked version that maintained this functionality. The subsequent deprecation of its NAS gateway feature forced users to seek alternative solutions, leaving a gap in readily available options for the community.

At Versity, we were motivated to address this gap, driven by the necessity for this functionality in our product portfolio and our dedication to open-source solutions and community support. Thus, the Versity S3 Gateway was developed to provide a solution for mixed protocols and storage systems, seamlessly translating between AWS S3 object commands to other storage systems, including file-based storage systems and Azure cloud storage. Our innovation aims to streamline data workflows and enhance data accessibility across diverse storage environments.

Developing the Versity S3 Gateway

The main challenge in developing the Versity S3 Gateway was the complexity of the S3 API, due to its numerous parameters and variations. This complexity has become so significant that SNIA, a standards organization, is attempting to create standardized tests to ensure consistent functionality across different S3-compatible systems. In the meantime, the Versity S3 Gateway tackles the complexities of the S3 API through its powerful front-end API handler. This handler can interpret the various incoming S3 requests and translate them into actions the underlying storage system understands.
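One way to picture the front-end handler's job is a minimal translation layer that maps S3-style object operations onto a POSIX directory tree (bucket becomes a directory, key becomes a relative path). The toy `FileBackend` below is a sketch of that concept only, with hypothetical names, and is not the Gateway's actual code:

```python
import os
import tempfile

class FileBackend:
    """Toy translation layer: S3-style object operations mapped onto a
    filesystem (bucket -> directory, key -> file path). A conceptual
    sketch, not the Versity S3 Gateway's implementation."""

    def __init__(self, root: str):
        self.root = root

    def _path(self, bucket: str, key: str) -> str:
        return os.path.join(self.root, bucket, key)

    def put_object(self, bucket: str, key: str, body: bytes) -> None:
        path = self._path(bucket, key)
        os.makedirs(os.path.dirname(path), exist_ok=True)  # key prefixes -> dirs
        with open(path, "wb") as f:
            f.write(body)

    def get_object(self, bucket: str, key: str) -> bytes:
        with open(self._path(bucket, key), "rb") as f:
            return f.read()

backend = FileBackend(tempfile.mkdtemp())
backend.put_object("archive", "sim/run1.dat", b"results")
print(backend.get_object("archive", "sim/run1.dat"))  # prints b'results'
```

The hard part the real Gateway solves sits above this layer: parsing and validating the many S3 request variants (signatures, multipart, tagging, listings) before they ever reach a backend call like these.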

Ensuring compatibility across a diverse array of storage systems was of paramount importance during the Versity S3 Gateway’s design and development. To address this, our development team implemented a modular and extensible architecture, allowing the Gateway to easily adapt to different storage systems and protocols. Collaboration with prestigious institutions like Los Alamos National Laboratory (LANL) and Pawsey Supercomputing Research Centre provided invaluable insights and testing grounds. This ensured the Gateway met the demanding performance and scalability needs of large-scale supercomputing environments.

Achieving high performance and scalability for large-scale data archiving was another major requirement. We tackled this by implementing a stateless, scalable architecture and optimizing the software, leveraging the high-performance gofiber framework. Rigorous performance benchmarking and stress testing further ensured the Gateway’s efficiency and ability to handle massive data volumes quickly.

Witnessing the Gateway deployed in customer production environments was a pivotal moment. It served as a testament to its practical value and readiness for real-world use. The project’s open-source nature fostered a collaborative spirit, with forks and valuable contributions from the community further refining the Gateway’s features and performance. This collaborative approach, coupled with our continuous engagement with the user community, ensures the product evolves in line with their needs. User feedback has been overwhelmingly positive, consistently highlighting the Gateway’s ease of use, simplicity, and impressive performance. Their demand for broader storage system support, diverse authentication methods, and improved metadata handling directly influenced product development.

After more than a year of development, our team is thrilled to announce the successful completion of the alpha and beta phases, culminating in the general availability of the Versity S3 Gateway. This marks a significant milestone, showcasing its readiness for diverse workflows and highlighting the collaborative effort behind its success.

Ensuring Optimal Performance

When designing the Versity S3 Gateway, performance considerations were paramount, especially for large-scale data operations. We needed a solution capable of seamlessly scaling to handle exabyte-sized collections for archiving. Therefore, the Gateway was built to manage high data ingest rates and large-scale data transfers with exceptional scalability and performance. Deploying multiple Versity S3 Gateway instances in a cluster can significantly increase aggregate throughput. Its stateless architecture ensures that any request can be serviced by any instance, effectively distributing workloads and enhancing overall performance.

Moreover, the Versity S3 Gateway leverages Fiber, a lightweight and high-performance HTTP server framework, to handle incoming requests. Compared to older web frameworks like gorilla/mux, Fiber offers significantly improved performance, resulting in faster processing and response times. This combination of a stateless architecture and a high-performance framework ensures the Versity S3 Gateway can efficiently manage large-scale data operations and deliver consistent, reliable performance at scale.
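The benefit of statelessness can be sketched in a few lines: because no instance holds per-client session state, a simple round-robin balancer can hand any request to any instance, and every instance sees the same underlying storage. The names below are hypothetical, not the Gateway's implementation:

```python
import itertools

class GatewayInstance:
    """Toy stateless gateway instance: it keeps no per-client session
    state, so any instance can serve any request against the shared
    backing store. (Hypothetical sketch, not the Gateway's code.)"""

    def __init__(self, name: str, backend: dict):
        self.name = name
        self.backend = backend  # shared underlying storage

    def handle_put(self, key: str, body: bytes) -> str:
        self.backend[key] = body
        return self.name  # report which instance served the request

backend = {}
cluster = [GatewayInstance(f"gw{i}", backend) for i in range(3)]
balancer = itertools.cycle(cluster)  # simple round-robin load balancer

served_by = [next(balancer).handle_put(f"obj{i}", b"data") for i in range(6)]
print(served_by)     # prints ['gw0', 'gw1', 'gw2', 'gw0', 'gw1', 'gw2']
print(len(backend))  # prints 6 — every object visible via the shared store
```

Because the answer is identical no matter which instance handled the request, adding instances raises aggregate throughput without any session-affinity bookkeeping at the balancer.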

To ensure Versity S3 Gateway’s stability and reliability for production use, we implemented a comprehensive testing and quality assurance process, which includes:

  • Automated Testing: Extensive automated test suites are run for each software build to identify and address potential issues early in the development cycle.
  • Performance Benchmarking: Rigorous performance testing is conducted to ensure the gateway can handle large-scale data loads and deliver consistent performance.
  • Stress Testing: The gateway undergoes stress testing to evaluate its behavior under extreme conditions and ensure it can maintain stability and reliability.
  • User Acceptance Testing: Engaging with a select group of users to validate new features and enhancements in real-world scenarios before general release.
  • Tagged Releases: Release tags automatically update the software release packages and Docker images, allowing customers to choose when to update production environments.

This robust process guarantees that the Versity S3 Gateway remains stable, reliable, and ready for production use.

Using the Versity S3 Gateway in Archiving

In addressing the needs of data archiving and long-term storage, the Versity S3 Gateway integrates with ScoutAM, our commercial mass storage data management platform. This powerful combination simplifies data management workflows and reduces costs by allowing users to efficiently and cost-effectively store, retrieve, and manage vast data volumes across diverse mass storage systems.

The integration offers several key benefits. Firstly, the Versity Gateway optimizes data uploads by minimizing data reads and writes. During multipart uploads, data parts are written directly to the underlying storage only once and then assembled into a single file at the upload’s completion. This eliminates a full read/write cycle, potentially doubling performance for large object ingestion.
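One way the single-write assembly described above can work is to write each part directly at its final offset in the destination object, so that completing the upload is a bookkeeping step rather than a second read/write pass. This is a hedged sketch of the concept, not the Gateway's actual mechanism:

```python
import os
import tempfile

class MultipartUpload:
    """Sketch of single-write multipart assembly: each part lands at its
    final offset in the destination file, so complete() copies nothing.
    (Illustrative only; assumes all parts except the last share one
    fixed part size, as in S3 multipart uploads.)"""

    def __init__(self, path: str, part_size: int):
        self.path = path
        self.part_size = part_size
        open(path, "wb").close()  # create the empty destination object

    def upload_part(self, number: int, data: bytes) -> None:
        # Part numbers start at 1, as in the S3 API.
        with open(self.path, "r+b") as f:
            f.seek((number - 1) * self.part_size)
            f.write(data)

    def complete(self) -> str:
        return self.path  # parts are already in place — nothing to copy

path = os.path.join(tempfile.mkdtemp(), "object.bin")
upload = MultipartUpload(path, part_size=4)
upload.upload_part(2, b"wxyz")  # parts may arrive in any order
upload.upload_part(1, b"abcd")
with open(upload.complete(), "rb") as f:
    print(f.read())  # prints b'abcdwxyz'
```

Contrast this with the naive approach of writing each part to a temporary file and then concatenating them at completion, which reads and rewrites every byte a second time.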

Additionally, the Versity S3 Gateway supports “Glacier Mode,” enabling organizations to leverage ScoutAM for a tiered storage approach. Less frequently accessed data can be seamlessly moved to lower-cost storage tiers, empowering organizations to manage storage expenses effectively. The Versity Gateway’s compatibility with the Glacier Mode API ensures that organizations can seamlessly integrate their current data lifecycle client workflows with ScoutAM’s robust file storage and management capabilities. Hence, combining the strengths of Versity S3 Gateway and ScoutAM delivers enhanced performance, scalability, and seamless access for both object and file workflows.

The Gateway’s Future

The vision for Versity S3 Gateway’s future includes continuing to innovate and enhance the tool to meet the evolving needs of our users and the community. We aim to expand its capabilities by integrating with emerging storage technologies and ensuring community feedback remains a crucial factor in shaping our roadmap. Our goal is for the Versity S3 Gateway to become the de facto standard for S3 compatibility. As a crucial component of our product portfolio for data archiving and long-term storage solutions, we hope to see it adopted in similarly essential roles within the community and deployed in production environments worldwide.

Customer Spotlight: How LANL Leverages the Versity S3 Gateway For Supercomputing Applications https://www.versity.com/customer-spotlightlanl-and-the-versity-s3-gateway/ Fri, 14 Jun 2024 21:47:52 +0000 https://www.versity.com/?p=2104 Massive scientific datasets slow research at Los Alamos National Lab. The Versity S3 Gateway solves this by bridging the gap between storage systems, allowing researchers to directly analyze data using familiar commands. This translates to faster analysis, reduced bottlenecks, and deeper scientific discoveries. Learn how LANL unlocked the power of their data and see how the Versity S3 Gateway can accelerate your research.

The post Customer Spotlight: How LANL Leverages the Versity S3 Gateway For Supercomputing Applications appeared first on Versity.


Los Alamos National Laboratory stands at the forefront of national security research. Since its inception in 1943, LANL has consistently pushed the boundaries of scientific understanding, playing a pivotal role in shaping the modern world. From its historic contributions to the Manhattan Project to its ongoing leadership in cutting-edge fields like computational science, LANL has established itself as a cornerstone of scientific progress.

Today, LANL’s High-Performance Computing (HPC) division exemplifies this commitment to innovation, continuously pushing the boundaries of extreme-scale supercomputing and enabling researchers to tackle the exabytes of data generated by cutting-edge scientific research.

This drive to leverage new technologies led to the adoption of the Versity S3 Gateway. This solution bridges the gap between object protocols and various storage backends, including computational storage. It allows researchers to directly access and query simulation data from NVMe storage devices using S3 commands and workflows. Pushing data reduction functions closer to the storage devices saves power and time, allowing analytics functions to be performed on a much smaller analytics cluster rather than the traditional ‘big iron’ HPC machines.

Challenges with Scientific Data Analytics

Scientific research routinely generates massive datasets, often exceeding petabytes in size for one time step of a single simulation that might capture thousands of time steps. This sheer volume of data presents significant challenges in the realm of scientific data analytics.

Firstly, moving these datasets to analytics applications is time-consuming and expensive, especially since scientific queries typically focus on small data portions. Furthermore, the limitations of legacy data analysis workflows exacerbate this challenge. Traditional workflows necessitate transferring all raw scientific data associated with a query result to the application, demanding that the application execute analysis code on the entirety of the dataset. This leads to unnecessary overhead and undue strain on computational resources.

Using the Versity S3 Gateway for OCS

To address these limitations, LANL developed a novel approach. They envisioned a system where, upon query initiation, data processing occurs directly on a dedicated computational storage device. This device would then transmit only the relevant results to the host application, thereby significantly reducing unnecessary data movement.
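The contrast between the legacy workflow and the pushdown approach can be sketched with a toy in-memory "device." Real OCS devices evaluate predicates against NVMe blocks; here a plain list stands in for the storage, and the numbers are illustrative only:

```python
# Toy contrast between a legacy pull-everything workflow and query pushdown.
dataset = [{"step": t, "temp": 300 + t % 7} for t in range(1000)]

def legacy_query(storage, predicate):
    """Legacy: move the whole dataset to the host, then filter there."""
    transferred = list(storage)  # full copy crosses the interconnect
    return [row for row in transferred if predicate(row)], len(transferred)

def pushdown_query(storage, predicate):
    """Pushdown: the 'device' filters in place; only matches move."""
    matches = [row for row in storage if predicate(row)]
    return matches, len(matches)

hot = lambda row: row["temp"] > 305
legacy_rows, legacy_moved = legacy_query(dataset, hot)
push_rows, push_moved = pushdown_query(dataset, hot)
assert legacy_rows == push_rows  # identical answers...
print(f"legacy moved {legacy_moved} rows, pushdown moved {push_moved}")
# prints "legacy moved 1000 rows, pushdown moved 142"
```

Both paths return the same result set; the difference is how many rows cross from storage to host, which is exactly the cost that dominates when queries touch small portions of petabyte-scale data.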

LANL leverages an object-based computational storage (OCS) infrastructure, which allows NVMe devices to directly access and interpret data blocks, a capability necessary for query pushdown. This system simplifies the mapping between data and NVMe blocks compared to traditional file systems. LANL partnered with SK Hynix, leveraging their advanced memory solutions, to develop this advanced computational storage device capable of handling query pushdown and data analytics.

However, to push analytic functions down from the logical object view users have of their data to a block-based NVMe device, a translation has to be made. The Versity S3 Gateway facilitates seamless communication between disparate storage systems and enhances query pushdown capabilities. Combined with Apache columnar analytics tools, it bridges the gap between storage technologies, enabling efficient data analysis on massive datasets.

The Versity S3 Gateway streamlines scientific workflows by eliminating data transfers between object storage and NVMe. It removes server input/output (I/O) bottlenecks and improves data access times, allowing a single host to manage petabyte-scale data volumes efficiently. This marks a significant advancement in object data processing capabilities, resulting in faster analysis times, improved research productivity, and deeper scientific insights.

“We are thankful that Versity engaged to produce a flexible and performant S3 gateway that enabled our exploration of push-down analytics at scale,” said Dominic Manno, lead of hot storage research at LANL. “Versity’s open community gateway technology has and will play a part in our journey toward providing next-generation at-scale analytics that leverage the Apache ecosystem.”

Conclusion

Scientific research, particularly at institutions like LANL, often grapples with managing and analyzing massive datasets. These exabyte-sized datasets can be prohibitively expensive to move and analyze, hindering the pace of scientific discovery.

The Versity S3 Gateway bridges the traditional gaps between disparate storage technologies, significantly enhancing the efficiency and scalability of LANL’s HPC applications. By streamlining the integration of object storage and computational storage devices like NVMe, the Gateway accelerates data access, reduces bottlenecks, and empowers researchers to handle large data volumes more effectively.

As LANL continues to lead in computational science and national security research, the Versity S3 Gateway stands out as a critical component in their technological arsenal, driving faster research outcomes and enabling deeper, more insightful scientific discoveries. This advancement underscores LANL’s commitment to maintaining its status as a cornerstone of global scientific progress and innovation.
