
Data Hygiene Is a Must for Secure & Effective AI Models

It's been reported that OpenAI's GPT-3 was trained on approximately 45 terabytes of compressed data. GPT-4, by contrast, reportedly used a dataset of around 1 petabyte. That's a roughly 20-fold increase in the amount of raw data needed in just three years.

If GPT-3 were a two-story house, GPT-4 would be larger than the 35-story Majakka skyscraper in Helsinki.

Why should cybersecurity professionals be concerned? Because as more enterprises develop their own vertical AI models on proprietary data, the security risks engendered by these increasingly massive systems are also growing.

Secure, effective AI systems must strike a balance between keeping and using valuable data and continuously sanitizing what is no longer needed.

Start by Targeting ROT Data

Given the amount of data required to build an AI model, it may be tempting to extend retention limits in the belief that all your retained information may at some point be useful.

But indiscriminately hoarding redundant, obsolete, and trivial (ROT) data is bad for an organization—and for any AI model it wants to create.

Storing ROT data inflates storage costs, drives up Scope 3 CO2 emissions, and potentially degrades your AI outputs. To extend the building metaphor above, if you build your AI model from substandard construction materials, it's likely to fall over faster: garbage in, garbage out.

From a security perspective, ROT data increases risk by growing your attack surface. And ultimately, whether through hacks, insider misuse, or accidental leaks, a breach involving ROT data could widen the circle of individuals or organizations harmed and complicate your response and remediation.

A rigorous data hygiene program starts by auditing what exists. This means identifying duplicates, categorizing sensitive content, and enforcing retention rules.
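
To make this concrete, here is a minimal Python sketch of such an audit pass. It is not a Blancco product feature: it assumes a plain on-disk file tree, and the seven-year retention window and function names are illustrative assumptions, not policy guidance.

    import hashlib
    import os
    import time
    from collections import defaultdict

    RETENTION_SECONDS = 7 * 365 * 24 * 3600  # assumed 7-year retention window

    def audit(root):
        """Group files by content hash and flag files past retention."""
        by_hash = defaultdict(list)  # content hash -> list of file paths
        expired = []                 # files older than the retention window
        now = time.time()
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                digest = hashlib.sha256()
                with open(path, "rb") as f:
                    # Hash in 1 MiB chunks so large files don't exhaust memory
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        digest.update(chunk)
                by_hash[digest.hexdigest()].append(path)
                if now - os.path.getmtime(path) > RETENTION_SECONDS:
                    expired.append(path)
        # Any hash with more than one path is a set of exact duplicates
        duplicates = {h: paths for h, paths in by_hash.items() if len(paths) > 1}
        return duplicates, expired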

Automating the wiping of unnecessary files, folders, and LUNs (logical storage volumes) ensures that only relevant, high-quality data feeds AI systems, while expired or irrelevant data is securely erased.
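
For the wiping step, a deliberately simplified sketch, assuming a single flagged file on a conventional magnetic drive. On SSDs and copy-on-write or journaling filesystems an overwrite like this is not verifiable sanitization, which is exactly why certified tooling with verification exists.

    import os

    def overwrite_and_delete(path, passes=1):
        """Overwrite a file with random bytes, then unlink it.

        Illustrative only: NOT verifiable sanitization on SSDs or
        copy-on-write / journaling filesystems.
        """
        size = os.path.getsize(path)
        with open(path, "r+b") as f:
            for _ in range(passes):
                f.seek(0)
                remaining = size
                while remaining:
                    # Write random data in 1 MiB chunks
                    n = min(remaining, 1 << 20)
                    f.write(os.urandom(n))
                    remaining -= n
                f.flush()
                os.fsync(f.fileno())  # force the overwrite down to the device
        os.remove(path)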

AI & the Hardware Lifecycle

Blancco's 2025 State of Data Sanitization report found that 83% of enterprises have deployed some form of AI, and, of those, 97% have upgraded either endpoints or data center equipment as a result.

The acceleration of refresh cycles means servers, laptops, and other devices are retired more quickly, and each decommissioned asset is a potential breach point.

Physical destruction alone is often not enough, as a recent case in the U.S. highlighted: a former IT asset disposition (ITAD) employee stole federal government devices and resold them instead of sending them for disposal.

A weak chain of custody, incomplete erasure, and falsified destruction certificates can all cause real-world data exposures, so it's essential to focus on permanent data erasure at the point of decommissioning.

Even if your IT assets are being sent to the shredder, erasing them first with a certified, software-based data sanitization solution provides security for you and verifiable proof for regulators. Incorporating secure erasure into IT workflows ensures no device leaves the organization exposed.
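
To illustrate what "verifiable proof" can look like at the record level, here is a hypothetical erasure-record structure. It is not Blancco's certificate format; the fields and names are assumptions for illustration. The idea is simply that each erasure event is logged with a digest so later tampering is detectable.

    import hashlib
    import json
    from datetime import datetime, timezone

    def make_erasure_record(device_serial, method, operator):
        """Build a tamper-evident record of one erasure event."""
        record = {
            "device_serial": device_serial,
            "method": method,          # e.g. "overwrite + verify" (assumed label)
            "operator": operator,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        # Hash the canonical JSON form so any later edit changes the digest
        payload = json.dumps(record, sort_keys=True).encode()
        record["digest"] = hashlib.sha256(payload).hexdigest()
        return record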

Seek Continuous Improvement in Data Governance

Fully capitalizing on future generations of AI tools means baking in data compliance, retention, and sanitization rules now.
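
One way to bake those rules in is to declare them as code that data pipelines can enforce automatically. A minimal sketch, with assumed data categories and retention windows rather than legal guidance:

    from dataclasses import dataclass
    from datetime import timedelta

    @dataclass(frozen=True)
    class RetentionRule:
        category: str             # class of data the rule covers
        keep_for: timedelta       # how long it may be retained
        sanitize_on_expiry: bool  # securely erase (not just delete) at expiry

    # Hypothetical categories and windows, for illustration only
    RULES = [
        RetentionRule("training_snapshots", timedelta(days=730), True),
        RetentionRule("audit_logs", timedelta(days=2555), False),
        RetentionRule("rot_candidates", timedelta(days=90), True),
    ]

Declaring retention this way means the same rules drive storage audits, AI dataset curation, and decommissioning, instead of living in a policy document no system reads.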

For CISOs and IT leaders, the challenge is clear: treat your data hygienically, secure your hardware fully, and let AI deliver results without creating new security risks.

By taking a “hygiene by design” approach similar to the “secure by design” school, organizations can stay more effective and more secure even as regulations change, datasets grow, and threats become more sophisticated.

Author: Fredrik Forslund
VP & GM, International, Blancco
