Identifying Unstructured Data

The Headache for a Business

Unstructured data is the biggest headache today for any organisation trying to control and manage data. It accounts for over 70% of all information stored and is growing at 61% per annum!

How IT departments deal with this information when they are struggling with reduced budgets and headcount remains to be seen.

Hopefully this white paper will give you a greater insight into planning for and managing unstructured data.

Unstructured Data Identification

Firstly, let us understand what we are dealing with. This type of data is the information which is typically not stored in a database.

Unstructured Information manifests itself in two ways:

“TEXT” can be e-mail, text messages, Word documents, presentations, messaging systems, Twitter, Facebook etc.
“RICH MEDIA” can be images, sound files, movie files etc.

As we have explained, unstructured information consumes vast amounts of storage, but another consideration is legislation. Where this data resides is important if you need to retrieve the information for a compliance audit or lawsuit.

Definition

data can be of any type
not necessarily following any format or sequence
does not follow any rules
is not predictable

Structured Data

This type of data is organised and easily accessible such as databases and large search indexes. This type of data is also very fast to retrieve and interrogate for analysis or usage patterns.

Definition

data is organised in semantic chunks (entities)
similar entities are grouped together (relations or classes)
entities in the same group have the same descriptions (attributes)
descriptions for all entities in a group (schema)
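
To make the definition concrete, here is a minimal Python sketch. The Customer class, its fields and the sample values are hypothetical and exist only for illustration. It shows entities grouped in a class, sharing the same attributes, with a schema that can be queried, next to a piece of free text that has none of these.

```python
from dataclasses import dataclass, fields


# Hypothetical entity, used only to illustrate the definition above.
@dataclass
class Customer:
    customer_id: int  # attribute
    name: str         # attribute
    country: str      # attribute


# Similar entities grouped together (a relation or class).
customers = [
    Customer(1, "Acme Ltd", "UK"),
    Customer(2, "Globex", "DE"),
]


def schema(entity_type):
    """Return the attribute names shared by every entity in the group."""
    return [f.name for f in fields(entity_type)]


def find_by_country(group, country):
    """Structured data can be queried directly by attribute."""
    return [c for c in group if c.country == country]


# By contrast, free text carries no schema to query against.
email_body = "Hi, Acme Ltd here, we are based in the UK..."
```

The same customer name and country appear in email_body, but nothing tells a program where to find them, which is the practical difference between the two kinds of data.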

The Discovery of Unstructured Data

How organisations identify this data is vitally important in establishing whether it has intrinsic value to the business or is the next lawsuit waiting to happen. Firstly, we need to identify the types of unstructured information and where it currently resides. From this we can plan how to answer the following questions:

How much unstructured information do we have?
How many copies of the same file do we have?
On which systems and data storage platforms does the information reside?
When was it created?
When was it last accessed?
What size is the file data?
Who owns the files?
When was it last modified?
Is the data relevant to the business?
Do the files need to be archived?
Should the data be restricted?
Who is generating this data?
Is the data ours?
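
Several of the questions above (how much data, how many copies, file size, owner, last modified, last accessed) can be answered from file system metadata. The following is a minimal Python sketch using only the standard library, not a description of any particular discovery product. The function names scan and duplicates are hypothetical. It makes two assumptions: last-access times are only as reliable as the file system's mount options allow, and a portable creation time is not available, so it is left out.

```python
import hashlib
import os
from collections import defaultdict
from datetime import datetime, timezone


def file_digest(path, chunk_size=1024 * 1024):
    """Hash file contents so identical copies can be grouped."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def scan(root):
    """Walk a directory tree and collect basic metadata for each file."""
    records = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
                digest = file_digest(path)
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            records.append({
                "path": path,
                "size_bytes": st.st_size,
                "owner_uid": st.st_uid,  # numeric owner id; always 0 on Windows
                "modified": datetime.fromtimestamp(st.st_mtime, timezone.utc),
                "accessed": datetime.fromtimestamp(st.st_atime, timezone.utc),
                "sha256": digest,
            })
    return records


def duplicates(records):
    """Group paths by content hash and keep groups with more than one copy."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["sha256"]].append(rec["path"])
    return {h: sorted(paths) for h, paths in groups.items() if len(paths) > 1}
```

Questions such as whether the data is relevant to the business, or whether it is ours, cannot be answered from metadata alone and still need a person or a classification policy.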

Now What?

We have now identified our unstructured data, but what can we do with it? You should have a report informing you of the types of data that reside on your network. The trick is to find out the value of the data and how we can use it for business benefit. Once we understand this, we need to decide where to store it. If the data is valuable you might need to keep it in two different locations on two different types of storage. Moving unstructured information into some sort of structured form is a big challenge for any organisation.

Assuming we can move the data, we have the logistical issue of deciding the following:

stub the file - this leaves a link of where the file has moved to now
move the file
delete the file
migrate the file – rules determine what to do with it next
copy the file
back up the file
archive the file
any combination of the above
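
As an illustration of how such a decision might be driven by rules, here is a minimal Python sketch. The function name choose_actions and both thresholds are hypothetical. The 60-day inactivity figure is borrowed from later in this paper and the seven-year retention period is an assumption for illustration only, so substitute values from your own policy.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds for illustration; set these from your own policy.
ARCHIVE_AFTER = timedelta(days=60)
RETENTION = timedelta(days=365 * 7)


def choose_actions(last_accessed, business_relevant, now):
    """Return a combination of the actions listed above for a single file."""
    age = now - last_accessed
    if not business_relevant and age > RETENTION:
        return ["delete"]
    if age > ARCHIVE_AFTER:
        # Move the file to the archive and leave a stub pointing to it.
        return ["archive", "stub"]
    return []  # still active: leave the file where it is


now = datetime(2024, 1, 1)
print(choose_actions(now - timedelta(days=90), True, now))
```

A real policy engine would also take file type, owner and legal hold status into account. The point of the sketch is only that each file ends up with an explicit, repeatable decision.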

Where to store unstructured data?

We all know Tier 1 storage is expensive to purchase and maintain. So why do we store inactive data on it? We now have the tools and technology to move this inactive data to more cost-effective storage tiers.

Performance Tier 1

This is the storage tier that normally runs the fastest disk or SSD drives with the emphasis on outright performance. These systems have 99.999% availability and run the company databases, applications and user files. The time to access these files is typically milliseconds.

Capacity Tier 2

These systems normally run 7.2k rpm drives that are either SAS or SATA. They use high-density RAID arrays, and performance is typically about 30% slower than Tier 1. They can provide 99.99% uptime and can have dual RAID controllers. The time to access these files is typically milliseconds.

Archive Tier 3

This tier typically consists of storing data on a tape library or an optical jukebox using Blu-ray media.

Cloud Tier 4

The Cloud is another option for storing data, although a cost of around £0.01 per GB per month (£10 per TB per month) makes this expensive. The time to access these files is typically minutes or hours depending on WAN performance and how your cloud provider has stored your archive data.
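
The arithmetic behind the cloud figure can be sketched in a few lines of Python. The function names are hypothetical, the 50 TB and five-year values in the usage line are illustrative only, and the sketch covers storage charges alone. Any retrieval or egress fees a provider may charge are not modelled.

```python
PRICE_PER_GB_MONTH = 0.01  # GBP per GB per month, the figure quoted above
GB_PER_TB = 1000           # decimal terabytes, as used in storage pricing


def monthly_cost(capacity_tb, price_per_gb=PRICE_PER_GB_MONTH):
    """Monthly storage charge in GBP for a given capacity in TB."""
    return capacity_tb * GB_PER_TB * price_per_gb


def total_cost(capacity_tb, months, price_per_gb=PRICE_PER_GB_MONTH):
    """Storage charges only, over a whole retention period."""
    return monthly_cost(capacity_tb, price_per_gb) * months


# Illustrative values only: 50 TB kept for five years.
print(round(total_cost(50, 60), 2))
```

Running the same calculation against the purchase and running costs of tape or optical media over the same retention period gives a like-for-like comparison.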

Depending on the type of data and the required retention period, consideration needs to be made regarding the use of appropriate storage technology.

Existing IT Investment

Companies spend a huge amount of money on purchasing storage and servers. The investment in these solutions is growing year on year. Recent reports indicate that by 2025 we will be purchasing twice as much storage capacity as we are today. These systems are typically retained for 3-5 years and then replaced.

Implementing a tiered data archive containing unstructured information, and moving this through the different storage tiers, frees up valuable disk space on the most expensive, highest-performing storage. By moving this data we can slow down the ongoing investment in purchasing Tier 1 storage, giving a huge ROI benefit. An additional benefit of active archiving is that you may be able to utilise your existing older storage systems to archive data.

Where we store this unstructured data is an important consideration when looking at the overall IT budget and available resources in order to explore the business benefits and cost savings.

Energy Savings

Typically, 70% of stored information has not been accessed within 60 days. Moving this data to optical, tape or even a high-capacity SATA RAID array will save a considerable amount of energy.

It is a well known fact:

1 WATT CONSUMED = 1 WATT TO COOL (3.412141633 BTU/h)
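
To show what this rule of thumb means in practice, here is a minimal Python sketch. The function names and the 8 watts per drive used in the example are assumptions for illustration, not measured figures. It also assumes, as a simplification, that a powered-down drive draws no power at all.

```python
BTU_PER_HOUR_PER_WATT = 3.412141633  # conversion quoted above
HOURS_PER_YEAR = 24 * 365


def heat_btu_per_hour(watts):
    """Heat output, in BTU/h, of equipment drawing the given power."""
    return watts * BTU_PER_HOUR_PER_WATT


def annual_kwh_saved(drives_powered_down, watts_per_drive):
    """Energy saved per year, counting one watt of cooling per watt consumed."""
    it_watts = drives_powered_down * watts_per_drive
    total_watts = it_watts * 2  # rule of thumb: equal power again for cooling
    return total_watts * HOURS_PER_YEAR / 1000


# Illustrative values only: 60 drives at an assumed 8 W each.
print(annual_kwh_saved(60, 8))
```

Multiplying the result by your own price per kWh turns the saving into a monetary figure.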

Many RAID storage arrays provide a function called MAID (Massive Array of Idle Disks). In effect, the array powers down the volumes which are not being used or accessed, thus saving energy. The largest shipping MAID system we supply is a 4U rack mount model containing 60 or 64 drives depending on host interface.

Data Explosion

Clearly the growth of unstructured information is the number one problem facing IT managers today. The issue is how to control and manage this growth in an organised manner that provides a long term business benefit.

Storage arrays are offering higher performance and greater disk densities with 20+ TB drives shipping this year, LTO-9 tape storing 18TB native and Blu-ray now at 125GB.

Non Disruptive

Organisations cannot afford to be offline for prolonged periods of time. Scheduling downtime can take weeks and, if it doesn’t go according to plan, can be a huge headache come Monday morning. The systems we supply are non-disruptive, so users are unaware that data has been moved to a more efficient platform for storing unstructured data, and the business benefits from better utilisation of all available storage resources.

Data Storage Solutions

The cost-effective solutions we supply can be tailored according to your firm’s requirements and budget constraints. We believe the solutions we have from our vendors will provide the following:

Cost savings
Energy savings
ROI savings
Decreased backup times
Free up valuable Tier 1 disk space
Non disruptive to users
Enable identification of data for business governance

Professional Services

We are able to offer comprehensive professional services to understand your current infrastructure and assist with any requirements you have.