Storing Increasingly Massive Amounts of Data
The digital transformation is taking place everywhere, from smart fridges to car telematics. Data is being generated faster than ever before and, whether it comes from a corporate mail system or a social media site, it resides on a physical data storage platform somewhere. This document focuses primarily on corporate information and how that data is secured, backed up and used.
Data in a company covers many types: audio/video, paper, office documents, pictures, .PDF, .PST and many more. Each of these file extensions requires a specific program to read the file and present it to the user.
The worldwide demand for data storage is expected to increase by an average of 700% over the next 5 years. Drivers and indicators of this growth include:
- IoT (Internet of Things)
- Technology making it easier to capture and store information
- Increased processing and imaging power
- New and emerging technologies: VR, AR, AI, 4K, 8K, holographics, self-driving vehicles and robots.
- The retail industry alone will experience data growth of 900% within 5 years.
- 2 billion photos are shared daily on Snapchat, WhatsApp, Facebook, Instagram and Flickr.
- 65% growth in mobile data traffic over the last 12 months, driven by video
- 59% of companies do not fully understand their data storage estate
No matter where you look, information is growing at a faster rate than at any time in history, and it will only increase. Trying to perform housekeeping on this information is problematic.
Data Storage Problems
There are many different types of storage systems available today to store and protect your data, from NAS and SAN to tape and the ubiquitous cloud. All provide an area for you to run applications and help keep the data secure and accessible. Storing data is easy; maintaining access and housekeeping is the problem for companies of all sizes. Storage is cheap and capacities are increasing, so why not move everything across to the new storage platform and store it forever? Data is constantly being created: businesses acquire competitors and integrate them along with all their messaging, files and information, staff come and go, and applications are phased in and out. So the cycle continues.
Is storing all data the solution?
Storing information is expensive. Costs include data migrations from legacy systems to new ones, investment in new LAN/WAN links, software licensing, and support and maintenance contracts. The management of all this information is a mammoth task for any IT department.
- Over 90% of data is never accessed again after 90 days
- Data volumes are growing at 61% annually
- Managing data is a liability.
- Only 5% of all information collected is analysed
There are many cloud storage vendors to choose from, including Microsoft Azure and Amazon AWS to name a few. These cloud providers offer seemingly endless amounts of storage space at a relatively affordable price, with different performance tiers of storage to suit a company’s budget. No longer do you need to worry about maintenance and support contracts; you just pay a monthly service charge.
Cloud companies such as Google, Microsoft or Amazon do not use data silos or RAID; they use object storage to store vast amounts of data of any type and size.
Object storage is a method for storing millions of files and petabytes of data on storage nodes using a global namespace and file system. It is vastly more scalable than traditional file or block storage.
Need additional disk capacity? Just add another storage node. All object data is automatically distributed across the nodes, and you can create object replicas or shard data (see below) across the nodes to aid recovery and increase performance.
Object storage can provide up to 75% cost savings compared to primary NAS filers whilst providing 100% business continuity.
How does object storage work?
Object storage requires 3 things to work:
- The data, which could be anything
- Metadata, which provides a customisable table and index of the object being stored
- A globally unique identifier assigned to every object
Object storage uses metadata to manage the millions or billions of objects it contains. Metadata is created by the administrator of the object store and contains information relating to each type of object being stored. For a photo, for example, the metadata could record when, where and with which camera it was taken.
The metadata can contain far more information about the object than file or block storage would hold. Everything stored within your object storage is given a unique identifier, and every piece of data has the metadata and file combined into a single object.
An object could be any type of file. When it comes to object storage, metadata resides in the objects themselves, so there is no need to build databases to associate metadata with the objects. Custom metadata can be created about an object based on contents, dates, user information, permissions, etc. Attributes can be changed, added or deleted over time. Metadata is highly searchable and allows you to find files far faster than file or block storage. You can conduct searches that return a set of objects meeting specific criteria, such as what percentage of objects are of a certain type or were created by owner “x”. This allows companies to extract insights from the data they possess and identify trends.
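The three ingredients described above (data, metadata and a unique identifier) can be sketched in a few lines of Python. This is only an illustration, not how any particular vendor implements object storage: the `StoredObject` and `ObjectStore` names, the in-memory dictionary and the example metadata fields are all assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: the data, its metadata and a globally unique identifier."""
    data: bytes
    metadata: dict
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ObjectStore:
    """Hypothetical in-memory store with a flat namespace keyed by object ID."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        # The metadata travels with the object, not in a separate database.
        obj = StoredObject(data=data, metadata=dict(metadata))
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        return self._objects[object_id]

    def search(self, **criteria):
        """Return objects whose metadata matches every given key/value pair."""
        return [
            obj for obj in self._objects.values()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


store = ObjectStore()
photo_id = store.put(b"...jpeg bytes...", type="photo", owner="x", camera="DSLR")
store.put(b"...pdf bytes...", type="pdf", owner="y")
store.put(b"...jpeg bytes...", type="photo", owner="y")

photos = store.search(type="photo")      # all objects of a certain type
by_owner_x = store.search(owner="x")     # all objects created by owner "x"
```

Because the metadata is a plain dictionary, attributes can be added, changed or deleted later without touching the data itself.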
What is metadata?
Metadata provides additional descriptive data related to a file. For example, a photographic image might include the information below:
- Created/Modified/Saved dates
- Image resolution
- Image size
- Camera type
- Location of image
Whilst the list of information contained within the file is reasonably extensive, it is nigh on impossible to add extra information to the photograph itself. Metadata takes this information one step further by creating an additional metadata file that could also contain the following:
- Who took the photograph?
- What are the photograph’s contents?
- Details such as a person in the photograph wearing red shoes
- Summary of the photograph
- Keywords associated with the photograph
File = Data
Metadata = Descriptive information
How do companies cope?
The easy part is storing information on disk, cloud, tape, optical or flash arrays; the list goes on. All of these have pros and cons in terms of management, running costs, subscription fees and WAN/LAN links. Typical responses are to:
- Back up everything forever
- Increase WAN/LAN links
- Protect, replicate and clone information
A company does a great job of protecting this data because it is viewed as essential to the business: it might relate to a recent sale, legal case, bio-science information, engineering drawings or car design. The list is endless.
For many years data archiving was used to move information off primary tiered storage to lower cost storage such as tape or SATA RAID arrays. Some data archiving software maintained a symbolic link to the original file. This would free up storage space and improve performance by freeing up system resources. Nowadays, with all-flash storage arrays becoming the norm, performance isn’t as critical as it was with hard disks. Whilst these all-flash arrays use compression and deduplication to provide massive amounts of storage space, not all data can be deduplicated or compressed, for example video or music files.
The problem with data archiving is that it requires constant housekeeping to ensure that the archive is maintained, and when you upgrade to a larger storage device those symbolic links are occasionally lost.
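To make the symbolic link approach concrete, here is a minimal Python sketch of moving a file to an archive tier, leaving a link behind, and then checking for links whose target has gone missing. The function names `archive_file` and `broken_links` and the folder layout are hypothetical, and real archiving products use their own stub mechanisms. Creating symbolic links may also need extra privileges on Windows.

```python
import os
import shutil
import tempfile
from pathlib import Path


def archive_file(path, archive_dir):
    """Move a file to the archive tier and leave a symbolic link in its place."""
    path, archive_dir = Path(path), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / path.name
    shutil.move(str(path), str(target))
    path.symlink_to(target)  # users still open the file at its old location
    return target


def broken_links(root):
    """List symbolic links under root whose archived target no longer exists."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) and not os.path.exists(full):
                found.append(full)
    return found


with tempfile.TemporaryDirectory() as tmp:
    primary = Path(tmp) / "primary"
    primary.mkdir()
    report = primary / "report.txt"
    report.write_text("quarterly figures")

    archived = archive_file(report, Path(tmp) / "archive")
    still_readable = report.read_text()   # read through the link
    links_before = broken_links(primary)

    archived.unlink()                     # simulate a migration that loses the target
    links_after = broken_links(primary)
```

Running a check like `broken_links` after every storage upgrade is one way to catch lost links before a user does.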
It is important that all archival data resides on a single storage platform rather than being spread across a multitude of storage platforms from various hardware vendors, as this could lead to delays in resolving a dispute or completing a project.
Journaling helps the company respond to legal, regulatory and organisational compliance requirements by recording inbound and outbound email communications, ensuring staff are not abusing the trust the company has placed in them.
Email is a powerful enterprise software tool that enables worldwide communication and transmission of information across the internet. It is quick and simple to attach a .PDF and send it to a third-party organisation for them to action. Journaling creates a copy of every email communication and stores it in its own separate mailbox. There are many email journaling software solutions available, and some of the most popular are Commvault, Mimecast and VERITAS. These email journaling vendors provide tools for you to search and analyse mail messages for eDiscovery and regulatory compliance. They all provide a method of removing the data from its native environment and storing it elsewhere.
The downside is that they store the information in proprietary formats, which can cause problems when you want to migrate data or change vendors. The data is no longer in its native format and cannot be read by the application that wrote it unless it is converted back to its original format.
Most of the data a company generates is unstructured, accounting for 70% to 80% of all data stored. It does not reside on a single storage platform; it is kept on a wide variety of storage platforms including NAS, SAN and cloud, to name a few.
Unstructured data is the gorilla in our storage: it is difficult to handle because of the sheer diversity of file types created.
Unstructured data consists of a variety of document types including Word, Excel, PowerPoint, Audio, Video, Text, Messaging, Pictures, .PDF, Images etc.
Firstly, let’s understand what we are dealing with. Unstructured data is information that is typically not stored in a database.
Unstructured data manifests itself in two ways:
- “TEXT” can be e-mail, texts, word documents, presentations, messaging systems, Twitter, Facebook etc.
- “RICH MEDIA” can be images, sound files, movie files etc
As we have explained, unstructured data consumes vast amounts of storage, but another consideration is legislation. Where this data resides is important if you need to retrieve the information for a compliance audit or lawsuit. Retrieval is made harder by the nature of unstructured data:
- data can be of any type
- not necessarily following any format or sequence
- does not follow any rules
- is not predictable
Data is held in a digital format, so it should be easy to find. Nine times out of ten it isn’t, because:
- New storage systems have been put in place and old systems removed
- Data hasn’t been backed up as expected
- Data has been deleted accidentally or maliciously
- Data has been copied to a different location that no one knows about
- Employing IT staff to find data costs money
When a company creates a document, e.g. an employee handbook, it is typically written using an office application and might be sent via email to various departments (legal, HR etc.) to check that the terminology and wording are correct. Before the handbook is sent to employees you could have 20-30 versions of the same document stored across the storage area network, each a slightly different revision. Once the handbook is signed off it is emailed to employees. If the document is 500KB, that email attachment sent to 1,000 employees would consume 500MB of disk space on the mail server. It is far easier to send an attachment via email than to send a link, and external companies are typically unable to access such links as they relate to internal drive mappings within the business.
This is where the problem lies. A small 2-minute video could be 300MB, and once it is shared across the company network it could consume hundreds of gigabytes. Within a day your company network is filling up with unnecessary unstructured data, and this is how the cycle continues.
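The arithmetic behind these figures is simple to check. The short Python sketch below assumes every recipient ends up with their own copy, that 1MB is 1,000KB, and that the video is sent to the same 1,000 employees; the function name is made up for the example.

```python
def mail_store_usage_mb(attachment_kb, recipients):
    """Space used when every recipient holds their own copy (1 MB = 1,000 KB)."""
    return attachment_kb * recipients / 1000


# 500 KB handbook sent to 1,000 employees
handbook_mb = mail_store_usage_mb(500, 1000)

# 300 MB video shared with the same 1,000 employees, converted to GB
video_gb = mail_store_usage_mb(300 * 1000, 1000) / 1000

print(f"Handbook: {handbook_mb:.0f} MB, video: {video_gb:.0f} GB")
```

Under those assumptions the handbook accounts for 500MB and the video for 300GB, which is where the "hundreds of gigabytes" comes from.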
For years, a software technology called “single instance” storage has ensured that only a single copy of a file is kept and shared. Whilst this worked well for mail servers, as they were the repository for all mail data, it typically doesn’t work where each department of a company stores data on different storage systems.
Some of the newer storage systems also offer data deduplication, which takes single instance one step further by looking for repeated patterns within and across files and creating a digital hash signature of each piece of data, so that duplicates are stored only once. A data storage system with deduplication can typically create space savings of between 20:1 and 50:1, and this could massively reduce the amount of disk space unstructured data is consuming. There is, however, a downside. Whilst many storage vendors provide deduplication, it is rarely run during normal working hours as it slows down the performance of the data storage solution; normally deduplication is carried out as a post-process task during the night when there is less load on the system.
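A rough idea of how hash signatures allow duplicates to be stored once is sketched below in Python. This is a simplified illustration using fixed-size chunks and SHA-256; the `DedupStore` class, the 4,096-byte chunk size and the sample data are assumptions, and commercial systems use their own (often variable-length) chunking schemes.

```python
import hashlib


class DedupStore:
    """Hypothetical block-level deduplicating store using fixed-size chunks."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}        # hash signature -> chunk bytes, stored once
        self.files = {}         # file name -> ordered list of hash signatures
        self.logical_bytes = 0  # what users think they have stored

    def write(self, name, data):
        signatures = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            sig = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(sig, chunk)  # keep only the first copy
            signatures.append(sig)
        self.files[name] = signatures
        self.logical_bytes += len(data)

    def read(self, name):
        """Rebuild the file from its chunk signatures."""
        return b"".join(self.chunks[sig] for sig in self.files[name])

    def ratio(self):
        """Logical bytes written divided by physical bytes actually kept."""
        physical = sum(len(c) for c in self.chunks.values())
        return self.logical_bytes / physical


store = DedupStore()
# Stand-in for a 16 KiB document made of four distinct 4 KiB chunks
handbook = b"".join(bytes([i]) * 4096 for i in range(4))
for n in range(20):  # the same document saved by 20 departments
    store.write(f"dept{n}/handbook.doc", handbook)

print(f"{store.ratio():.0f}:1")
```

Here 20 identical copies collapse to one, giving 20:1. Real-world ratios depend entirely on how repetitive the data is, and already-compressed video or music will see little benefit.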
The first task in any company is to sort the information by file type, so that an employee knows, for example, that all .PDF files reside on a NAS and that this is where they need to store PDF files once they have been written or read. Our company is no different: we have drive letters for each document type and then sub-folders based on the content.
There are companies that have Petabytes of data spinning 24×7, but no one can easily trawl this information to identify the following:
- What needs to be kept or deleted?
- What is legally required for compliance, legislation or governance?
- Can the data cause the business embarrassment and fines?
- Does the data have a value to the business but can’t be identified?
- Scanned documents going back decades.
- Email PST files containing years of conversations.
- How much unstructured data do we have?
- How many copies of the same file do we have?
- On which systems and data storage platforms does the information reside?
- When was it created?
- When was it last accessed?
- What size is the file data?
- Who owns the files?
- When was it last modified?
- Is the data relevant to the business?
- How many copies do we have?
- Do the files need to be archived?
- Should the data be restricted?
- Who is generating this data?
- Is the data ours?
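Several of the questions above (how much data, how many copies, size, owner, last accessed and last modified) can be answered by a simple scan of the file system. The Python sketch below is a starting point only: the function names and the sample folders are hypothetical, each file is read fully into memory to hash it, creation time is left out because it is not recorded consistently across platforms, and last-accessed times may be unreliable on volumes mounted without access-time updates.

```python
import hashlib
import os
import tempfile
import time
from collections import defaultdict


def inventory(root):
    """Walk a directory tree and record basic facts about every file."""
    records = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            records.append({
                "path": path,
                "extension": os.path.splitext(name)[1].lower(),
                "size_bytes": info.st_size,
                "modified": info.st_mtime,
                "accessed": info.st_atime,
                "owner_uid": info.st_uid,  # numeric owner ID; always 0 on Windows
                "sha256": digest,
            })
    return records


def duplicates(records):
    """Group paths by content hash and keep groups with more than one copy."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["sha256"]].append(rec["path"])
    return [paths for paths in groups.values() if len(paths) > 1]


def stale(records, days, now=None):
    """Paths of files not accessed within the given number of days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    return [r["path"] for r in records if r["accessed"] < cutoff]


with tempfile.TemporaryDirectory() as tmp:
    samples = [("hr", "handbook.pdf", b"v1"),
               ("legal", "handbook-copy.pdf", b"v1"),
               ("hr", "notes.txt", b"draft")]
    for folder, name, body in samples:
        os.makedirs(os.path.join(tmp, folder), exist_ok=True)
        with open(os.path.join(tmp, folder, name), "wb") as handle:
            handle.write(body)

    records = inventory(tmp)
    total_bytes = sum(r["size_bytes"] for r in records)
    copies = duplicates(records)
    untouched_90_days = stale(records, 90)
```

The resulting records can be exported to a spreadsheet or database to produce the kind of report discussed next.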
We have now identified our unstructured data, but what can we do with it? You should have a report informing you of the types of data that reside on your network. The trick is to find out the value of the data and how we can use it as a business benefit; once we understand this, we need to decide where to store it. If the data is valuable, you might need to keep it in two different locations on two different storage systems.
What is EDRM?
The Electronic Discovery Reference Model (EDRM) is an eDiscovery model created by the EDRM industry association to establish standards and guidelines in the emerging eDiscovery market. The model outlines standards for the recovery and discovery of digital data in response to a discovery request.
Summary Description of the EDRM Stages
Information governance is the management of electronic data to mitigate risk and expenses should eDiscovery become an issue – from initial creation of ESI through its final disposition. Information governance is a proactive process intended to reduce overall cost and risk.
- Identification – Locating potential sources of ESI and determining the potential data set’s scope, breadth and depth.
- Preservation – Preservation of data, often referred to as litigation (or legal) hold, ensures that ESI is protected against inappropriate alteration or destruction.
- Collection – Gathering all potentially relevant ESI for further use in the eDiscovery process (processing, review, etc.).
- Processing – Reducing the volume of ESI and converting it, if necessary, to forms more suitable (and cost effective) for review and analysis.
- Review – Evaluating collected ESI for relevance and privilege.
- Analysis – Evaluating ESI for content and context, including key patterns, topics, people and discussions.
- Production – Delivering ESI to others in appropriate forms and using appropriate delivery mechanisms.
- Presentation – Displaying ESI before audiences (at depositions, hearings, trials, etc.), especially in native and near native forms, to elicit further information, validate existing facts or positions, or persuade an audience.
The processes and technologies involved in eDiscovery are often complex, time consuming and costly because of the sheer volume of ESI that must be collected, reviewed and produced in a legal case. Additionally, unlike hardcopy evidence, electronic documents are more dynamic and contain metadata such as time-date stamps, author and recipient information, traces of their passage, and file properties.
- Preserving the original ESI content and associated metadata is the priority of the eDiscovery process, to eliminate claims of spoliation (destruction of evidence) or evidence tampering.
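One common way to demonstrate that collected content has not been altered is to record a hash signature for each file at collection time and re-check it later. The Python sketch below illustrates the idea with SHA-256; the `build_manifest` and `verify` names and the sample files are hypothetical, and a content hash on its own does not protect file-system metadata such as timestamps.

```python
import hashlib
import os
import tempfile


def build_manifest(root):
    """Record a SHA-256 hash signature for every file at collection time."""
    manifest = {}
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            manifest[os.path.relpath(path, root)] = digest
    return manifest


def verify(root, manifest):
    """Return files that are missing or whose content no longer matches."""
    current = build_manifest(root)
    return sorted(p for p, sig in manifest.items() if current.get(p) != sig)


with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("email.eml", b"original message"), ("plan.dwg", b"drawing")]:
        with open(os.path.join(tmp, name), "wb") as handle:
            handle.write(body)

    manifest = build_manifest(tmp)      # taken when the hold is applied
    untouched = verify(tmp, manifest)   # nothing has changed yet

    with open(os.path.join(tmp, "email.eml"), "wb") as handle:
        handle.write(b"edited message")  # simulate tampering
    altered = verify(tmp, manifest)
```

Keeping the manifest separately from the data it describes makes it harder for both to be changed together.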
Once data criteria are identified by the parties on both sides of a legal matter, all potentially relevant content (both electronic and hard-copy materials) is searched for across all potential data repositories and placed under a litigation hold – a process to protect the content (and metadata) from modification or deletion. After a preliminary culling process to remove obvious duplicates or other non-relevant system files, the remaining data is collected and analysed to further cull the potentially relevant data set. The remaining content is hosted in a secure environment and made accessible to reviewers who evaluate and code every file for privilege, confidentiality or relevance to the lawsuit. All documents deemed responsive to the suit are then turned over to the opposing counsel.
For good cause, the court may order discovery of any content that is not privileged (or confidential) and is relevant to the subject matter involved in the suit – no matter where it resides. In layman’s terms, if ESI is potentially relevant to the case, you may be ordered to produce it. You should always keep in mind that anything and everything is potentially discoverable.
Since the GDPR took effect in May 2018 there have been numerous complaints against big corporations. EU citizens began by aiming at Google, Facebook and Oracle, as well as smaller companies, to make the point that they are not kidding around. Many companies have already been fined for non-compliance. Google was fined $57 million for lack of transparency, inadequate information, and lack of valid consent regarding personalisation of ads. Google, Facebook, Instagram, and WhatsApp were hit with privacy complaints within hours of the GDPR taking effect. Companies today are realising that the information held on staff and customers needs to be managed differently and ring-fenced using tightly controlled rules regarding retention and deletion of this data.
The fact is that hardly anyone is doing it correctly, which raises the question: why? Companies subject to GDPR are either painfully unaware of current enforcement actions and fines or are choosing to ignore them. Again, why?
- Companies believe GDPR does not apply to them because they do not have facilities in the EU (this is plain wrong).
- Companies assume it will be years before the EU members start to target non-multinationals (again, this is wrong: all it takes is for a single EU citizen to file an online complaint against your company to start the process).
The days are long gone when companies can close their eyes and hope GDPR enforcement fades away, not to mention the risks for companies worldwide from the soon to be in effect California Consumer Privacy Act (CCPA).
The most obvious red flag is missing Data Protection Officer (DPO) contact information. Individuals looking for a quick payday from the GDPR requirements trawl websites looking for it, and a company that doesn’t even have that is a prime target to go after.
Upon a simple complaint filing, EU member authorities will be knocking on your door wanting you to answer scores of questions about your data collection and retention practices for EU citizens. Eight months into the GDPR, over 95,100 complaints had been filed. As Dirty Harry, arguably the leading philosopher of the 20th century, once asked: “Do you feel lucky…punk?”
GDPR Compliance Check List
We are now almost a year into the GDPR experience. Companies that have not yet addressed it seriously are playing with fire. Take a quick self-audit to see where your risks lie:
- Have you designated a DPO (data protection officer) and listed their contact information on your website?
- Have you included opt-in and opt-out descriptions on your personal information collection forms?
- Have you thought about data sovereignty requirements around data movement?
- Have you fully considered the right to be forgotten and how you would conduct secure deletions?
- Does your company’s information management system support GDPR compliance functionality (policies and procedures)?
Data Classification using Machine Learning & Analytics
For a company to identify the good from the bad or obsolete, do not choose humans! They are not good at reading thousands of documents to find detailed information on a subject and then linking this information together.
The types of data we are analysing could be an office document, video, email, PDF, written note or scanned document. When we want to tie this information together through a search index, AI can search and index hundreds of thousands of files and file types, create a metadata analysis of everything it has found, and then report every document matching your search to the user. For example, a video could be transcribed, and all the words spoken would be analysed and recorded within the metadata. Another use is that machine learning could be used to spot an object or colour and again record this information.
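The search index part of this can be illustrated with a plain keyword index in Python. This is deliberately not machine learning: it assumes the text has already been extracted from each file (or transcribed from a video), and the document names and contents below are invented for the example.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical extracted text from three different file types
documents = {
    "contract.pdf": "Supply contract signed by Jane Smith",
    "mail-0412.eml": "Jane Smith asked about the supply delay",
    "interview.mp4 (transcript)": "We discussed the delay in the project",
}

index = build_index(documents)
hits = search(index, "jane smith")     # every file mentioning this person
delays = search(index, "supply delay")
```

A query for a person's name returns every file that mentions them regardless of file type, which is the starting point for a GDPR removal request or a legal search.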
From the information that these systems gather, a complete overview of the information a company holds can be presented in a report to be analysed. From this we can, for example, remove a person from the system for GDPR purposes, discover possible problems relating to a security matter or in an instant provide lawyers with data relating to a legal case.
In order for a company to provide insightful analytics on its data it must have access to the information across its entire data storage estate including SAN, NAS, DAS, email etc. Otherwise you are only analysing the data in a single data silo and you could miss a crucial piece of information contained within an email.
The Data Lake – The Future
As we have explained within this document, each file extension requires a different program to read the data contents. A data lake is a repository, which could be cloud-based, where all data resides in its native format under a global namespace. The data can be structured or unstructured, and the amount of data stored could be unlimited. It is organised into large volumes of highly diverse information that could be analysed by advanced machine learning and artificial intelligence systems to provide a huge insight into the data and deliver detailed, in-depth analytical reports.
A data lake could be used to fight cancer. If all the world’s institutions created a single global data lake containing all past and present discoveries on various types of cancer, the information within it could be used to analyse findings globally relating to, for example, ovarian cancer, and this alone would eradicate duplicate efforts being performed across the world on ovarian cancer cells. In turn this would enable institutions to fine-tune their findings and report them to the world in an instant.
Another use for a data lake is to identify potential crimes within a geographic area, or to highlight areas for finding gold or oil using AI and 2D/3D scans of maps.
A data lake provides real-time reporting and analysis of information. A user would, for example, normally view a PDF file on a PC using an installed PDF reader; within a data lake the information could be presented to the user via the cloud using a terminal window.
Archiving and eDiscovery
The enforcement of discovery obligations has emphasised the need for businesses to better control all their electronic data, including email, in a more systematic way.
The result has been the development by companies of information governance programs and the adoption of active data solutions. Over the past decade, arguments have been made that ongoing archiving reduces the costs and risks inherent in the discovery process by significantly reducing the time, effort and uncertainties usually required in the identification, preservation and collection stages of the EDRM process.
We provide hardware and software data solutions that assist in the migration of email and file data to local storage or the cloud, enabling companies to meet demanding compliance requirements and address eDiscovery requests effectively and easily. This is ideal for companies using cloud-based email services like Office 365 or where data resides on storage silos. Business data would transparently move to the cloud and it would be indexed, catalogued and searchable the same as if your IT systems were still in-house: organised, accessible and secure.
At Fortuna Data we have a SaaS (software as a service) solution that provides all the advantages above in a simple, affordable and easy to use application. We also provide Office 365 backup for customers as a separate solution, and Exchange to Office 365 migration solutions.
We provide webinars, product demonstrations and site visits to discuss your requirements in more detail.
“If the data you are storing doesn’t have a value, why are you keeping it?”