Last week I was at the Hyper-V.nu event at Microsoft Netherlands HQ in Amsterdam.
Ronald Beekelaar (MVP Virtual Machine aka MVP Hyper-V) gave a Data Deduplication Deep Dive session.
This was a very good and highly technical session, which got me thinking… and I decided to write a little article about this new technology in Windows Server 2012 (Windows Server “8”).
Introducing: Deduplication in Windows Server 2012
With Windows Server 2012 Microsoft introduces a built-in, software-based data deduplication (dedupe) solution. Several storage vendors offer such solutions, but Microsoft has taken another approach by handling duplicate data at the operating system level instead of the storage level. And where some deduplication solutions work file-based, the deduplication offered in Windows Server 2012 works block-based. More on that later on…
Now, let’s take a few pointers before we start looking at dedupe in Windows Server 2012:
- Only available in Windows Server 2012.
- Is cluster aware.
- Based on a filter driver per volume.
- Not supported on boot- or system volumes, only intended for data storage volumes.
- Does not work on compressed or NTFS encrypted files.
- Dedupe requires an NTFS file system and is not supported for the new ReFS file system which is introduced in Windows Server 2012.
- Does not work with Cluster Shared Volumes.
- Does not work on files smaller than 64 KB, reparse points, or files with extended attributes.
- Not configurable through Group Policy.
- It is a post-process deduplication solution.
- Windows caching is dedupe aware.
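Since these restrictions are easy to trip over, a quick pre-check of the volume you have in mind can save some head-scratching. A minimal sketch of such a check, assuming E: is your intended data volume (the drive letter is just an example):
# A minimal eligibility check, assuming E: is the data volume you plan to enable dedupe on
$vol = Get-Volume -DriveLetter E
if ($vol.FileSystem -ne 'NTFS') { Write-Warning "Dedupe requires NTFS, this volume is $($vol.FileSystem)." }
if ($env:SystemDrive -eq 'E:')  { Write-Warning "Dedupe is not supported on the system volume." }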
How does it work?
For me this is always the most fun question to ask… because when you know how it works, you can understand the use cases and the possible gotchas when designing an environment that makes use of this technology.
Dedupe looks at the storage from a block-based point of view and divides the data into ‘chunks’, which are typically somewhere between 32 and 128 KB in size with an average of around 80 KB, although smaller chunks are possible.
To understand dedupe in Windows Server 2012, we first have to understand the concept of ‘hard links’.
When data is stored on a file system, the actual bits and bytes are stored in a single location. So, if some of those bits are the same… why save them multiple times? By using hard links you can make multiple files refer to the same bits.
Let’s clarify that one a little… When your HR department has created hundreds of *.docx files, they probably used a handful of templates. This means that a lot of the bits and bytes in those files are exactly the same!
Since dedupe views the storage in chunks, it will notice that a lot of those chunks are exactly the same. So, instead of saving the same bits and bytes multiple times, it saves the chunk only once and creates hard links in all locations so they refer to the same data.
When you view the properties of the Program Files folder, you will probably notice that the values behind “Size” and “Size on disk” differ from each other.
This is because some hard links are used for files in this folder. “Size” is the accumulated amount of bits and bytes of all the files in the folder, while “Size on disk” shows what those files actually consume on disk once the bits and bytes that are shared through hard links are only counted once.
Note that the example above is file-based, whereas dedupe is block-based and therefore provides far better utilization of the available storage.
I found the diagram below which clearly explains the basic concept of dedupe.
As you can see, some chunks (A, B and C) are used by both files.
By using a technology similar to hard links, but on block level, all files can access the correct bits and bytes while those bits and bytes only need to be stored once instead of multiple times.
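To make the principle a bit more tangible, here is a toy sketch: cut two files into fixed-size pieces, hash every piece, and keep each unique piece only once. The real engine uses variable-size chunking and a proper chunk store on the volume itself, and the file paths below are made up, so treat this purely as an illustration of why identical content only needs to be stored once:
# Toy illustration only: fixed-size chunking + hashing to find duplicate pieces
$chunkSize = 64KB
$store = @{}                                   # unique chunks, keyed by hash
foreach ($file in 'C:\Demo\report1.docx', 'C:\Demo\report2.docx') {   # hypothetical example paths
    $bytes = [System.IO.File]::ReadAllBytes($file)
    $sha   = [System.Security.Cryptography.SHA256]::Create()
    for ($i = 0; $i -lt $bytes.Length; $i += $chunkSize) {
        $len  = [Math]::Min($chunkSize, $bytes.Length - $i)
        $hash = [BitConverter]::ToString($sha.ComputeHash($bytes, $i, $len))
        if (-not $store.ContainsKey($hash)) { $store[$hash] = $len }   # store each chunk only once
    }
}
"{0} unique chunks, {1:N0} bytes after 'dedupe'" -f $store.Count, ($store.Values | Measure-Object -Sum).Sum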
The dedupe process works through scheduled tasks, but can be run interactively by using PowerShell. More about that command later on…
Why use data deduplication?
A valid question… what benefits does dedupe provide? A lot of my customers require massive amounts of storage.
The purpose of dedupe is to better utilize the storage capacity that is available to you.
Microsoft has done some research on their dedupe technology and came up with some numbers on the storage savings dedupe provided:
Content type | Storage savings
General | 50-60%
Documents | 30-50%
Application Library | 70-80%
VHD Library | 80-95%
These numbers come straight from the vendor, and the tests may have been somewhat optimized for better results.
* As an economy teacher of mine always said: “You give me the raw data and the results you want to come out of it, and I’ll provide you a calculation that offers the results you want…”.
Nevertheless these are some pretty impressive numbers! I would love to test this in a production environment and hopefully see the eyes of some IT guys grow, as well as the smiles of IT managers, when they see the storage savings in their environment.
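If you want such numbers for your own data before enabling anything, the Data Deduplication feature also ships an evaluation tool, DDPEval.exe, that you can point at an existing folder, share or volume without changing any data. A hedged example (the share path is made up and the switches can differ per build, so check ddpeval.exe /? first):
# Estimate the potential savings on an existing share without touching the data
& "$env:SystemRoot\System32\ddpeval.exe" '\\fileserver\hr$'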
But what about the performance? Any dedupe technology causes some sort of a performance hit, right?
Yeah, that’s true… also with the dedupe in Windows Server 2012.
Microsoft has offered some information about this.
Write actions have no direct performance hit, since the dedupe process runs in the background when the system is idle.
Read actions do have a performance hit: around 3% when the file is not in cache.
The components of deduplication
Drivers are always ‘fun’ to troubleshoot, and since the entire deduplication technology in Windows Server 2012 is based on a filter driver, some understanding of it may be useful.
To do this, we have to look at the technology from an architectural point of view. While dedupe can be managed through Server Manager, PowerShell and WMI, these only manage the dedupe service, which in its turn manages the dedupe jobs.
Those dedupe jobs are the ones that talk to the dedupe filter driver, which does the actual handling of the chunks of data on the file system. But when data is only stored once, the files will have to know where their data has gone. That’s where the metadata comes into play, since this is where the location of all the bits and bytes is stored.
With ‘normal’ files the metadata will only have references to the regular storage. But when a file is affected by the dedupe process, the metadata will not only refer to the regular storage but also to some chunks in the chunk store.
The dedupe service can be scheduled or can run in a background mode, where it waits for the system to enter an idle state so that the system will not experience a negative performance impact during production hours. This is also called post-process dedupe mode.
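Those schedules can be inspected and extended with the *-DedupSchedule cmdlets. A small sketch, assuming you want an extra optimization window at night (the schedule name is my own example; verify the exact parameters with Get-Help New-DedupSchedule):
# Show the schedules the dedupe feature created
Get-DedupSchedule
# Add an extra optimization window at 23:00, limited to 6 hours
New-DedupSchedule -Name "NightlyOptimization" -Type Optimization -Start "23:00" -DurationHours 6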
Dedupe and the GUI
The basic management features for dedupe are available in the GUI. Let’s do a quick walkthrough for enabling and configuring the dedupe feature in Windows Server 2012.
After installing the File Services role, add the Data Deduplication feature to that role:
Next, you can configure dedupe on a volume:
Now we get the option to configure some dedupe settings, such as files and folders to exclude… but more interesting is the setting for the minimum number of days a file must not have been changed before the dedupe process picks it up:
Dedupe and PowerShell
To enable dedupe we have to use my favorite tool: PowerShell.
The first task is to add the deduplication feature, which is part of the File Services role. This can be done by using Server Manager (GUI)… but where’s the fun in that? You can’t automate that… but by using PowerShell you can.
To enable the deduplication feature by using (elevated) PowerShell commands:
Import-Module ServerManager
Add-WindowsFeature -name FS-Data-Deduplication
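To check afterwards whether the feature really landed (handy when you script the whole thing), something like this should do:
# Verify the deduplication feature is installed before continuing
Get-WindowsFeature FS-Data-Deduplication | Select-Object Name, InstallState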
Now that the deduplication feature has been enabled, we can start configuring.
First, as with any other PowerShell module, we have to load the module. You can do this with the following command:
Import-Module Deduplication
To configure the dedupe feature on volume E on a device:
Enable-DedupVolume E:
Now that dedupe has been enabled and configured on a volume, we want to know some statistics such as what amount of storage we actually saved by using dedupe:
Get-DedupStatus
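Get-DedupStatus returns a handful of useful properties per volume; picking out a few of them gives a quick savings overview. The property names below are as I recall them, so fall back to Format-List * if yours differ:
# Summarize the savings per deduplicated volume
Get-DedupStatus | Select-Object Volume, FreeSpace, SavedSpace, OptimizedFilesCount, InPolicyFilesCount
# Or simply dump everything the cmdlet reports
Get-DedupStatus | Format-List *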
By default, the dedupe process will only affect files that have not been changed for 30 days. Especially in demo environments this can be a nasty gotcha… you probably don’t want to wait 30+ days for dedupe to start doing its thing…
So, to change this value to 0 (process the file a.s.a.p.) you can use the following command:
Set-DedupVolume E: -MinimumFileAgeDays 0
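The same cmdlet also lets you exclude folders from the process, just like the GUI setting shown earlier. A hedged example (the scratch folder is an example path of mine, not something you necessarily have):
# Exclude a scratch folder from dedupe on volume E:
Set-DedupVolume E: -ExcludeFolder 'E:\Scratch'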
Normally the dedupe process is done through scheduled tasks in the Windows operating system… but you can start this process manually with PowerShell:
Start-DedupJob E: -Type Optimization
However, this job runs in the background and may take some time. To view the status of that job, the following command can be used:
Get-DedupJob
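If you want to keep an eye on it from the same console, a small polling loop works as well. A sketch, assuming the job objects expose a Progress percentage (check Get-DedupJob | Format-List * if they don’t):
# Poll the running dedupe jobs every 30 seconds until they are all done
while (Get-DedupJob) {
    Get-DedupJob | Select-Object Type, Volume, Progress | Format-Table -AutoSize
    Start-Sleep -Seconds 30
}
'No dedupe jobs running anymore.'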
HELP!!! I’ve done something wrong and I have to disable dedupe on this volume!!
Don’t get your knickers in a twist… again, this can be done by using PowerShell. Use this command to un-dedupe the volume:
Start-DedupJob -Volume E: -Type Unoptimization
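In my understanding you will also want to disable dedupe on the volume first, so the scheduled jobs don’t simply optimize everything again. A hedged sketch:
# Stop future dedupe runs on E: first, then rehydrate the existing data
# (make sure the volume has enough free space to hold the rehydrated files!)
Disable-DedupVolume E:
Start-DedupJob -Volume E: -Type Unoptimization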
If you are as enthusiastic about this feature as I am you can read the help for the dedupe PowerShell cmdlets by using this command:
Help Dedup
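And to see every cmdlet the module offers in one go:
# List all deduplication cmdlets
Get-Command -Module Deduplication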
That’s nice! Good work done by Microsoft to include an option for data deduplication in Windows Server 2012. Wonderful!
Way too many restrictions. No CSV support, no support for running VMs or virtual desktops, no SQL Server support, no Exchange support, and files smaller than 32k are not deduped. Files that change frequently or receive a lot of IO are not supported under dedup either.
Also, what are the overheads? Under normal circumstances it utilizes 25% of the server’s memory, and up to 50% if I run the job in throughput mode. Furthermore, it appears there may be corruption issues, as corruptions are logged into the event log and a scrub will run to attempt to fix the data chunks from alternate locations. There’s also a deeper scan that walks the entire data set looking for corruptions and tries to fix them.
My question is this… since very popular chunks get entire duplicate copies once a chunk is accessed more than 100 times, what happens with corruptions that are not based on “popular chunks”? It seems as if Microsoft is saying “if your data chunks aren’t popular, you don’t need them…”
While this is a good attempt, it seems to me that it’s not fully baked yet, which is not surprising given Microsoft’s v1.0 features.
Hi there, would love to know what your size/size on disk is for a folder after you run dedupe… on a drive full of movies with Server 2012 I’m seeing a terabyte dedupe down to 4 GB… scary.
Hi Sachiv,
You can get the size on disk of a folder by viewing the properties of that folder (right-click and select Properties)… and yes, the storage reduction can be scary sometimes, although 1 TB down to 4 GB is VERY scary… that’s a rate I would only dream about 😉
Jeff.
I am highly surprised this tech is not baked into ReFS. I redid two arrays before I figured out why dedupe wouldn’t work on my new server. ReFS should have it so I can get the resiliency I want plus the post process dedupe. I don’t want to use NTFS for my new stuff but I guess I have to, in order to get dedupe.
Hi Steve,
I see ReFS as a version 1.0, since there are lots of applications for which it’s simply not supported for now. The next version is something I look forward to, because I think Microsoft will expand the support and it will become very nice to apply in any environment (if it isn’t already, because file servers are supported and allow for an amazing dedupe rate!). I don’t have any inside information, but the logical step to take, in my opinion, would be to expand on the support and applicability of dedupe (for example, combine it with ReFS). So I agree with you that it’s a shame it doesn’t work with ReFS (yet!)…
Jeff.
Thanks for this, very helpful. I recently implemented the dedupe and have posted the space savings here: http://blog.randait.com/2012/12/server-2012-dedupe-results/
I’m happy to read that you’ve found my post to be useful, thanks! 🙂
Enable background optimization – what are the benefits of having it on/off?
The optimization task will be executed when the system is idle, which avoids a negative performance impact during business hours.
Excellent post! Do you know, when you’ve added a folder to the exclusion list after it’s already been optimized, is there a way to unoptimize just that folder? I ran a manual optimization and the InPolicy number of files went down, but not the optimized file number. Please don’t tell me I have to delete the data off the volume, optimize the data and then copy it back, but if that’s the case, so be it.
Answering my own question: Start-DedupJob -Type Scrubbing -Volume F:
Hi JP,
In only 56 minutes you’ve found the answer to your own question, nice job! 🙂 Did you find the answer somewhere online or …?
Jeff.
Hi,
Thanks a lot for your post, it’s very useful, but I need some help 🙂
Please tell me what I need to do in order to fix my deduplication.
I have Windows Server 2012 with deduplication configured, and the dedup process has been running for almost 20 hours without making any progress.
I configured it on my volume T: and enabled it via PowerShell: Start-DedupJob T: -Type Optimization -Full
If you could ping my email that would be great 🙂
Hi Ohad,
We’ve already had direct contact but for other readers I’ll also reply here.
Windows Server 2012: If all files are in use, the dedupe process won’t start. In your case you’ll want to shut down the VMs; when the VHD(X) files are no longer in direct use by the VMs, dedupe should start doing its thing.
Windows Server 2012 R2: http://blogs.technet.com/b/filecab/archive/2013/07/31/deploying-data-deduplication-for-vdi-storage-in-windows-server-2012-r2.aspx
Enable-DedupVolume C:\ClusterStorage\Volume1 -UsageType HyperV
Jeff.
Thanks for explaining dedup in such an easy and understandable way.
My question: you did not cover the situation where there are compressed files and folders on a deduped volume. My understanding is that you only lose the space savings because dedup cannot find many equal chunks any more. Are there any other consequences of doing so?
Michael
Hi Michael,
You are correct, dedupe doesn’t touch those chunks. Therefore there should not be any consequences 🙂
Jeff.
Jeff,
From your experience, can a dedupe-enabled VHD be expanded? I have a home share server that is 1.3 TB, with dedupe running at around 41% and 900 GB savings. But I only have 25 GB free. I want to migrate the entire server to a host with a SAN attached, and then expand the VHD volume to 3 TB.
Am I about to create a big headache?
Hi Adam,
If you’ve provisioned a dynamic VHD then there shouldn’t be a problem, since the space isn’t actually allocated on disk until the data in the VHD grows beyond what the VHD file currently occupies.
If you have a static (fixed-size) VHD, then you’ll probably get an error when you try to resize it.
Jeff.
P.S. With dynamic disks it’s very, very (!!) important to monitor the disks. If they become full, your servers will stop working and it’s a b*tch to fix 😉
It is a dynamic VHD; I wish it were a VHDX so I could go larger than 2 TB. I could always convert it, I guess. Yeah, I’ve been watching the size slowly creep down, so I know I’d better act fast.
Will this feature help to remove duplicate files on a file server SAN?
Your explanation of the difference between ‘size’ and ‘size on disk’ for the Program Files folder is incorrect. The ‘size’ value will not decrease due to hard links from deduplication. Quite the opposite actually; the ‘size on disk’ will become smaller than ‘size’. What you’re seeing with the Program Files folder is actually due to the allocation unit size of the volume and a certain number of files smaller than that unit size consuming a full unit. Thus the ‘2.97 GB’ of data actually needs ‘3.03 GB’ to be stored on that volume.
And to Sachiv and others who question the ‘size on disk’ of a deduped folder: don’t rely on that value as a measure of dedup success. Deduped chunks are moved to the hidden system folder ‘System Volume Information’ at the root of the drive and hard links are left behind. What you’re actually seeing is the measure of unique data that could not be deduped. The correct way to view the space savings of your dedup is through PowerShell (run as administrator): Get-DedupStatus
Hello there. I have dedupe working on a hard disk where all my file server data is stored. The feature is working very well; I can see the results when looking at some file properties.
The problem is: the feature is storing “control files” or something like that on the same logical disk, and these files are using more than 800 GB in \System Volume Information\Dedup\ChunkStore\{D4267A78-0816-4B3E-ADB2-053A1CDD2816}.ddp\Data.
So, from what I understand, the feature is consuming more disk space than if I did not use it at all. Does that make any sense?
Best Regards.
Hi Jean,
I recommend contacting Microsoft Support, because this is not behavior you would want to see.
To me it sounds like either a bug or a misconfiguration of some kind.
Jeff.
Hi Jeff,
I have a file server with a 4 TB drive and want to enable deduplication. Will there be any impact on users while deduplication is running? Can my users use the files during this process?
Hello, I have found a GUI program to manage Data Deduplication for Windows.
Weblink is: http://www.orontesprojects.com/?page_id=371
Direct download is: http://www.orontesprojects.com/dedup/datadedup.zip
They deliver a free license to register the program…