One of the most compelling features of EMC’s DL3D appliances is their ability to schedule deduplication. Now, competitors are fond of portraying this flexibility as a weakness, but the truth is far different. In fact, this flexibility lets you choose your priorities. What do you want to finish first? Backup? Or replication?
Let’s take a look at how the systems behave, and see what the consequences are.
First, some background and assumptions. All numbers are real, achievable, and realistic in normal backup environments:
In addition to this, to clarify the terms:
So with this in mind, Data Domain is fond of saying that the only approach that makes sense is to do deduplication in-line. The rationale here is that you must replicate the backup (nothing else makes sense to them). Further, as a matter of both principle and practice, they also hold that a backup isn’t finished until the replication is complete.
Hrmph.
I can’t say that I agree with that at all, because there are really two distinct needs here: backup, and off-site copies. Backup is done when the backup application finishes writing a copy of the backup data. Off-site is done when replication (or the truck!) finishes moving the data to an off-site location of your choice.
Generally speaking, we want to finish backups as fast as possible. Even with BCVs, snaps, clones, proxy backup servers, and the like, I still want to burden my servers and storage for as short a time as possible with backup tasks.
And your Recovery Point and Recovery Time Objectives (RPO and RTO) will generally determine how important it is to finish your off-site copies by a given time. With very demanding RPOs and RTOs, you will prioritize the creation of an off-site copy.
Given all that, let’s look at the following chart that illustrates a backup and off-site copy window.
The first (light blue) arrow shows the time to complete a backup and the replication of the data to a second site. With everything happening at once, it takes a little longer than 8 hours to finish the job. So the backups that start at midnight finish a little after 8:00 in the morning.
The second (dark blue) arrows show the time to finish the backup, the deduplication, and the replication. So backup takes 4 hours (10 TB at 2.5 TB/hour), deduplication takes 4 hours (10 TB at 2.5 TB/hour), and replication takes 2 hours (500 GB at 250 GB/hr; where 500 GB would be representative of a 20:1 deduplication ratio of the 10 TB of backup data).
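For anyone who wants to check the arithmetic for the scheduled case, here is a minimal Python sketch. The sizes, rates, and the 20:1 ratio are the ones quoted above; the function name, the parameter names, and the decimal conversion (1 TB = 1,000 GB) are my own assumptions for illustration, not anything taken from the DL3D itself.

```python
def scheduled_phases(backup_tb=10.0, ingest_tb_per_hr=2.5,
                     dedup_tb_per_hr=2.5, dedup_ratio=20.0,
                     replication_gb_per_hr=250.0):
    """Return the hours spent in each phase of the scheduled case.

    Hypothetical helper: it simply divides sizes by rates, treating the
    three phases as if they ran one after another (as drawn in the chart).
    """
    backup_hr = backup_tb / ingest_tb_per_hr
    dedup_hr = backup_tb / dedup_tb_per_hr
    # Assumption: 1 TB = 1,000 GB, which is what makes 10 TB at 20:1 = 500 GB.
    replicated_gb = backup_tb * 1000.0 / dedup_ratio
    replication_hr = replicated_gb / replication_gb_per_hr
    return {
        "backup_hr": backup_hr,
        "dedup_hr": dedup_hr,
        "replicated_gb": replicated_gb,
        "replication_hr": replication_hr,
    }


phases = scheduled_phases()
total_hr = phases["backup_hr"] + phases["dedup_hr"] + phases["replication_hr"]

print(phases)
print("total hours:", total_hr)
```

With the default values this gives 4 hours of backup, 4 hours of deduplication, 500 GB to replicate, and 2 hours of replication, for 10 hours end to end. Change any one input (a slower WAN link, a lower deduplication ratio) to see how the window moves.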
(A note for technical accuracy: I have chosen to represent the deduplication and off-site copy in the second case, scheduled deduplication, as two jobs for the sake of illustration. In practice, they are a single process, just like replication of data when it is deduplicated in-line. However, by illustrating the example in this way we can see more exactly what happens, and how long it takes. The times are accurate. However, it is really only one process/task on the DL3D: it does NOT deduplicate and then replicate after deduplication is finished!)
So, in-line deduplication finishes the whole job, on-site and off-site backups, in 8 hours and a few minutes.
Scheduled deduplication finishes everything in 10 hours, a little under 2 hours longer than in-line deduplication. But the backup itself finishes in less than half the time: it is done by 4:00 am.
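To put the two approaches side by side as wall-clock times, here is a small Python sketch that assumes backups start at midnight. The in-line figure is rounded down to an even 8 hours (the real number is 8 hours and a few minutes), and the dictionary layout and names are hypothetical, purely for illustration.

```python
from datetime import datetime, timedelta

# Arbitrary date; only the time of day matters. Backups start at midnight.
START = datetime(2000, 1, 1, 0, 0)

# Hours after the start at which each milestone is reached.
# In-line: backup and off-site copy finish together (rounded to 8 hours).
# Scheduled: backup at 4 hours, off-site copy at 4 + 4 + 2 = 10 hours.
MILESTONES = {
    "in-line": {"backup done": 8.0, "off-site done": 8.0},
    "scheduled": {"backup done": 4.0, "off-site done": 10.0},
}


def clock_time(hours_after_start):
    """Convert an offset in hours into an HH:MM wall-clock string."""
    return (START + timedelta(hours=hours_after_start)).strftime("%H:%M")


finish_times = {
    mode: {name: clock_time(hours) for name, hours in marks.items()}
    for mode, marks in MILESTONES.items()
}

for mode, marks in finish_times.items():
    for name, when in marks.items():
        print(f"{mode:10s} {name:14s} {when}")
```

The output makes the trade-off plain: in-line has both milestones landing at roughly 08:00, while scheduled frees up the backup clients at 04:00 and completes the off-site copy at 10:00.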
The real value of the ability to choose in-line deduplication or scheduled deduplication is this: you get to decide which you want to finish first, backup or replication. The choice is yours. To me, there are good reasons for making either choice. SLAs, RPOs, RTOs, internal priorities, backup methodologies, and so on will all factor into the decision. But at the end of the day, to say that only one approach makes sense is just wrong. Different users will have different priorities, and will weigh all those factors differently.
EMC’s offering lets you make the choice. And even if you elect to do in-line deduplication (the only option with Data Domain’s offering), at least you can still do it on a platform that is faster, more scalable, and more reliable.