Deduplication and Replication

03 October 2008 | Categories: Data Technology

One of the most compelling features of EMC’s DL3D appliances is their ability to schedule deduplication. Now, competitors are fond of portraying this flexibility as a weakness, but the truth is far different. In fact, this flexibility lets you choose your priorities. What do you want to finish first? Backup? Or replication?

Let’s take a look at how the systems behave, and see what the consequences are.

First, some background and assumptions. All numbers are real, achievable, and realistic in normal backup environments:

  • In-line deduplication is 1.44 TB/hr over IP (true for both the DL3D 3000 and the DD690)
  • Scheduled deduplication is 2.5 TB/hr over FC (for the DL3D 3000)
  • IP replication runs at 250 GB/hr (this is really bound by link speed, latency, and so on, so I have chosen a modest figure representative of a single Gigabit IP link with low-ish latency)

In addition to this, to clarify the terms:

  • In-line deduplication happens as the data is written to disk. This is an option you can choose for each VTL or NAS share you emulate or have available on a DL3D 3000. It is the only choice on a Data Domain box.
  • Scheduled deduplication means that the data will be written to disk, and after the backup is complete, a deduplication process will begin that reduces the amount of data stored by deduplicating it.
  • With either approach, replication cannot begin until deduplication does. Only data that has been deduplicated is eligible for replication. For our purposes, I have assumed that replication will slow down in-line deduplication modestly to 1.2 TB/hr.

So with this in mind, Data Domain is fond of saying that the only approach that makes sense is to do deduplication in-line. The rationale here is that you must replicate the backup (nothing else makes sense to them). Further, both as a matter of fact and as a matter of approach, they believe that a backup isn’t finished until the replication is complete.

Hrmph.

I can’t say that I agree with that at all, because there are really two distinct needs here: backup, and off-site copies. Backup is done when the backup application finishes writing a copy of the backup data. Off-site is done when replication (or the truck!) finishes moving the data to an off-site location of your choice.

Generally speaking, we want to finish backups as fast as possible. Even with BCVs, snaps, clones, proxy backup servers, and the like, I still want to burden my servers and storage for as short a time as possible with backup tasks.

And your Recovery Point and Recovery Time Objectives (RPO and RTO) will generally determine how important it is to finish your off-site copies by a given time. With very demanding RPOs and RTOs, you will prioritize the creation of an off-site copy.

Given all that, let’s look at the following chart that illustrates a backup and off-site copy window.

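[Chart: backup and off-site copy windows for in-line deduplication (light blue arrow) and scheduled deduplication (dark blue arrows)]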

The first (light blue) arrow shows the time to complete a backup and the replication of the data to a second site with in-line deduplication. With everything happening at once, 10 TB at 1.2 TB/hr takes a little longer than 8 hours to finish the job. So the backups that start at midnight finish a little after 8:00 in the morning.
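To make the arithmetic behind the light blue arrow explicit, here is a minimal sketch. It assumes the same 10 TB backup used in the scheduled case and the 1.2 TB/hr in-line rate assumed earlier; the function and variable names are made up for illustration.

```python
from datetime import datetime, timedelta

BACKUP_TB = 10.0                     # backup size used in the example
INLINE_REPLICATING_TB_PER_HR = 1.2   # in-line rate while replicating (the post's assumption)


def inline_window_hours(backup_tb, rate_tb_per_hr):
    """Hours until backup, deduplication, and replication are all done in-line."""
    return backup_tb / rate_tb_per_hr


hours = inline_window_hours(BACKUP_TB, INLINE_REPLICATING_TB_PER_HR)

# The date is arbitrary; only the midnight start matters.
start = datetime(2008, 10, 3, 0, 0)
finish = start + timedelta(hours=hours)

print(f"In-line window: {hours:.2f} hours, done at {finish:%H:%M}")
```

Because everything runs as one stream, the backup and the off-site copy share the same finish time in this model.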

The second (dark blue) arrows show the time to finish the backup, the deduplication, and the replication. So backup takes 4 hours (10 TB at 2.5 TB/hr), deduplication takes 4 hours (10 TB at 2.5 TB/hr), and replication takes 2 hours (500 GB at 250 GB/hr, where 500 GB is representative of a 20:1 deduplication ratio on the 10 TB of backup data).

(A note for technical accuracy: I have chosen to represent the deduplication and off-site copy in the second case, scheduled deduplication, as two jobs for the sake of illustration. In practice, they are a single process, just like replication of data when it is deduplicated in-line. However, by illustrating the example in this way we can see more exactly what happens, and how long it takes. The times are accurate. However, it is really only one process/task on the DL3D: it does NOT deduplicate and then replicate after deduplication is finished!)
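The three dark blue phases can be worked out the same way. This is only a sketch of the illustration above: it treats the phases as strictly sequential (which, as noted, is a simplification of what the DL3D really does), uses 1 TB = 1000 GB, and the names are hypothetical.

```python
BACKUP_TB = 10.0
FC_TB_PER_HR = 2.5             # backup ingest and scheduled deduplication rate over FC
DEDUP_RATIO = 20.0             # 20:1, as in the example
REPLICATION_GB_PER_HR = 250.0
GB_PER_TB = 1000.0             # decimal units, matching 10 TB / 20 = 500 GB


def scheduled_phases(backup_tb):
    """Return the duration in hours of each phase, modeled as sequential steps."""
    replicated_gb = backup_tb * GB_PER_TB / DEDUP_RATIO
    return {
        "backup": backup_tb / FC_TB_PER_HR,
        "deduplication": backup_tb / FC_TB_PER_HR,
        "replication": replicated_gb / REPLICATION_GB_PER_HR,
    }


phases = scheduled_phases(BACKUP_TB)
total_hours = sum(phases.values())

for name, hrs in phases.items():
    print(f"{name:>13}: {hrs:.1f} h")
print(f"{'total':>13}: {total_hours:.1f} h")
```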

So, in-line deduplication finishes the whole job, on-site and off-site backups, in 8 hours and a few minutes.

Scheduled deduplication finishes everything in 10 hours, a little more than 1.5 hours longer than in-line deduplication. But the backup finishes in less than half the time. The backup is done by 4:00 am.
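If you want to plug in your own numbers, the comparison can be sketched as two small functions. The defaults are the figures from this post; the names, the `Windows` type, and the assumption that replication always keeps pace with in-line deduplication are mine, so treat this as a rough model rather than a sizing tool.

```python
from dataclasses import dataclass


@dataclass
class Windows:
    backup_done_hr: float    # hours after start when the backup is complete
    offsite_done_hr: float   # hours after start when the off-site copy is complete


def inline_windows(backup_tb, inline_tb_per_hr=1.2):
    # Assumption: replication keeps pace, so both finish together.
    t = backup_tb / inline_tb_per_hr
    return Windows(backup_done_hr=t, offsite_done_hr=t)


def scheduled_windows(backup_tb, fc_tb_per_hr=2.5, dedup_ratio=20.0,
                      replication_gb_per_hr=250.0):
    backup = backup_tb / fc_tb_per_hr
    dedup = backup_tb / fc_tb_per_hr
    replication = backup_tb * 1000.0 / dedup_ratio / replication_gb_per_hr
    return Windows(backup_done_hr=backup,
                   offsite_done_hr=backup + dedup + replication)


inline = inline_windows(10.0)
scheduled = scheduled_windows(10.0)

extra_offsite_hr = scheduled.offsite_done_hr - inline.offsite_done_hr
backup_fraction = scheduled.backup_done_hr / inline.backup_done_hr

print(f"Off-site copy finishes {extra_offsite_hr:.2f} h later with scheduled dedup")
print(f"Backup finishes in {backup_fraction:.0%} of the in-line backup time")
```

Changing `dedup_ratio` or `replication_gb_per_hr` shows how quickly the off-site window moves with link speed, while the backup window stays fixed by the ingest rate.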

The real value of the ability to choose in-line deduplication or scheduled deduplication is this: you get to choose which you want to finish first: backup, or replication. The choice is yours. To me, there are good reasons for making either choice. SLAs, RPOs, RTOs, internal priorities, backup methodologies, and so on will all factor into the decision. But at the end of the day, to say that only one approach makes sense is just wrong. Different users will have different priorities, and will weight all those factors differently.

EMC’s offering lets you make the choice. And even if you elect to do in-line deduplication (the only option with Data Domain’s offering), at least you can still do it on a platform that is faster, more scalable, and more reliable.

