AWS FSx with DFS Replication


Recently, I’ve been working on a storage migration for a customer who wants to use AWS FSx with their existing Windows Distributed Filesystem (DFS) to act as a highly-available backup server, should the local file servers go offline.

Windows DFS is a set of services offered on Windows Server that allows you to organise multiple SMB shares across multiple servers into one shared SMB namespace.

DFS supports replication (DFS-R), which can replicate SMB shares across multiple servers that share the link in the namespace, allowing for a highly-available filesystem.

The goal

My customer had existing plans to use AWS FSx for Windows as a centralised server that syncs with all of the DFS SMB shares in the namespace, so that employees can access the DFS as usual through the FSx filesystem should any of the local site file servers go offline.

DFS-R use-case

Many businesses use DFS-R because it’s reliable for its use case: file replication. DFS replication was useful back in the days of slow internet, but today, with high-speed internet and unlimited data being the norm, is there still a need to replicate between two file servers?

Remember that a solution like DFS-R shouldn’t be used for backup. It can be used for availability, but there are other solutions for that, as I’ll discuss below.

Investigate alternatives to using DFS-R, as it could be eliminated from your environment, saving lots of headaches and improving reliability.

Take a look at FSx File Gateway, which can be used as a local, low-latency hot cache in front of an FSx for Windows filesystem. You should also look at Direct Connect, which connects your on-premises environment to AWS over a high-speed, possibly dedicated link, giving you fast access to cloud-based filesystems.

Filesystem version limitations

FSx has a few filesystem versions available:

  1. FSx Version 2 Multi-AZ (SSD or HDD)
  2. FSx Version 2 Single-AZ (SSD or HDD)
  3. FSx Version 1 Single-AZ (SSD only)

It’s important to note these versions because only one of them is compatible with Windows DFS Replication: FSx Version 1. (You can still use Version 2 filesystems with DFS; you just can’t use replication.)

In most cases, being restricted to an SSD-only filesystem isn’t a problem because you need the high IOPS for the small files DFS replicates across your file servers.

However, in a cloud migration, it’s unlikely you’ll need to use DFS-R because you’ll be:

  • Migrating from an on-premises file server to a cloud file server
  • Using a cloud file server as a backup for an on-premises file server

In both situations, there are more appropriate alternatives to DFS-R. You can use DataSync to regularly sync filesystems. If you need real-time replication, you should consider using a highly-available filesystem like a Version 2 Multi-AZ system.

Problems with DFS-R

During this project, I encountered a few issues impacting the DFS replication. While the issues may not have been a direct result of DFS-R, their impact on DFS-R was alarming enough for me to raise them as concerns here:

  1. You have limited control over DFS-R. If a file replication job fails for some reason, you have limited visibility to triage the issue, even on an on-premises Windows Server.
  2. There is no in-built monitoring for DFS-R and no immediate indication that DFS has stopped replicating. You must implement your own monitoring solution to repeatedly test the replication and ensure the files are still syncing.
  3. Specific to FSx, you have even worse visibility over DFS replication. We encountered a scenario where the replication simply failed, and even AWS support were stumped. We had to rebuild the filesystem and then… mysteriously… it worked. AWS are sun-setting Version 1 filesystems for a reason.
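As a rough illustration of the home-grown monitoring mentioned in point 2, here is a minimal Python sketch of a "canary file" check: it writes a uniquely named file to one replicated share and polls a second share until the same content shows up. The function name, file naming, and timeouts are my own assumptions, not anything provided by DFS-R or FSx, and the two directories are whatever UNC paths you replicate between.

```python
import time
import uuid
from pathlib import Path


def check_replication(source_dir, replica_dir, timeout_s=300.0, poll_s=5.0):
    """Write a canary file to source_dir and wait for it to appear in replica_dir.

    Returns True if the replica holds the same content within timeout_s,
    otherwise False. Both arguments are directories, e.g. UNC share paths.
    """
    token = uuid.uuid4().hex
    # Use a .txt name: files such as *.tmp may be excluded by replication filters.
    name = f"dfsr-canary-{token}.txt"
    src = Path(source_dir) / name
    dst = Path(replica_dir) / name

    src.write_text(token, encoding="utf-8")
    deadline = time.monotonic() + timeout_s
    try:
        while True:
            try:
                if dst.read_text(encoding="utf-8") == token:
                    return True
            except OSError:
                pass  # not replicated yet, or the replica share is unreachable
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)
    finally:
        # Remove the canary from the source; the deletion should replicate too.
        src.unlink(missing_ok=True)  # requires Python 3.8+
```

You could run something like this on a schedule from a machine that can reach both shares and raise an alert whenever it returns False. Pick a timeout that is comfortably longer than your normal replication delay, otherwise you will get false alarms.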

Setting up DFS-R with FSx for Windows

If you haven’t been scared away from using DFS-R, then here’s how you can implement it yourself:

  1. Allow all standard ports used by DFS replication outbound and inbound on your filesystem’s security group (445, 135, 5722). Refer to this AWS article for more assistance.
  2. Configure your SMB shares in FSx for Windows. If you use AWS DataSync, you will need to take additional care to verify the ownership of directories, as we found some directory ACLs were migrated incorrectly, causing hidden issues with DFS replication.
  3. Add your FSx filesystem into your DFS and to the server replication groups. If you want your FSx filesystem to be accessed only in the event your local file server goes offline, make sure you set the local file server as primary, and configure your replication schedule and bandwidth.
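Before digging into DFS itself, it can help to confirm that the ports from step 1 are actually reachable. The sketch below is a plain TCP connection test in Python. The function name is my own, and the hostname in the usage note is a placeholder for your filesystem’s DNS name. It only proves that a TCP connection can be opened from the machine you run it on, so it won’t catch problems in the other direction or with any other ports your environment needs.

```python
import socket

# Ports listed in step 1 above.
DFSR_PORTS = (445, 135, 5722)


def check_ports(host, ports=DFSR_PORTS, timeout_s=3.0):
    """Return a dict mapping each port to True if a TCP connection succeeded."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=timeout_s):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, `check_ports("fs-example.corp.example.com")` (a made-up name) returns a dictionary you can print or log. Any port showing False is worth checking against the security group and any firewalls between your site and AWS.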

From here, your replication should start working. You’ll want to test it every so often, as advised earlier in the article, because there are no indicators in FSx if file replication fails.

Closing notes

I hope that you don’t experience as much of a headache with FSx as I did, and hope that my pain can be your gain.

In all seriousness, I learned a lot from this project, and I think FSx is an awesome solution.

