Tuesday, May 8, 2012

Troubleshooting Exchange 2010 DAGs Across WANs

After we established an Exchange 2010 lab across three sites, the Exchange 2010 Database Availability Group spanning multiple sites over WAN links worked most of the time. However, whenever network connectivity between sites had issues, problems followed, and I recently found myself troubleshooting an intermittent failure at one remote site.
 
The same issue is reported in the following thread, but it appears there is no resolution yet:
http://social.technet.microsoft.com/Forums/eu/exchange2010/thread/40a37573-d967-41b0-b0d2-8a9f7ae530eb 
 
You can see our network (Exchange 2010) settings via the following link:
 
 
When this happened, the EMC on the remote-site server showed the mailbox database status across the WAN as "Unknown," while the NY and WM site servers showed the copy status for all database copies on SIXLABMBX-1 as "ServiceDown." Running Get-MailboxDatabaseCopyStatus against the DAG members in the remote data center reflected the same results. Databases in an "Unknown" mount state corresponded to cases where the database was activated in one data center and the status was queried across the WAN from the other data center.
Several Windows events were logged on both the SI and NY servers:
 
Event ID 2060
 
The Microsoft Exchange Replication service encountered a transient error while attempting to start a replication instance for NYXLABMBX1DB-1\SIXLABMBX-1. The copy will be set to failed. Error: The NetworkManager has not yet been initialized. Check the event logs to determine the cause.

Event ID 2153
 
The log copier was unable to communicate with server 'NYXLABMBX-1.exlab.randomhouse.com'. The copy of database 'NYXLABMBX1DB-1\SIXLABMBX-1' is in a disconnected state. The communication error was: Communication was terminated by server 'NYXLABMBX-1.exlab.randomhouse.com': Data could not be read because the communication channel was closed. The copier will automatically retry after a short delay.
 
Event ID 2058
 
The Microsoft Exchange Replication service was unable to perform an incremental reseed of database copy 'NYXLABMBX-1DB3\SIXLABMBX-1' due to a network error. The database copy status will be set to Disconnected. Error An error occurred while communicating with server 'NYXLABMBX-1.exlab.randomhouse.com'. Error: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
 
This does not happen between the WM and NY servers.
 
We found that the only workaround was to reboot the servers one by one, after which the DAG would show healthy again. That was a big headache.
 
After much digging, we found a link discussing MTU (Maximum Transmission Unit). By default, Windows sets the NIC MTU to 1500. The MTU determines the maximum size of the largest protocol data unit (including transport headers) that can be transmitted over the underlying network layer. MTU parameters are usually associated with a communications interface (NIC, serial port, etc.) and are configured separately for each network interface. You can test MTU settings across the WAN by pinging a remote IP with the -f (don't fragment) and -l (packet size) options.
For example, from the NY server we can run ping 10.102.52.18 -f -l 1200, where 1200 is the packet size we want to test, then try 1300, 1400, 1500, and so on.
By default, a router's MTU is set between 1480 and 1500. Allowing for transport headers, from the NY server to the WM server we can run ping wmxhub-1 -f -l 1470 with no problem, while 1475 fails. That is normal. However, from the NY site to the SI site, the maximum packet size is only around 1365: ping 10.102.52.17 -f -l 1365 succeeds, while ping 10.102.52.17 -f -l 1370 fails. When we raised this with the network team, I was told it is because we have a private circuit between WM and NY but only a VPN connection between SI and NY, and the VPN adds a larger packet overhead. Usually servers and network equipment adjust the packet size automatically.
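The manual try-a-size-and-adjust process above can be automated. Below is a minimal Python sketch (my own illustrative helper, not part of the original troubleshooting) that binary-searches for the largest payload that still gets through with the same Windows ping options, -f and -l:

```python
import subprocess

def ping_df(host, size):
    """Return True if one don't-fragment ping of `size` bytes succeeds.

    Uses the Windows ping options from the post: -f (don't fragment),
    -l (payload size), plus -n 1 to send a single echo request.
    """
    result = subprocess.run(
        ["ping", host, "-f", "-l", str(size), "-n", "1"],
        capture_output=True,
    )
    return result.returncode == 0

def max_payload(host, lo=500, hi=1472, probe=ping_df):
    """Binary-search the largest payload in [lo, hi] that pings OK.

    Returns None if even the smallest size fails. The path MTU is
    roughly the returned value plus 28 bytes of IP/ICMP headers.
    """
    if not probe(host, lo):
        return None
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the search terminates
        if probe(host, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

Against our SI link this would report a value between 1365 and 1369, consistent with the manual tests above.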
We used the "netsh interface ipv4 show subinterfaces" command to check the default MTU on the mailbox servers that are members of the DAG and found that the default NIC MTU was 1500, while Microsoft sets the cluster NIC MTU to 1300. Since we do not use a dedicated NIC for replication traffic, the server transports replication traffic over the default NIC, and it seems Windows cannot adjust the MTU automatically.
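As a sanity check across many DAG members, the output of netsh interface ipv4 show subinterfaces can be parsed to flag any interface whose MTU differs from the cluster NIC's 1300. A small Python sketch; the sample output below is illustrative, with made-up counters, not copied from our servers:

```python
def parse_mtus(netsh_output):
    """Map interface name -> MTU from `netsh interface ipv4 show subinterfaces`.

    Expected columns: MTU, MediaSenseState, Bytes In, Bytes Out, Interface.
    """
    mtus = {}
    for line in netsh_output.splitlines():
        fields = line.split()
        # data rows start with a numeric MTU; skip the header and separator
        if fields and fields[0].isdigit():
            name = " ".join(fields[4:])  # interface names may contain spaces
            mtus[name] = int(fields[0])
    return mtus

SAMPLE = """\
   MTU  MediaSenseState   Bytes In  Bytes Out  Interface
------  ---------------  ---------  ---------  -------------
4294967295           1          0     186250  Loopback Pseudo-Interface 1
  1500                1  142108334   91242586  Local Area Connection 3
  1300                1    8331412    6120987  Cluster Network Connection
"""

# interfaces (other than loopback) that do not match the cluster NIC's MTU
mismatched = {name: mtu for name, mtu in parse_mtus(SAMPLE).items()
              if name != "Loopback Pseudo-Interface 1" and mtu != 1300}
```

On a healthy server, mismatched would come back empty; here it flags "Local Area Connection 3" at 1500.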
 
Then we used the following command to set the default NIC MTU to 1300:
 
netsh interface ipv4 set subinterface "Local Area Connection 3" mtu=1300 store=persistent
 
and ran netsh interface ipv4 show subinterfaces again to confirm that both the cluster NIC and the default NIC now have the same MTU (1300).
 
After we did this on all mailbox servers that are members of the DAG and restarted the replication service on all DAG members, the issue was resolved.
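For reference, the per-server steps (set the MTU persistently, verify, restart the replication service) can be scripted. This illustrative Python sketch simply builds the command lines we ran by hand; the interface name is a placeholder you would replace per server:

```python
def remediation_commands(interface, mtu=1300):
    """Build the Windows commands run on each DAG member, in order."""
    return [
        # persistently set the default NIC's MTU to match the cluster NIC
        f'netsh interface ipv4 set subinterface "{interface}" mtu={mtu} store=persistent',
        # verify both NICs now report the same MTU
        "netsh interface ipv4 show subinterfaces",
        # restart the Microsoft Exchange Replication service
        "net stop MSExchangeRepl",
        "net start MSExchangeRepl",
    ]

cmds = remediation_commands("Local Area Connection 3")
```

Restarting MSExchangeRepl on each member avoids the server-by-server reboots we were doing before.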
 
 

7 comments:

  1. Thank You, this has been driving me mad. Great fix.
  2. Thanks for this great information! You just got me back up and running again!
  3. This worked a treat for me as well using Exchange 2013 across WANs. I changed the MTU to 1300, same as the cluster MTU, and things worked great! Thanks! Back in business.
  4. Thanks for the sharing. It has solved our issue.
  5. Thanks for sharing, fixed our issue as well.
  6. Thanks for sharing! Struggling with multiple replication errors that were only visible from the management shell. Set MTU to 1300 for all interfaces. Hope this will fix those nasty errors permanently. Cheers!
  7. Thanks for your helpful information, it worked for me with Exchange Server 2010 SP2. That's great!