I've been battling other problems (like lingering objects when we've never done a restore from a backup tape or had any of our DCs offline for more than a couple of days) when I ran across one of those annoying "RPC Server Unavailable" messages from our replication monitoring.
I was already working with Microsoft on the whole "where are these lingering objects coming from" thing, so I dragged them into the muck with me.
I thought our DNS architecture was beautiful and bullet-proof. I was wrong.
Note: I've very much simplified our "architecture" in this example. In reality, we have 36 domains and around 582 domain controllers. (You don't want to know the real and extremely complex design.)
So we got the message "RPC Server unavailable" when DC2 tried to suck data from DC3. DC3 could suck data just fine from DC2.
Both DC2 and DC3 are members of an AD domain called weird.test.org. The DNS zone for weird.test.org is AD-integrated. There is one tree in the forest of test.org. All the DNS zones, in fact, are AD-integrated, and all the DCs also run DNS. All DNS servers are DCs.
DC1 is in test.org, the root domain of the AD tree. DC1 carries the test.org DNS zone and the DNS delegation for the weird.test.org DNS zone, which is delegated down to DC2 and DC3.
So you have:
DC1.test.org, carries the test.org DNS zone and delegation for weird.test.org
- The _msdcs.test.org DNS zone also lives on DC1.
- DC1 points to itself for DNS resolution.
- DNS is configured to forward to some Internet gateway DNS servers for any zones it does not carry.
- weird.test.org is delegated to DC2.weird.test.org and DC3.weird.test.org.
- DC2 points to DC1 for DNS resolution.
- DNS is configured to forward to DC1 for any zones it does not carry.
- DC3 points to DC2 for DNS resolution.
- DNS is configured to forward to DC2 for any zones it does not carry.
At one point, DC3 had been on old hardware in a different site. We demoted and booted out the old DC3 and brought up a new DC3 on new hardware with a different IP address. This was several months before this incident.
When we pinged DC2 from DC1, it pinged fine using the fully qualified domain name, e.g. DC2.weird.test.org. But when we pinged the GUID _msdcs record, e.g. abe33cdf-edd2-3456-abc1-acbd12398745._msdcs.test.org, we got a the wrong IP. WTF?
I went through every DNS server, every DNS server on the delegation for DC2's zone, weird.test.org, and checked the host A records for DC2 on every DNS server with a copy of the zone.
All the DNS servers had the right data. The Host A records were correct. The PTR were correct. There were no duplicate Host A records. There were no duplicate PTR.
We cleared all caches (cleared the DNS server cache and did an ipconfig /flushDNS).
We did a network trace to track who was the culprit with the bad IP information for DC3.weird.test.org. The trace showed that the name resolution process never left DC1. Even though DC1 was not authoritative for the weird.test.org zone and therefore did not have the DC3.weird.test.org Host A record to be able to come up with an IP for DC3.
Because DC2 uses DC1 for DNS resolution, it asked DC1 for resolution of abe33cdf-edd2-3456-abc1-acbd12398745._msdcs.test.org.
The _msdcs.test.org DNS zone contains cname records for the domain controllers (DCs), so abe33cdf-edd2-3456-abc1-acbd12398745._msdcs.test.org is a CNAME that resolves to: DC3.weird.test.org.
DC1 therefore looked for a delegation to weird.test.org.
DC1 found the delegation to weird.test.org. But it also found the name it was looking for on the delegation.
"A-ha!" DC1 said. "I don't need to actually query any of the authoritative DNS servers listed on the delegation. I have the IP right here on the delegation itself."
So DC1 returned the IP listed on the delegation rather than asking DC2 or DC3 (the two DCs authoritative for the zone).
And unfortunately, because we had replaced DC3 and changed the IP at that time, this information on the delegation never got updated.
Moral of the Story
It should have occurred to me that if DC1 needs to resolve the IP for a DC/DNS server listed on a delegation, it will just use the IP on the delegation and not actually send a request for resolution to the DC/DNS servers authoritative for the zone. It makes sense that it does that--why forward a query to the DC when you already have the IP? It was just unexpected.
I never thought to look at the delegation to find the bad IP DC1 was handing out.
And it also embarrassed me, because I thought we had corrected that IP when we replaced DC3.
Would-a; could-a; should-a.
Oh, well. One more learning experience.