Thursday, July 17, 2008

Domain Controller Replacements

We're in the middle of replacing all 550+ DCs in our Active Directory environment with new hardware.  Because some developers and applications are hardcoded to use certain DCs and since the DCs are also our DNS servers, we did not want their IP addresses or names to change.  If we changed their IPs, for example, we'd have to change the DNS entries on all the servers' TCP/IP NIC configurations, as well as the scopes in DHCP.

This isn't too bad, because we worked out a step-by-step process to demote, rename, and re-IP the old systems before we tear them down completely.  Then we can bring up the new DCs with the original name and IP.  It's a lot cleaner than doing DC renames later and much less fraught with difficulties.

Except... I've run into fun replication issues with stubborn metadata and KCCs.

THE SETUP

·       Normally, if we make a change, convergence for our entire AD infrastructure takes about 1 hour.  A recent AD health check by Microsoft confirmed this.

·       We have two DCs at every site for redundancy.

·       All DCs are GCs except 2 per domain:  the infrastructure master role holder, and a special DC at a central site we use for backups (the ntds.dit is smaller and easier to back up if it is not a GC).

·       The infrastructure master is always located at a domain hub site.

·       The second DC at the domain hub site is a GC, PDC-emulator role holder, and RID master role holder.

·       We have a hub-and-spoke replication topology for each domain, centered around a site with excellent WAN connectivity.  That hub then replicates with our site, which serves as the national hub.

·       Typically, there are anywhere from 5 to 10 sites within a domain.  Some have more, though none have fewer than 5.

·       All DCs are DNS servers, carrying their domain's AD-integrated DNS zone as well as some other legacy zones and the standard root zone.

THE SCENARIO

I demoted and took out the old DC/GC at a domain hub site. 

When I tried to promote the new hardware, I got the message "Can't join the domain, user already exists."  (Of course, the "user" is the computer, in this case.)

I've had those errors before and it is invariably one of three things:

·       Debris left over in Sites & Services.  If you look at the site, you might see the old DC you demoted still there as an object, but it won't have any connector objects.

o       You can just delete the DC object *IF* you expand it and there is *no* NTDS Settings and no connectors listed in Sites & Services.

o       If the NTDS Settings/connectors still exist under the DC object in Sites & Services, you'll need to perform a forcible removal via NTDSUTIL, which I'll discuss a little later in this blog.

·       The old DC may still be listed as a name server in DNS on the domain DNS zone's Name Servers tab.

o       Open DNS and select the domain's DNS zone.  Right-click on the zone and pick Properties to look at the Name Servers tab.

o       If the old DC's name is still listed as a name server, remove it.

·       The old DC may have left an old computer account in Active Directory Users and Computers (ADUC) and you need to delete the old account.  (That's why we usually rename the computer after the demotion, but before we take it down hard for the last time.  If you rename it, there should be no old account left in ADUC with the same name.)

WHAT I DID

But this time, I checked all the above things, and it looked clean.

So I opened the DCPROMO log, %windir%\debug\dcpromoui.log and went to the bottom.  I discovered which DC it was talking to, to sponsor its addition into the domain.  (Note:  to find the sponsor, search for:  Enter MyNetJoinDomain)
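If you'd rather not scroll a long log by hand, the search can be scripted.  Here is a minimal Python sketch, not anything DCPromo ships with:  the function name, the `context` parameter and the idea of showing a few following lines are my own assumptions.  The only thing taken from the log itself is the search string.

```python
from typing import Iterable, List

# The search string mentioned above; everything else here is illustrative.
SPONSOR_MARKER = "Enter MyNetJoinDomain"


def find_sponsor_context(lines: Iterable[str], context: int = 5) -> List[str]:
    """Return the LAST line containing the marker plus the next `context` lines.

    The last match is used because the most recent promotion attempt is
    at the bottom of the log.
    """
    buffered = [line.rstrip("\r\n") for line in lines]
    last_hit = -1
    for index, line in enumerate(buffered):
        if SPONSOR_MARKER in line:
            last_hit = index
    if last_hit < 0:
        return []  # marker not present in this log
    return buffered[last_hit:last_hit + 1 + context]
```

To use it, read %windir%\debug\dcpromoui.log into a list of lines and pass that list in.  I'm not certain of the log's text encoding, so opening the file with `errors="replace"` is probably the safe choice.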

I checked the sponsoring DC and found that it still listed the old DC in Sites & Services *and* it preferred that old DC as its replication partner, even though it no longer existed.  And there was nothing I could do in replmon, repadmin or Sites & Services to force the KCC to give up replication with the dead DC and establish a connection with the remaining DC in the site.

So I did a forcible removal of the old DC's metadata using NTDSUTIL, with the focus set on the stubborn sponsoring DC.  (I'll list that process further down.)

Tried to DCPROMO the new DC again:  no go.  Checked the log again and found it had selected a different sponsoring DC from another site.  This new sponsoring DC also refused to give up its replication connector to the old, removed DC.  So I had to do NTDSUTIL again to remove metadata on that system.

HOW I DID IT

Here, in a nutshell, is how to remove a demoted DC's metadata so that the KCC will stop trying to create connectors to DCs that no longer exist, and so that you can reuse a domain controller name (if you wish to).

Oh, you have to be at least a domain admin.

And you have to do this on a DC in your domain.

1.      Open Sites & Services / Expand the target site / Expand the target DC you want to remove

2.      Check for the NTDS object beneath the DC server object and connections within that

3.      If the NTDS object does *not* exist, just delete the DC server object and you're done.  Skip to the end.

4.      If the NTDS object exists, *continue*

5.      At the command prompt, enter:  ntdsutil
Then, at the ntdsutil prompt, enter:  metadata cleanup

6.      Enter:  connections

7.      Enter:  connect to server servername
where servername is the FQDN (myserver.subdom.dom) of a DC in the domain you're working with

8.      Enter: quit

9.      Enter: select operation target

10.     Enter: List sites

a.      Scroll through the sites to find the site containing the stubborn DC server object

b.      Enter:  select site sitenumber
where sitenumber is the number of the site containing the stubborn DC server object

11.     Enter:  list domains

a.      Scroll through the domains to find the domain containing the stubborn DC server object

b.      Enter:  select domain domainnumber
where domainnumber is the number of the domain containing the stubborn DC server object

12.     Enter:  list servers in site

13.     Enter: select server servernumber
where servernumber is the number of the stubborn DC you want to remove

14.     Enter:  list current selections
VERIFY that you have selected the DC you want to remove

15.     Enter: Quit

16.     Enter:  remove selected server

17.     Read the popup window and VERIFY what you are going to delete

18.     Click on [Yes]

19.     Enter: quit
Keep entering quit until you exit ntdsutil

20.     Go back to Sites & Services

21.     Make sure the stubborn DC now has no NTDS object under it.

22.     If the stubborn DC is now clean, delete the stubborn DC in Sites & Services
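The command sequence from the steps above can be written down as a checklist.  This is a minimal Python sketch that only builds the list of lines to type.  It does not run ntdsutil, and the function and parameter names are mine.  The site, domain and server numbers have to come from what the list commands show you on the day, so treat this as a cheat sheet rather than automation.  It includes the metadata cleanup menu, because that is where ntdsutil exposes remove selected server.

```python
from typing import List


def metadata_cleanup_commands(server_fqdn: str, site_number: int,
                              domain_number: int, server_number: int) -> List[str]:
    """Return the ntdsutil commands, in order, for removing a dead DC's metadata.

    server_fqdn is a live DC to connect to; the three numbers are the ones
    ntdsutil prints in its 'list ...' output (hypothetical inputs here).
    """
    return [
        "ntdsutil",
        "metadata cleanup",
        "connections",
        f"connect to server {server_fqdn}",
        "quit",                        # back to the metadata cleanup menu
        "select operation target",
        "list sites",
        f"select site {site_number}",
        "list domains",
        f"select domain {domain_number}",
        "list servers in site",
        f"select server {server_number}",
        "list current selections",     # VERIFY before going any further
        "quit",                        # back to the metadata cleanup menu
        "remove selected server",      # a confirmation popup follows
        "quit",
        "quit",                        # exits ntdsutil
    ]
```

For example, `print("\n".join(metadata_cleanup_commands("myserver.subdom.dom", 3, 1, 0)))` prints the whole list for a made-up site 3, domain 1, server 0.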

FINAL EXPLANATION

Because I was working with a DC at the main domain hub site, and that DC was the only GC at that site, the KCC on all the other domain DCs preferred that DC/GC.  When the DC/GC was demoted, there was nothing at the other end of that connector.  The sites preferred to keep that connector (even though I deleted connectors and forced the KCC to rerun).

Because the other DCs in other sites were only trying to communicate with the demoted DC/GC, they did not replicate the metadata that indicated that the old DC/GC was gone.  So the KCC would always regenerate connectors to the dead DC, which in turn meant DCs in other sites never got the news about the demotion.

I finally had to use the ntdsutil method on every promotion-sponsoring DC, which was basically one DC at each site in the domain (the DC elected as the replication bridgehead), before the dcpromo would agree that the DC's name was free to use to join the new DC to the domain and promote it.

Whew.  What a pain in the neck.

Sincerely,
Amy G. Padgett

Wednesday, July 16, 2008

New Child Domain: Stupid Error

Here is one of those smack-the-forehead-I'm-so-stupid errors.

You're setting up a new child domain in your Active Directory (AD) infrastructure and while running the DCPromo wizard, you can't complete the promotion because you get Access Denied (5) when it tries to replicate the schema partition.

On the system you're trying to promote, you open the event log for the Directory Service, and you see three errors:

Event ID: 1168

Source:     NTDS General

Category:  Internal Processing

Event ID: 1125

Source:     NTDS General

Category:  Setup

Description:

     The Active Directory Installation Wizard (DCPromo) was unable to establish connection with the following domain controller.

     Domain controller:

             Mynewdc.subdom.dom

 

     Additional Data

     Error value:

     5 Access is denied.

Event ID: 1168

Source:     NTDS General

Category:  Internal Processing

In the dcpromo log (%windir%\debug\dcpromoui.log) you also see access is denied when it tried to replicate the schema partition.

If you check the domain controller (DC) holding the domain naming master role (the FSMO role that netdom lists as the domain role owner) in the root domain, you will find corresponding Kerberos errors, due to time synchronization.

DUH.

If you're like us, you did not join the server to any domain before running DCPromo because... what would be the point?  The new child domain doesn't exist yet.

But if it's not a member of a domain yet, it can't pull time from the PDC-emulator role holder in its domain.  And you probably haven't bothered to configure the time source on it because that will change as soon as it IS a DC and you'll go to NT5DS to grab your time from your domain structure.

In our case, we set the time zone (correctly) and the time *looked* correct, until I realized it was set for AM instead of PM.  Not to mention, the day/date was WRONG.  DUH-Ditto.
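To see why an AM/PM slip is fatal, here is a minimal Python sketch of the comparison.  It assumes the default Kerberos policy tolerance of 5 minutes for clock skew (your domain policy may differ), and the function names and sample times are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: default Kerberos "maximum tolerance for computer clock
# synchronization" of 5 minutes; check your own domain policy.
MAX_SKEW = timedelta(minutes=5)


def clock_skew(local_time: datetime, reference_time: datetime) -> timedelta:
    """Absolute difference between the new server's clock and a reference clock."""
    return abs(local_time - reference_time)


def within_kerberos_tolerance(local_time: datetime, reference_time: datetime,
                              max_skew: timedelta = MAX_SKEW) -> bool:
    """True if the two clocks are close enough for Kerberos to accept tickets."""
    return clock_skew(local_time, reference_time) <= max_skew
```

A clock showing 3:30 when the real time is 15:30 looks fine at a glance, but `clock_skew(datetime(2008, 7, 16, 3, 30), datetime(2008, 7, 16, 15, 30))` is 12 hours, far outside the tolerance, so authentication fails and you get access denied.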

Simple problem, simple fix.  So simple that I found it was completely NOT documented anywhere accessible to Google searches.  And the real trick was remembering to check the event log on the domain role holder in the root domain.  Who would have thought that the defining event explaining all the answers would be on that system?

Well, really, I should have guessed that to begin with and saved myself a lot of head-scratching.

So there you are.  A blindingly simple answer to a weird problem.

New Blog

I'm starting a brand new blog today for all those folks trying to support a Microsoft Active Directory enterprise.  I know I get frustrated trying to find information about some of the problems I've run into.  And after I fix the problem, I (of course) forget to document it since we have no change management system at work.  So I'm hoping my blog will at least help me.

And it really is pretty shocking about the lack of a change management system at work, but hey, I'm not in charge.  We do, at least, keep track of schema changes.  I'm only one of four Enterprise Admins managing an Active Directory infrastructure consisting of:

1 Forest with 1 Tree (so far, so good)

32 Domains (not so good)

350+ Sites

550+ Domain Controllers

340,000+ Users

Our AD has over a million objects at this point in time.

We're pretty big.  And pretty complex, although we've done our utmost to keep the basic AD structure and operation as vanilla as possible.  Otherwise, the upgrades would eat us alive.

The things I touch and/or manage include:  DNS, AD, the DCs (replication, NTFRS, etc.), Sites & Services, Domain Trusts, and probably a lot of other junk I can't think of.  Oh, yeah, there's WINS, too.  I try to stay out of the desktop/application server arena, but hey, all problems are "AD isn't working" problems, right?  Sheesh.

There is a Help Desk and I'm in the third (final) tier of support.  If I can't solve the problem, I kick it up to Microsoft (and spend the next five hours trying to explain our environment and what I've already done to try to resolve the problem).  So I'm sort of the last outpost of civilization before it goes to Redmond.

Well, I'm not alone.  There are 3 other EAs and one of them is super-smart.  I'm not that one.  I'm just doing the best I can.

So this starts my attempt to explain what I've been learning along the way.  Maybe it will help a few folks out.  Maybe it will just confuse people more.

"If you tell the truth, you don't have to remember anything." - Mark Twain

Amy Padgett