Some things are best learned through experience.
The horror and frustration of living through them is a far more valuable lesson than the same knowledge read from a book. Losing data is one such lesson, and there is really no way to make light of that event.
One of the horrors of losing data is that it can affect anyone.
Another such lesson, though, affects only network managers. It plagues only those who are responsible for the smooth operation of the network and for their users' reliable access to resources.
What makes this issue so insidious is that it is not obvious, and it can strike even the most well-prepared individuals; even more so those still learning the eddies of trouble that managing a network can bring.
Imagine, if you will, a basic network built of the most basic networking equipment.
A network that is working perfectly and reliably. A network that maintains access to the core resources used by all of its users and allows them to pass through to the outside world for general internet access. Imagining this should be no special task because, as you read these very words, you are using such a network or you used such a network to gain access to them. Such a network is seamless, and the users take no notice of it just as they take no notice of the hallways or streets that they use until there is a problem.
The first indication I had that there was a problem was a routine helpdesk request. One of my users was unable to log on to their computer. Investigating the issue yielded a message that, at first glance, seemed a bit odd for my network but was within the realm of possibility: an IP conflict.
The user could not access the network because the workstation had been cut off: it held an address identical to another workstation's. A simple reboot solved the problem, and I verified that the workstation was getting its address from the network itself, as intended. I made a note of the address and began my investigation.
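Looking back, the quickest way to confirm a conflict like that is an ARP probe: broadcast a who-has for the suspect address and count how many different MACs answer. Below is a minimal sketch of that check using Scapy in Python; the function name and the example address are mine for illustration, not anything from the network in this story.

```python
from scapy.all import ARP, Ether, srp

def who_claims(ip, iface=None, timeout=2):
    """Broadcast an ARP who-has for `ip` and return every MAC that answers.
    More than one MAC in the result means two devices both believe they
    own the address: an IP conflict."""
    answered, _ = srp(
        Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=ip),
        timeout=timeout,
        iface=iface,
        verbose=False,
    )
    return sorted({reply[ARP].hwsrc for _, reply in answered})

# Example (placeholder address; run with raw-socket privileges):
# print(who_claims("192.168.1.57"))
```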
I checked my sheet of static addresses and found no matches.
I checked all of the devices that were supposed to have static addresses and they all had the addresses they were supposed to have.
I checked other machines around the building and none of them were having an issue.
I chalked the issue up to a random fluke and moved on.
Until another user informed me of the same problem the following morning.
And another.
And another.
And then several more.
I knew, then, that I had a serious problem on my hands but I had no idea what it could be.
I also knew that I had a limited window to find and solve the problem before my professional reputation was damaged and my "personal capital" in the organization was completely, totally, and irrevocably spent.
I examined the problem and found no obvious, inherent cause.
Since I could not treat the disease, I began treating the symptoms.
I reduced the range of assignable IP addresses and began statically assigning addresses to every affected user. I did this on the network side and hard-coded the addresses into the workstations themselves.
This, it turns out, was exactly the measure I needed to buy myself some more time.
With that time I turned to online resources, which at the time of this tale were far sparser than they are today, for help finding the root cause.
Reading and testing; more testing and more reading. I spent considerable time trying to understand where the problematic addresses were coming from but, regardless of my lack of comprehension, they were still there. They were still being issued to anything that requested them; I was merely preventing my machines from asking for them.
The main clue I received was that the computers still pulling addresses automatically were being handed a different gateway than the one I had configured on my network. This little clue, now that I had a few moments, allowed me to deduce some more information about the originating entity.
Most routers, if not all, have a management interface that is visible to the network from the "inside." This interface is reached by pointing a web browser at the address of the router. Unless otherwise specified, that is the address the router will hand out as the gateway.
So I took the gateway address that I was seeing and entered it into the web browser of my affected machine.
BOOM. I was fed back a web interface to configure a router.
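With today's tooling you would not have to wait for a victim machine at all: broadcast a DHCP discover yourself and record every server that answers, along with the gateway each one offers. Any answer beyond the legitimate server is the rogue. Here is a rough sketch of that probe with Scapy; the interface name is a placeholder and the output format is my own.

```python
import binascii

from scapy.all import BOOTP, DHCP, Ether, IP, UDP, conf, get_if_hwaddr, srp

def probe_dhcp(iface):
    """Broadcast a DHCP DISCOVER on `iface` and print every server that
    responds, the address it offers, and the gateway it hands out."""
    mac = get_if_hwaddr(iface)
    discover = (
        Ether(src=mac, dst="ff:ff:ff:ff:ff:ff")
        / IP(src="0.0.0.0", dst="255.255.255.255")
        / UDP(sport=68, dport=67)
        / BOOTP(chaddr=binascii.unhexlify(mac.replace(":", "")), flags=0x8000)
        / DHCP(options=[("message-type", "discover"), "end"])
    )
    conf.checkIPaddr = False  # DHCP replies won't match the broadcast source
    answered, _ = srp(discover, iface=iface, timeout=5, multi=True, verbose=False)
    for _, offer in answered:
        opts = {o[0]: o[1] for o in offer[DHCP].options if isinstance(o, tuple)}
        print(f"server {opts.get('server_id')} offers {offer[BOOTP].yiaddr} "
              f"with gateway {opts.get('router')}")

# probe_dhcp("eth0")  # placeholder interface name; needs raw-socket privileges
```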
I now knew what I was looking for. I knew what the device was and how it was destroying my network.
What I didn't know was WHERE it was nor who had installed it in my building.
Somewhere, in my domain, was a Rogue Router.
And so began the great hunt.
After hours, I went on safari, seeking the beast that was destroying the stability of my equipment and was poised to consume my career prospects.
With the limited equipment I had to work with, there was no better way to hunt this beast than to roam the vast halls, exploring room by room. I sought the beast.
For three nights I spent my time roaming the wilds of the network, physically examining the devices attached to it in each of the many rooms I had connected.
Until I beheld it.
In a room on the back wing I found a table.
On that table lay four additional workstations.
Four machines that were not supposed to be there.
Four machines that were not in my inventory.
Four computers that, combined with the other two in the room, could not POSSIBLY connect to the four network ports that existed in that room.
Six machines simply could not use four network connections: the math just did not work.
I climbed under the table and followed the leads out of the wall. I moved around the edge of the table so that I could examine where the lines went as they moved from beneath to above the table.
I traced the line and I found my quarry.
Buried in a nest of cables it lay, lurking and waiting. Eating the productivity of all and consuming my reputation little by little.
There was the Netgear router that was not supposed to be there.
It had one port that was not in use, one feeding into the wall, and the remainder going to the computers. But the unused one was NOT a downstream port. It was the uplink.
Whoever had installed this router didn't know what they were doing and had uplinked my entire network into it.
This was the source of the wrong addresses. This was the source of my pains.
This was the reason I had consumed so much Tylenol the previous week.
I unseated the cable that fed to the wall from the port it was nestled in and seated it in the uplink port.
I made this change and I waited.
A week passed without any additional problems.
A week in which I could not find any trace of the issue.
A week in which I rebuilt my confidence and allowed others to believe I had contained, and eliminated, the problem.
A week that lasted an eternity while I waited, hoping that I had resolved the issue.
The following week I began the slow restoration of systems back to the proper configuration now that the danger had passed.
And then I went to the user who had built the mini lab in their room.
I went to her and I asked her why she had built it and where she had received the equipment.
I also, kindly, let her know exactly how it could disrupt the entire network and that I needed her to ensure, if she ever hooked it up again, that she matched how it was plugged in now.
My first IT crisis was averted.
One of the most important lessons was learned and I learned it the hard way.