Is it the Network or is it the Server?
Is it the network?
No matter how well we design networks, servers and applications can run slow or go offline completely. Some of this may be down to too many users accessing a service, hardware failures or security issues, to name but a few. The problem is that everyone will blame the network and it will be up to you to answer the question “is it the network or is it the server?”
To be in a position to answer this question, you need data, and this data can be acquired from network monitoring tools or log files. The important thing is to set these up now and not wait for problems to happen. There are hundreds of monitoring tools available and the trick is to pick one that gives you the right level of detail to get to the root cause of network and application issues.
For this example, I am going to focus on a web application which was reported to be running slow. The story is based on a real-world problem that I worked on recently; it is a straightforward client and single server configuration. I will look at tiered applications in a later post.
For most server troubleshooting scenarios, I start off by looking at what is happening on the network before moving on to look at what is happening locally on the server. My tool of choice is LANGuardian, which is set up to monitor network traffic going to and from important servers.
The first data set that I look at is total traffic to the server broken down by protocol. Normally, you would see lots of traffic associated with open TCP ports on the server. This can vary: if media streaming applications are in use, you may see more traffic associated with UDP. As I am focusing on a web server, the ratios in the image below look correct, with a lot more TCP traffic than UDP traffic. If the server were being targeted by a volumetric DDoS attack, you would often see a lot more UDP traffic as well.
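To make the protocol check concrete, here is a minimal Python sketch of the same idea. It is not how LANGuardian works internally; the record layout, the IP addresses and the byte counts are all made up for illustration. It simply totals bytes per protocol for one server and reports each protocol's share.

```python
from collections import defaultdict

# Hypothetical flow records: (source_ip, destination_ip, protocol, bytes).
# These values are invented for illustration, not taken from the real incident.
flows = [
    ("10.0.0.5", "10.0.0.10", "TCP", 120_000),
    ("10.0.0.6", "10.0.0.10", "TCP", 80_000),
    ("10.0.0.7", "10.0.0.10", "UDP", 5_000),
    ("10.0.0.5", "10.0.0.99", "TCP", 999),  # different server, ignored below
]


def protocol_breakdown(records, server_ip):
    """Return {protocol: share of total bytes} for traffic to or from server_ip."""
    totals = defaultdict(int)
    for src, dst, proto, nbytes in records:
        if server_ip in (src, dst):
            totals[proto] += nbytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {proto: nbytes / grand_total for proto, nbytes in totals.items()}


shares = protocol_breakdown(flows, "10.0.0.10")
for proto, share in sorted(shares.items()):
    print(f"{proto}: {share:.1%}")
```

A sudden shift in these shares, compared with what is normal for the server, is the kind of signal worth drilling into.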
The next step is to drill down on the traffic volumes and see what applications are in use. NetFlow based tools will try to label applications based on TCP/UDP port numbers. In my case, I am using network packets as a data source, so the application labels are based on the packet contents, which is a lot more accurate. The top two applications are file sharing and web, which looks normal as that is what the server is used for.
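The difference between port-based and content-based labelling can be shown with a small Python sketch. Both functions below are deliberately simplistic and hypothetical: the port table is a tiny sample, and the payload check only recognises the start of an HTTP request. Real traffic analysis engines are far more thorough.

```python
# Small, illustrative port-to-application table (a few well-known ports only).
PORT_LABELS = {80: "web", 443: "web", 445: "file sharing", 53: "dns"}

# Common HTTP request methods, each followed by the space that precedes the URI.
HTTP_METHODS = (b"GET ", b"POST ", b"HEAD ", b"PUT ", b"DELETE ", b"OPTIONS ")


def label_by_port(port):
    """Flow-style labelling: guess the application from the server port alone."""
    return PORT_LABELS.get(port, "unknown")


def label_by_payload(payload):
    """Content-style labelling: look at the first bytes of the payload instead."""
    if payload.startswith(HTTP_METHODS):
        return "web"
    return "unknown"


# A web application listening on a non-standard port (8080 is an assumption).
request = b"GET /index.htm HTTP/1.1\r\nHost: intranet.example\r\n\r\n"
print(label_by_port(8080))        # the port table has no entry for 8080
print(label_by_payload(request))  # the payload itself identifies HTTP
```

The port lookup has nothing to say about a web server on an unusual port, while the payload check still identifies it. That gap is why content-based labels tend to be more accurate.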
Next, I take a look at the connection rates to the server. This report shows something interesting: one client is establishing far more connections to the server than any other. The report covers a 20 minute time frame, and that many connections in such a short period suggests automation rather than a person using the server. At this stage, it looks like the answer to the question “is it the network?” is no. The evidence so far suggests a user or application problem.
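The connection rate check can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the connection events, client addresses and the "10 times the median" threshold are all hypothetical, chosen to show how one automated client stands out from normal users over a 20 minute window.

```python
from collections import Counter
from statistics import median

# Hypothetical connection events: (seconds since start of report, client_ip).
events = (
    [(i * 2, "10.0.0.50") for i in range(600)]    # one connection every 2 s
    + [(i * 120, "10.0.0.5") for i in range(10)]
    + [(i * 200, "10.0.0.6") for i in range(6)]
    + [(i * 240, "10.0.0.7") for i in range(5)]
)


def connection_counts(records, start, window_seconds):
    """Count connections per client inside [start, start + window_seconds)."""
    end = start + window_seconds
    return Counter(client for ts, client in records if start <= ts < end)


def noisy_clients(counts, factor=10):
    """Return clients whose count exceeds factor times the median count."""
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(c for c, n in counts.items() if n > factor * typical)


counts = connection_counts(events, start=0, window_seconds=20 * 60)
for client, n in counts.most_common():
    print(f"{client}: {n} connections")
print("Unusually busy:", noisy_clients(counts))
```

Comparing each client against the median rather than a fixed number means the check adapts to how busy the server normally is. The factor of 10 is arbitrary and would need tuning for a real environment.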
The next drill-down reveals the root cause of our server issue. A user called Laura.Ashton is accessing a resource called stress.htm on the server. Detail like this is called metadata: specific data fields, such as usernames and resource names, which are extracted from network traffic. A call to the user confirmed that they were running test scripts to check server performance under load. They stopped the scripts and server performance returned to normal.
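As a rough illustration of what extracting metadata from traffic means, the Python sketch below pulls the method, resource and host out of a raw HTTP request. It is a simplified assumption of how such a field extraction could look, not LANGuardian's implementation. The host name, the client address and the IP-to-user table are hypothetical; in practice the username would come from another source such as directory logon data.

```python
def http_metadata(payload):
    """Extract method, URI and Host from a raw HTTP/1.x request, else None."""
    text = payload.decode("iso-8859-1")  # maps every byte, so it cannot fail
    head = text.split("\r\n\r\n", 1)[0]  # drop any request body
    lines = head.split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None  # not a recognisable HTTP request line
    method, uri, _version = parts
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return {"method": method, "uri": uri, "host": headers.get("host")}


# Hypothetical lookup table; a real tool would build this from logon records.
USER_BY_IP = {"10.0.0.50": "Laura.Ashton"}

raw = b"GET /stress.htm HTTP/1.1\r\nHost: webserver.example\r\n\r\n"
record = http_metadata(raw)
record["user"] = USER_BY_IP.get("10.0.0.50", "unknown")
print(record)
```

A handful of fields like these is much smaller than a full packet capture, yet it is enough to name the user and the resource involved.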
Metadata is fast becoming a must-have data source for troubleshooting security and operational issues. It is one of the main reasons why tools which monitor network traffic are seen as the next step up from flow based tools. Recently we asked a customer “What issue/requirement has the LANGuardian addressed for you?” Their response was “To get a deeper look into the traffic flow in and out of our network. It also allowed us to see what was hogging data.” For this customer, use cases like the one covered in this post are a regular occurrence, so tools like LANGuardian are a must-have for answering that age-old question: “Is it the network or is it the server?”