Nagios Core Optimization By Utilizing Telegram as Notification of Disturbance

A Network Monitoring System (NMS) is in high demand in the internet service provider industry in this era of fast-developing information technology. Having an NMS is the best option for upholding the service level agreement (SLA) and competing with other internet service providers. Network disturbances often go unnoticed by the network administrator. This can seriously degrade network quality, because resolving the problem then takes longer. To address this, the writer classified the problems using the Pareto method and integrated Nagios with Telegram Messenger as a disturbance notification channel. Nagios has many features, such as reports, event handlers, and resource monitoring (CPU load, memory usage, up/down status, uptime, data traffic, bandwidth). One notable feature of Nagios is the blast notification of disturbances, which is triggered when one of the devices is in trouble. This feature informs the network administrator or an authorized person in a given division about the network error. The problematic device can be categorized according to parameters set by the network administrator.


Introduction
A Network Monitoring System (NMS) is in high demand in the internet service provider industry in this era of fast-developing information technology. There are three things to consider in managing complex networks: the structure, management, and effectiveness of the network [1].
Having an NMS is the best option for upholding the service level agreement (SLA) and competing with other internet service providers. An NMS is indispensable for keeping track of whether network devices such as servers, routers, switches, and endpoint devices are running as they should [2].
Network disturbances often go unnoticed by the network administrator. This can seriously degrade network quality, because resolving the problem then takes longer.
Problems arise when system operators are not watching the NMS application running in the network operation center (NOC). This presumably happens when personal or additional activities draw the system operator away from the NOC room.
A multinational company in the internet service provider sector has experienced this problem. Delays in the initial analysis of the cause of a troubled device reduced the SLA achieved relative to the agreement.
An application that can help minimize the effect of this problem is Nagios. Nagios has many features, such as reports, event handlers, and resource monitoring (CPU load, memory, up/down status, uptime, data traffic, bandwidth). One notable feature of Nagios is the blast notification alert, which is triggered when one of the devices is in trouble. This feature informs the network administrator or an authorized person in a given division about the network error.
The faulty device can be categorized by predetermined parameters set by the network operators. The error notification can be sent through various communication channels, such as e-mail, short message service (SMS), WhatsApp Messenger, or Telegram.

Methodology
The research was conducted using the following methods:

Analysis of Interruption Reports and Determining Problems
This research uses the Pareto method to classify the problems in order to maintain the SLA in accordance with the agreement. As an illustration of the SLA calculation for one month, suppose the internet service provider offers a 99% SLA, so that the remaining 1% is within the tolerance limit. For a 30-day month (720 hours), 99% corresponds to about 713 hours of service guaranteed by the internet service provider, and 1% corresponds to about 7 hours. Disturbances totaling up to 7 hours are therefore still tolerable. If internet connection interruptions exceed 7 hours, customers can ask the internet service provider for restitution.
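The SLA illustration above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a 30-day month (720 hours), which is the month length consistent with the 713-hour and 7-hour figures; the function names are illustrative and not part of the original study.

```python
def sla_budget(sla_percent: float, days_in_month: int = 30) -> tuple[float, float]:
    """Return (guaranteed_hours, tolerated_downtime_hours) for one month.

    Assumption: the month has `days_in_month` days of 24 hours each.
    """
    total_hours = days_in_month * 24
    guaranteed = total_hours * sla_percent / 100
    tolerated = total_hours - guaranteed
    return guaranteed, tolerated


def exceeds_budget(downtime_minutes: float, tolerated_hours: float) -> bool:
    """True when the recorded downtime is larger than the tolerated downtime."""
    return downtime_minutes > tolerated_hours * 60


guaranteed, tolerated = sla_budget(99.0)
print(f"guaranteed: {guaranteed:.1f} h, tolerated: {tolerated:.1f} h")

# The paper rounds the tolerance to 7 hours (420 minutes).
print(exceeds_budget(675, 7))  # category C downtime from the disturbance report
```

With the rounded 7-hour tolerance, the 675 minutes reported for category C fall outside the budget, while any category with at most 420 minutes of downtime stays within it.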
A Pareto chart is a chart that sorts the classified data from left to right, from the highest rank to the lowest, so that the problems that occur most often can be identified and resolved first [3]. The steps of the Pareto method are:
1. Decide how the data will be classified, for example by problem, cause, or type of deviation.
2. Determine the unit used to rank the characteristics, for example currency (rupiah), frequency, or number of units.
3. Collect the data over the specified time interval.
4. Summarize the data and rank the categories from the largest to the smallest.
5. Calculate the cumulative frequency or cumulative percentage.
6. Draw a bar chart that shows the relative importance of each disturbance, and identify the essential items to observe later.
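Steps 4 and 5 of the Pareto method can be sketched in Python as follows. The category values are placeholders for illustration only: the 675 minutes for category C comes from the disturbance report, while the values for A, B, D, and E are hypothetical and do not reproduce Table 1.

```python
def pareto_table(downtime_by_category: dict[str, float]) -> list[tuple[str, float, float]]:
    """Rank categories from largest to smallest and add the cumulative percentage."""
    total = sum(downtime_by_category.values())
    ranked = sorted(downtime_by_category.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    cumulative = 0.0
    for category, value in ranked:
        cumulative += value
        rows.append((category, value, round(100 * cumulative / total, 1)))
    return rows


# Hypothetical downtime in minutes; only C = 675 is taken from the report.
example = {"A": 150, "B": 75, "C": 675, "D": 60, "E": 40}

for category, minutes, cumulative_pct in pareto_table(example):
    print(f"{category}: {minutes} min, cumulative {cumulative_pct}%")
```

The resulting rows are already in the left-to-right order needed for the bar chart in step 6, which could be drawn with any plotting tool.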
Based on the data acquired, the writer highlights the following. As seen in Figure 2, the most frequent disturbance in August 2019 is the Duration of the Received Information, that is, the time it takes for information to reach the operator or network administrator. This shows that slow information exchange between departments contributes to the decreased SLA [4].
In Figure 2, category C (Duration of the Received Information) has 10 disturbances with a total downtime of 675 minutes, or 11 hours 15 minutes. In other words, category C, at 675 minutes, exceeds the downtime allowed by the SLA guaranteed by the internet service provider. Therefore, the writer limits the discussion to how to overcome the problem in category C; the other four categories, A, B, D, and E, do not exceed the guaranteed SLA limit (downtime within 420 minutes). The downtime data for each category is presented in Table 1.

Data Collection
The data required by the writer was obtained through direct observation, in the form of the Escalation Process Flowchart, the Existing Topology, and one company's disturbance report for one month, from August 1, 2019, to August 31, 2019, and was used to analyze the cause of the decreasing SLA. Figure 3 shows that the length of the escalation process significantly affects the guaranteed SLA. The process is ineffective: each step takes 2 to 30 minutes. More time may be lost when the technical team schedules a visit for further checking; the time needed is tentative, since the visit may be held the next day, subject to the customer's agreement.
If the escalation time is calculated excluding the tentative visit time, it takes more than 100 minutes when the customer submits a complaint through customer service. If instead the customer complains to the salesperson, it takes around 150 minutes just to confirm the visit to check the network device.
The situation becomes worse and less effective when the administrators are not on site, so that analyzing the cause of the disturbance takes even longer. Before developing the system, it is necessary to collect data on the existing network topology. The existing topology is shown in Figure 4, from which it can be inferred that only the NMS server is located in the NOC room and that it must be watched frequently.
To handle the problem, Nagios with blast alerts via Telegram is intended to let the network administrator receive information faster when a network device is in trouble (according to parameters predefined by the network administrator), and to analyze and handle it faster, minimizing the downtime that may affect the SLA.
The topology proposed on the basis of the collected data is shown in Figure 5. With this topology, the network administrator does not need to be in the NOC room to keep track of disturbed network devices. The proposed topology also makes it possible to change the escalation process.
Furthermore, having proposed the new network topology, the writer proposes the server specification needed to install the Nagios Server, as listed in Table 2. Assuming that a server meeting the minimum specification in Table 2 is available, there are several things to consider before installing and configuring Nagios on it. The requirements are: the server provided must have Apache installed as the web server that displays the web GUI (Graphical User Interface); PHP (Hypertext Preprocessor) must be installed; NRPE (Nagios Remote Plugin Executor) must already be installed on the agents to be monitored; and other services must be installed as needed in the future. NRPE is a plugin that provides the information the NMS requires, such as Ping (Packet Internet Groper), SSH (Secure Shell), and CPU (Central Processing Unit) load. The next step is to install Nagios along with the agent and SNMP (Simple Network Management Protocol).

Development and Configuration
In this section, the writer explains what to consider when configuring the Nagios Server, NRPE, and SNMP.
An important step to avoid failure during Nagios installation is to turn off the firewall and SELinux, so that the ports Nagios uses for monitoring are not blocked by them. Another step is to enable the Nagios service to start at boot, so that it does not need to be started manually from the command line.
Fig. 6. Nagios Core Dashboard
Figure 6 shows the installed Nagios GUI with the hosts, services, and additional features that Nagios Core monitors. The Nagios Core ISO or installer is available on the official Nagios website (https://www.nagios.org/).
After making sure NRPE and SNMP are configured properly, the next step is to set up Telegram Messenger. The Nagios Server is integrated with Telegram as the disturbance notification channel, following the workflow shown in Figure 7. According to this workflow, the first step is to create a bot (a computer program that runs commands automatically) and obtain its token, which is used to command the bot to send notification alerts. A Telegram bot can be created through BotFather.

Fig. 8. BOT Token
After the bot has been created as shown in Figure 8, the next step is to create a group, consisting of network administrators or people from the Network division, and add the bot to it. The purpose is to obtain the group ID, which tells the bot which group chat to send to. Finally, the writer configured the Nagios Server to enable disturbance notifications for the hosts and services to be monitored, as shown in Figures 11 and 12. The configuration in Figure 11 instructs Nagios to send a blast alert when a host is down, while the configuration in Figure 12 instructs Nagios to send a blast alert when a monitored service experiences a disturbance. The command line contains a link to the Telegram API, which is accessed at api.telegram.org.
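The notification step can be illustrated with a small Python helper that calls the Telegram Bot API `sendMessage` method. This is a sketch under stated assumptions, not the configuration shown in Figures 11 and 12: the function names and the message layout are hypothetical, and the token and group ID must be replaced with the values obtained from BotFather and from the group created above.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.telegram.org"


def build_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build the HTTP POST request for the Bot API sendMessage method."""
    url = f"{API_BASE}/bot{token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def format_host_alert(notification_type: str, host_name: str,
                      host_state: str, host_address: str) -> str:
    """Hypothetical message layout for a host notification."""
    return (f"{notification_type}: host {host_name} is {host_state}\n"
            f"Address: {host_address}")


def send_alert(token: str, chat_id: str, text: str, timeout: float = 10.0) -> bool:
    """Send the message and return the 'ok' field of the API response."""
    request = build_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bool(json.load(response).get("ok", False))


# Example call (placeholders, performs a network request when uncommented):
# send_alert("<BOT_TOKEN>", "<GROUP_CHAT_ID>",
#            format_host_alert("PROBLEM", "router-1", "DOWN", "192.0.2.1"))
```

In a Nagios notification command, the four arguments of `format_host_alert` would typically be filled from the standard macros `$NOTIFICATIONTYPE$`, `$HOSTNAME$`, `$HOSTSTATE$`, and `$HOSTADDRESS$`; a service notification would use `$SERVICEDESC$` and `$SERVICESTATE$` in the same way.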

Results and Discussion
In this section, the writer tested NRPE and SNMP to obtain the average time needed to send a disturbance notification through Telegram. The tests used a stress test method and shutting down the device monitored by Nagios, and the mean of the collected data was calculated. The mean is obtained by adding the numbers in the data set and dividing by the number of values [6]. In statistics, this is called the arithmetic mean and is symbolized by M. The formula is:

$$M = \frac{\sum X}{N}$$

with:
M = Mean (Average Value)
∑X = Sum of Values
N = Number of Individuals

Before testing NRPE and SNMP, the writer ran a test using a standard network protocol, ICMP, commonly known as Ping. The ICMP (Ping) test was conducted by shutting down the device, and the information took an average of 85 seconds to reach the network administrator. The values are presented in Table 3.
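The arithmetic mean defined above (the sum of the values divided by the number of values) can be computed as in the following sketch. The sample delays are hypothetical placeholders chosen only to show the calculation; they are not the measurements in Table 3.

```python
def mean(values: list[float]) -> float:
    """Arithmetic mean: M = sum(X) / N."""
    if not values:
        raise ValueError("at least one measurement is required")
    return sum(values) / len(values)


# Hypothetical notification delays in seconds (not the data from Table 3).
sample_delays = [60, 90, 120]
print(mean(sample_delays))
```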

NRPE Test
Table 4 shows the results of the functional NRPE test on a server monitored by Nagios. The CPU load test via NRPE, using the stress test method, gives an average of 127 seconds for the information to reach the network administrator.

[Table columns: Test, Device Shutdown Period, Telegram Disturbance Info, Range]
From all six tests, conducted by shutting down the device and by the stress test method as presented above, the average times for all NRPE and SNMP tests are given in Table 9. With Nagios Core optimized to use Telegram for disturbance notification, information about a disturbed device takes ≥ 2 minutes to reach the network administrator. This changes the escalation process: Figure 22 shows the flow before the change and Figure 13 the flow after it. In Figure 14, the escalation process flowchart cuts up to 84 minutes from the time for the technical team to contact the customer and make an appointment for a physical visit. Table 10 presents the escalation process times from Figures 13 and 14 for making an appointment for a physical visit. Before implementation, this took 150 minutes at the slowest and 100 minutes at the fastest. After implementing the system proposed by the writer, it takes 66 minutes at the slowest and 29 minutes at the fastest. As a result, the implemented system is able to solve the problem in category C. It also lets network administrators control the network devices without being present in the NOC room. However, the implemented system cannot solve all the problems that occur in the multinational company.
Table 10. Escalation process time to make an appointment for a physical visit

Escalation Process Flowchart    Before Implementation    After Implementation
The slowest                     150 minutes              66 minutes
The fastest                     100 minutes              29 minutes
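The reduction in escalation time can be checked directly from the before and after values reported for the slowest and fastest cases. The sketch below uses only those four figures; the percentage column is derived here and is not stated in the original results.

```python
# Escalation times in minutes, as reported before and after implementation.
before = {"slowest": 150, "fastest": 100}
after = {"slowest": 66, "fastest": 29}


def time_saved(before_min: float, after_min: float) -> tuple[float, float]:
    """Return (minutes saved, percentage saved relative to the earlier time)."""
    saved = before_min - after_min
    return saved, round(100 * saved / before_min, 1)


for case in before:
    saved, percent = time_saved(before[case], after[case])
    print(f"{case}: {saved} minutes saved ({percent}%)")
```

The slowest case gives the 84-minute saving quoted in the discussion, and the fastest case gives a saving of 71 minutes.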

Conclusion
Nagios has many features, such as reports, event handlers, and resource monitoring (CPU load, memory usage, up/down status, uptime, data traffic, bandwidth). One notable feature of Nagios is the blast notification of disturbances, which is triggered when one of the devices is in trouble. This feature informs the network administrator or an authorized person in a given division about the network error. The problematic device can be categorized according to parameters set by the network administrator.
Finally, the writer hopes that readers will continue this research to find solutions to the problems that remain unsolved. Future work could further develop the current system or design a new, more sophisticated integrated system for responding to system disturbances.