
Very often when I was troubleshooting performance issues, I would see a service, or a couple of machines, slow down and reach high CPU utilization. This might mean that the service lacks resources because of high load, but very often it means that there is a bug in the code, an exception, or an error flow that over-utilizes resources. To find out which, I had to jump between New Relic/Nagios and the ELK Stack. So, I decided that I wanted one pane of glass in which to view performance metrics combined with all of the events generated by the apps, operating systems, and network devices.

In order to use ELK to monitor your platform's performance, a couple of tools and integrations are needed. Probes are required to run on each host to collect various system performance metrics. Then, the data needs to be shipped to Logstash, stored and aggregated in Elasticsearch, and turned into Kibana graphs. Ultimately, software service operations teams use these graphs to present their results.
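To make the shipping stage concrete, here is a minimal Logstash pipeline sketch that tails a probe's metric files and indexes them into Elasticsearch. The file path, index name, and Elasticsearch address are placeholder assumptions, not our exact configuration:

    # Minimal shipping pipeline: tail metric log files and index
    # each line into Elasticsearch, where Kibana can graph it.
    input {
      file {
        path => "/var/log/metrics/*.log"     # placeholder path
        start_position => "beginning"
      }
    }
    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "metrics-%{+YYYY.MM.dd}"    # one index per day
      }
    }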
In this article, I will share how we built our ELK stack to monitor our own service performance.

In the first stage of collecting and shipping data to Logstash, we used a tool called Collectl. This cool open-source project comes with a great number of options that allow operations teams to measure various metrics from many different IT systems and save the data for later analysis. We used it to generate, track, and save metrics such as network throughput, CPU disk I/O wait percentage, free memory, and idle CPU (indicating overuse or underuse of computing resources). It can also be used to monitor other system resources, such as inode use and open sockets. Finally, Collectl outputs metrics into a log file in plot format. This open-source, live project knows how to gather information but does not ship it to the ELK stack automatically.
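As a concrete sketch of that recording step (the subsystem selection, interval, and output path here are illustrative assumptions, not our production invocation), a single command can sample several subsystems and append them to plot-format files:

    # Sample CPU (c), disk (d), memory (m), and network (n) counters
    # every 10 seconds and write plot-format files under /var/log/collectl.
    collectl -scdmn -i 10 -P -f /var/log/collectl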
Here is an example of Collectl's network summary output:

    #          KBIn  PktIn  SizeIn  MultI  CmpI  ErrIn   KBOut  PktOut  SizeO  CmpO  ErrOut
    8:46:35    3255  41000      81      0     0      0  112015   78837   1454     0       0

This second Collectl example displays NFS activity, along with memory usage and interrupts.
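A command along these lines produces that view; the subsystem letters (j for interrupts, m for memory, f for NFS) follow Collectl's conventions, though the exact combination here is a sketch rather than our recorded invocation:

    # Display NFS activity together with memory usage and interrupts.
    collectl -sjmf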
Using a Docker Container

We encapsulated Collectl in a Docker container in order to have one Docker image that covered all of our data collecting and shipping needs.
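A minimal sketch of what such an image might look like; the base image, the assumption that Collectl is installable from the distribution's package repository, and the recording flags are all illustrative rather than our exact build:

    # Illustrative Dockerfile: one image that runs Collectl and
    # exposes its output on a volume for a shipper to pick up.
    FROM ubuntu:14.04
    RUN apt-get update && apt-get install -y collectl
    VOLUME /var/log/collectl
    # Collectl runs in the foreground by default; record CPU, disk,
    # memory, and network stats in plot format to the shared volume.
    CMD ["collectl", "-scdmn", "-P", "-f", "/var/log/collectl"]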
We used Collectl version 4.0.0 and made the following configurations to avoid a couple of issues: