The (state of the) Art of Monitoring

Hey, my name is Pedro and I’m a Monitoring Engineer for Lunik. My job is all about the art of monitoring our customers’ internet-based businesses. Working for Lunik is a constant challenge: our ever-changing industry requirements have allowed me to learn, design and implement state-of-the-art technologies, with a continuous improvement cycle on everything I do.

I’m a proud member of the monitoring team, a highly skilled, highly motivated one! We are a team of tech-savvy mates helping each other accomplish nearly magical stuff every day: from early detection of issues, with proper notification and escalation via centralized alert management, to research, design, implementation and feedback cycles on SEO and Real User Monitoring; from coding our own tools to creating amazing dashboards with visualization technologies. Bored of repetitive day-to-day tasks? We have the power to automate them (for good!). Not satisfied with a “solution” to something? Time to think it over, get a better solution, stress it and push it to production.

What we do:

    • RUM monitoring: the elastic-apm-rum agent is embedded in our customers’ web applications; we administer the full Elastic Stack (Elasticsearch, Logstash, APM Server, Kibana) along with open source alerting solutions for early detection of application performance issues, such as drops in activity, logon delays, or even variations in usage patterns on instrumented pages.
    • Application monitoring: backend web applications (services) are instrumented either with Prometheus exporters or with Elastic APM agents to provide insight into how each application is performing. The data is collected and analyzed by our monitoring infrastructure (the Prometheus and Alertmanager stack or the Elastic Stack), and alerts are generated based on performance thresholds.
    • SEO monitoring: we care about providing SEO-relevant information on our customers’ applications; we implement testing frameworks like Selenium, Lighthouse and Sitespeed, plus API consumers, so we can collect and report on failures and unexpected performance issues that negatively affect SEO positioning.
    • Synthetic monitoring: we implement geo-distributed testing of web applications to ensure customer sites are up and base services are responding to predefined tests from the regions where our customers offer their services.
    • Infrastructure monitoring: we have more than 5000 servers monitored under an Icinga-based infrastructure, which helps us keep them all up and running 24x7x365. We watch the always-needed CPU, memory, load, disk and traffic state, plus many other box-related checks that we keep developing to improve customer availability. Needless to say, we also monitor network links at all core points of the infrastructure: WAN links, public internet links, firewalls, routers and core switches.
    • Prometheus monitoring: not only is our in-house software instrumented with Prometheus, but node exporters, application exporters and custom exporters are also an integral part of the servers, so Prometheus can do its magic with heavy data processing and smart alerting on them!
    • Centralized event management: with such a mixture of monitoring technologies (Icinga, Prometheus, Elastic APM, RUM, Synthetic…), we have also implemented an Alerta-based NOC, where alerts from all monitoring sources are centralized and presented in a unified interface so our support teams can react seamlessly, regardless of how the alert was detected. When a detection occurs, this application delivers automated messages to the relevant chat channels, and standard operational procedures are followed to resolve the incident.
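
To give a flavour of what "centralized" means in that last point, here is a minimal Python sketch of the idea: alerts from different sources are mapped to one common shape and then merged, so one problem shows up once. Everything in it is an assumption made for illustration. The payload fields, the `Alert` class and the `from_icinga`, `from_prometheus` and `centralize` helpers are hypothetical; this is not our production code and it does not use Alerta's actual API.

```python
from dataclasses import dataclass

# Hypothetical severity ranking, used only for this illustration.
SEVERITY_ORDER = {"ok": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Alert:
    source: str    # monitoring system that raised the alert, e.g. "icinga"
    resource: str  # host or application affected
    event: str     # what went wrong, e.g. "HighCPU"
    severity: str  # one of the keys in SEVERITY_ORDER


def from_icinga(payload: dict) -> Alert:
    """Map a made-up Icinga-style payload to the common Alert shape."""
    states = {0: "ok", 1: "warning", 2: "critical"}
    # Any other state (for example "unknown") is treated as critical here.
    severity = states.get(payload["state"], "critical")
    return Alert("icinga", payload["host"], payload["service"], severity)


def from_prometheus(payload: dict) -> Alert:
    """Map a made-up Prometheus-style payload to the common Alert shape."""
    labels = payload["labels"]
    severity = labels.get("severity", "warning")
    if severity not in SEVERITY_ORDER:
        severity = "warning"  # fall back for labels this sketch does not know
    return Alert("prometheus", labels["instance"], labels["alertname"], severity)


def centralize(alerts):
    """Keep one alert per (resource, event), preferring the highest severity."""
    merged = {}
    for alert in alerts:
        key = (alert.resource, alert.event)
        current = merged.get(key)
        if (current is None
                or SEVERITY_ORDER[alert.severity] > SEVERITY_ORDER[current.severity]):
            merged[key] = alert
    return list(merged.values())


if __name__ == "__main__":
    incoming = [
        from_icinga({"host": "web01", "service": "HighCPU", "state": 1}),
        from_prometheus({"labels": {"instance": "web01",
                                    "alertname": "HighCPU",
                                    "severity": "critical"}}),
        from_icinga({"host": "db01", "service": "DiskFull", "state": 2}),
    ]
    for alert in centralize(incoming):
        print(alert)
```

In this toy run the two reports about web01 collapse into a single entry with the higher severity, which is the behaviour a support team would want from a unified view, whatever tool sits behind it.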

If the technologies and challenges are not enough to keep you motivated, we also have best-in-the-industry perks: training, certifications, conferences, mentoring, language skills and much more, but I’ll let HR handle that. Bring everything you’ve got and come join a first-class team! Your talent will be rewarded, and there is no limit to how far you can take your dream career path.