No Monitoring, No Service

What are considered best practices when monitoring a web server?

https://www.quora.com/What-are-considered-best-practices-when-monitoring-a-web-server

Stefan Greifeneder, Digital Marketing @Dynatrace, Growth Hacker, Wannabe Novelist

If you are thinking about web application monitoring, you will probably need to monitor

  • the website (availability, performance, troubleshooting)
  • the application (services) and the server itself (process: e.g. Tomcat, Nginx, …)
  • the infrastructure (physical hosts, virtual hosts, network, cloud services, maybe Docker)

Dima Korolev, Principal Maintainer at C5T/Current (2014-present)

A short one: Monitor and have paging alerts on everything that ensures that the server is functioning and will function properly.

In the order of increasing importance:

  • The fact that the server is up.
  • The fact that the service is up (add an internal hook to confirm it).
  • The version of the service running (add an internal hook to get it). // In the simplest form, this reduces to checking that all nodes run a consistent version.
  • Latency to this server, from this service, and between this and other servers. Track the average and percentiles (or, even better, the ratio of requests that fall into certain time buckets); a small sketch follows this list.
  • The fact that the service can reach other services it depends on (add a hook to confirm those).
  • The fact that the service does its job (automatically run one or several real requests once in a while to confirm they return the expected result). // monitoring of the key business flows
  • CPU, RAM, disk and network usage, current and historical data. Plotted as a graph. Alert if they get close to off the charts. // How do we decide what counts as "off the charts"?
  • QPS the service is serving. Monitor for spikes in either direction. // How do we decide whether there is a spike?
  • Second- and third-order issues, like the load balancer sending too much / too little traffic to this server. // Because of doodle, that heavy hitter, load-balancer monitoring is far too inaccurate.
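As a rough illustration of the latency item above (not part of the original answer), here is a minimal Python sketch that computes the average, a few nearest-rank percentiles, and the ratio of requests that fit into a time bucket; `latencies_ms` and `bucket_ms` are assumed placeholders for one reporting window.

```python
def latency_summary(latencies_ms, bucket_ms=200):
    """Summarize request durations (ms) collected over one reporting window."""
    s = sorted(latencies_ms)

    def pct(p):
        # nearest-rank percentile: index into the sorted list
        return s[min(len(s) - 1, int(p / 100.0 * len(s)))]

    return {
        "avg": sum(s) / len(s),
        "p50": pct(50),
        "p95": pct(95),
        "p99": pct(99),
        # ratio of requests that finished within the bucket -- the "even better" metric
        "within_bucket_ratio": sum(1 for v in s if v <= bucket_ms) / len(s),
    }

# Example: latency_summary([120, 80, 95, 300, 110], bucket_ms=200)
```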

Peak detection

Peak detection of measured signal

There are lots and lots of classic peak detection methods, any of which might work. You’ll have to see what, in particular, bounds the quality of your data. Here are basic descriptions:

  1. Between any two points in your data, (x(0), y(0)) and (x(n), y(n)), add up |y(i + 1) - y(i)| for 0 <= i < n and call this T (“travel”), and set R (“rise”) to y(n) - y(0) + k for suitably small k. T/R > 1 indicates a peak. This works OK if large travel due to noise is unlikely or if noise distributes symmetrically around a base curve shape. For your application, accept the earliest peak with a score above a given threshold, or analyze the curve of travel-per-rise values for more interesting properties. (A small sketch of this score follows the list.)
  2. Use matched filters to score similarity to a standard peak shape (essentially, use a normalized dot-product against some shape to get a cosine-metric of similarity)
  3. Deconvolve against a standard peak shape and check for high values (though I often find #2 to be less sensitive to noise for simple instrumentation output).
  4. Smooth the data and check for triplets of equally spaced points where, if x0 < x1 < x2, y1 > 0.5 * (y0 + y2), or check Euclidean distances like this: D((x0, y0), (x1, y1)) + D((x1, y1), (x2, y2)) > D((x0, y0), (x2, y2)), which relies on the triangle inequality. Using simple ratios will again provide you a scoring mechanism.
  5. Fit a very simple 2-gaussian mixture model to your data (for example, Numerical Recipes has a nice ready-made chunk of code). Take the earlier peak. This will deal correctly with overlapping peaks.
  6. Find the best match in the data to a simple Gaussian, Cauchy, Poisson, or what-have-you curve. Evaluate this curve over a broad range and subtract it from a copy of the data after noting its peak location. Repeat. Take the earliest peak whose model parameters (standard deviation, probably, but some applications might care about kurtosis or other features) meet some criterion. Watch out for artifacts left behind when peaks are subtracted from the data.
Best match might be determined by the kind of match scoring suggested in #2 above.
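As a hedged sketch of method 1 (not the original author's code), the travel/rise score can be computed as below; y is the sampled signal, i and j the two indices, and k a small constant that keeps the division well-defined.

```python
def travel_rise_score(y, i, j, k=1e-9):
    """Method 1: T ("travel") is the summed absolute step size between i and j,
    R ("rise") is the net change plus a small k. T/R noticeably above 1 means
    the curve went up and came back down in between, i.e. a candidate peak."""
    travel = sum(abs(y[n + 1] - y[n]) for n in range(i, j))
    rise = y[j] - y[i] + k
    return travel / rise

# Example: a symmetric bump gives travel = 18 but rise ~ 0, so the score explodes.
bump = [0, 1, 4, 9, 4, 1, 0]
print(travel_rise_score(bump, 0, len(bump) - 1))
```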

I’ve done what you’re doing before: finding peaks in DNA sequence data, finding peaks in derivatives estimated from measured curves, and finding peaks in histograms.

I encourage you to attend carefully to proper baselining. Wiener filtering, other smoothing filters, or simple histogram analysis is often an easy way to establish a baseline in the presence of noise.
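A minimal sketch of the histogram-analysis idea, under the assumption that baseline samples dominate the trace: take the centre of the most populated bin of the value histogram as the baseline estimate (the bin count of 50 is a placeholder to tune per data set).

```python
def histogram_baseline(values, bins=50):
    """Estimate the baseline as the centre of the most populated histogram bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0          # guard against a flat signal
    counts = [0] * bins
    for v in values:
        counts[min(bins - 1, int((v - lo) / width))] += 1
    mode_bin = counts.index(max(counts))     # most populated bin
    return lo + (mode_bin + 0.5) * width     # its centre is the baseline estimate
```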

Finally, if your data is typically noisy and you’re getting data off the card as unreferenced single-ended output (or even referenced, just not differential), and if you’re averaging lots of observations into each data point, try sorting those observations and throwing away the first and last quartile and averaging what remains. There are a host of such outlier elimination tactics that can be really useful.
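For example, the quartile-trimming tactic described above might look like this (a sketch, assuming `observations` is the list of raw samples behind one data point):

```python
def trimmed_mean(observations):
    """Drop the lowest and highest quartile of the samples, average the rest."""
    s = sorted(observations)
    q = len(s) // 4
    middle = s[q:len(s) - q] or s   # keep everything if the sample is tiny
    return sum(middle) / len(middle)

# Example: trimmed_mean([3.1, 3.0, 2.9, 3.2, 15.0, 3.1, -4.0, 3.0]) ignores the outliers.
```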

Things that need real-time alerting:

  • system monitoring data collection
  • service liveness monitoring (homepage HTTP check)
  • worker headroom (spare capacity)
  • database connection count (need to know how "too many connections" is determined?)
  • QPS (alert when QPS drops too low)
  • QPS and response time of each individual NURL
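A minimal sketch of the real-time checks above; the URL, timeout, and QPS floor are placeholders. On the "too many connections" question: MySQL raises that error once the number of client connections reaches max_connections, so the usual check compares the current connection count against max_connections.

```python
import urllib.request

HOMEPAGE = "http://localhost/"          # placeholder URL for the homepage check

def homepage_alive(timeout_s=3):
    """Liveness probe: the homepage must answer HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(HOMEPAGE, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:                      # URLError is a subclass of OSError
        return False

def qps_too_low(request_count, window_s, min_qps=10):
    """QPS alert: fire when the observed rate drops below the expected floor."""
    return request_count / window_s < min_qps
```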

Things reported periodically:

  • memory monitoring
  • slow requests: the number of slow requests and the longest elapsed time; a key indicator when diagnosing performance problems, since slow requests block other tasks
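A minimal sketch of the periodic slow-request report, assuming `durations_ms` holds per-request elapsed times for the reporting period and that 1000 ms is a placeholder slow-request threshold.

```python
def slow_request_report(durations_ms, threshold_ms=1000):
    """Count slow requests and record the longest request in the window."""
    slow = [d for d in durations_ms if d >= threshold_ms]
    return {
        "slow_count": len(slow),                 # number of slow requests
        "max_ms": max(durations_ms, default=0),  # longest elapsed time
    }
```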