Ticket #371 (closed requirement: fixed)

Opened 4 years ago

Last modified 3 years ago

The system shall monitor QoS against SLA

Reported by: johnbywater
Owned by: nils.toedtmann
Priority: critical
Milestone: ckan-v1.4
Component: ckan
Keywords:
Cc: nils.toedtmann@…
Repository:
Theme:

Description (last modified by johnbywater)

Requested by DGU.

Change History

comment:1 Changed 4 years ago by johnbywater

  • Owner set to johnbywater

comment:2 Changed 4 years ago by johnbywater

  • Description modified

comment:3 Changed 4 years ago by johnbywater

  • Type changed from enhancement to requirement
  • businessvalue set to 300
  • Description modified

comment:4 Changed 4 years ago by johnbywater

  • Status changed from new to closed

comment:5 Changed 4 years ago by johnbywater

  • Status changed from closed to reopened

comment:6 Changed 4 years ago by johnbywater

  • businessvalue changed from 300 to 1200

comment:7 Changed 4 years ago by johnbywater

  • Milestone set to v1.1

comment:8 Changed 4 years ago by johnbywater

  • Summary changed from "Implement QoS monitoring" to "The system shall monitor QoS against SLA"

comment:9 Changed 3 years ago by nils.toedtmann

  • Cc nils.toedtmann@… added

(I know the term "QoS" as a very specific networking term for classifying and prioritising network traffic. I assume here it means uptime, availability and performance monitoring?)

There seem to be at least three monitors already in place:

  • http://munin.okfn.org/ on eu1 monitoring eu[0-7] and us1, gathering additional health information via locally installed daemons. Munin's notification subsystem is not configured.
  • http://nagios.hmg.ckan.net/ on hmg.ckan.net monitoring the CKAN-HMG service group (network monitoring only). Notifications are not configured (or are they?).
  • We have an account at http://wasitup.com/ which watches some OKFN services (e.g. {ca,de,www}.ckan.org, {blog,www}.okfn.org) and sends loads of alerts to sysadmin@…. It only checks for "HTTP 200 OK" and whether the response contains a configurable string.
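The kind of check described for wasitup (status 200 plus a marker string in the body) is simple enough to reproduce in-house if we consolidate. A minimal sketch in Python, assuming a plain HTTP GET; the function names `evaluate_response` and `check_site`, the timeout and any example URL/marker string are hypothetical and not part of wasitup's or any other service's API:

```python
import urllib.request


def evaluate_response(status: int, body: str, expected: str) -> bool:
    """Healthy means HTTP 200 and the expected marker string present in the body."""
    return status == 200 and expected in body


def check_site(url: str, expected: str, timeout: float = 10.0) -> bool:
    """Fetch url and apply evaluate_response; any network or HTTP error counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate_response(resp.status, body, expected)
    except OSError:
        # URLError, HTTPError (raised for 4xx/5xx) and socket timeouts
        # are all OSError subclasses, so one handler covers them.
        return False
```

For example, `check_site("http://ckan.net/", "CKAN")` (hypothetical marker) would report the site down if it returns an error page. Note that such a check says nothing about response time, which is the other half of the QoS question.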

My 2ct: We should consolidate. What do we want?

In the latter case, we want a separate machine dedicated to monitoring only, hosted outside EC2 (e.g. at ByteMark).

We should also include root mail in the alert/notification policies. Root mail should be trimmed down to important warnings and errors only.

comment:10 Changed 3 years ago by nils.toedtmann

The Nagios fork OPSview might be worth a look.

comment:11 Changed 3 years ago by nils.toedtmann

Replying to nils.toedtmann:

There seem to be at least three monitors already in place:


Correction: at least four. We seem to have a Montastic account, too:

On 18/12/10 15:03, noreply@… wrote:

Dear okfn,
 
This is a monthly reminder that you have an account on Montastic, the
website monitor service.
 
### ACCOUNT INFORMATION
Signup date: 2009-10-06
Email you signup with: [email protected]
 
### 20 WEBSITES MONITORED
[OK] - http://www.ckan.net/
[OK] - http://www.knowledgeforge.net/
[OK] - http://okfn.org/
[not monitored] - http://blog.okfn.org/
[...]
 
### EMAIL ALERT RECIPIENTS
- [email protected]
- [email protected]
- [email protected]
[...]
To make changes to your account or contact us, go to www.montastic.com.
[...]

comment:12 Changed 3 years ago by thejimmyg

  • Owner changed from johnbywater to nils.toedtmann
  • Status changed from reopened to new

comment:13 Changed 3 years ago by thejimmyg

This ticket implies that site performance should satisfy the QoS criteria, so I am closing #485. Ensuring this happens is an ongoing process.

comment:14 Changed 3 years ago by thejimmyg

From #440, we'll also need to "Write and pass comprehensive performance tests".

comment:15 Changed 3 years ago by thejimmyg

From #395:

At the moment, some pages within CKAN tend to load slowly. We should create a profiling setup in which we can measure response times for complete requests and individual method calls.

This could be used to identify bottlenecks and find an appropriate caching or tuning strategy to improve CKAN performance.

NB: We should also agree on a maximum request latency.

TODO: Read up on all those QoS tickets to avoid overlapping efforts.
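The request-level half of such a profiling setup can be sketched as repeated timing against a latency budget. This is a minimal sketch assuming requests are issued from Python; `time_request`, `within_budget`, the sample count and the `MAX_LATENCY_S` value are hypothetical placeholders, since the maximum request latency has not been agreed yet:

```python
import statistics
import time
from typing import Callable, List

# Hypothetical placeholder: the ticket notes a maximum latency still has to be agreed.
MAX_LATENCY_S = 2.0


def time_request(fn: Callable[[], object], samples: int = 5) -> List[float]:
    """Call fn `samples` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def within_budget(durations: List[float], limit: float = MAX_LATENCY_S) -> bool:
    """Compare the median duration with the limit; the median dampens one-off outliers."""
    return statistics.median(durations) <= limit
```

A page fetch could be wrapped as `time_request(lambda: urllib.request.urlopen(url).read())`. For the per-method part, Python's standard `cProfile` module reports the time spent in each function call, which is where bottlenecks and caching candidates would show up.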

comment:16 Changed 3 years ago by anonymous

Mainly handled in http://knowledgeforge.net/okfn/tasks/ticket/564 now. Close here?

comment:17 Changed 3 years ago by thejimmyg

  • Status changed from new to closed
  • Resolution set to fixed

Marking as closed since http://knowledgeforge.net/okfn/tasks/ticket/600 now takes on this ticket.

I will check that nils has added the new DGU Bytemark servers to Nagios.
