Replace Torrus with Prometheus snmp_exporter for PDUs monitoring
Closed, Resolved, Public

Assigned To

Authored By

	fgiunchedi
	Oct 18 2016, 4:07 PM

Description

At the moment we use Torrus (https://torrus.wikimedia.org) only for PDU aggregates, to report and track power usage. The rest of the SNMP polling, for network devices for example, is handled inside librenms instead. I took a look at https://github.com/prometheus/snmp_exporter, which was recently rewritten in Go, and it might suit the "PDU metrics" use case too.

Implementation would look like this:

snmp_exporter deployed on the host(s) that will do SNMP polling
Configure snmp_exporter with snmp community and a list of interesting OIDs to poll
- Including SNMP tables exported by the PDUs (i.e. described by this MIB http://www.circitor.fr/Mibs/Html/Sentry3-MIB.php)
The exporter above exposes a /snmp endpoint over HTTP that will poll a specified "target" when asked
Configure Prometheus to call the above endpoint for each PDU to monitor
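
The polling flow above can be sketched as follows. This is only an illustration: it assumes snmp_exporter listens on its default port 9116 and that its configuration defines a module named servertech_sentry3. The exporter host, module name, and PDU hostnames are hypothetical and not taken from this task.

```python
from urllib.parse import urlencode

# Hypothetical exporter host; 9116 is snmp_exporter's default listen port.
EXPORTER = "http://netmon.example.org:9116"


def scrape_url(target: str, module: str) -> str:
    """Build the URL Prometheus would request so snmp_exporter polls one PDU."""
    query = urlencode({"target": target, "module": module})
    return f"{EXPORTER}/snmp?{query}"


# Hypothetical PDU hostnames: one scrape URL per PDU to monitor.
pdus = ["ps1-a1.example.org", "ps2-a1.example.org"]
urls = [scrape_url(pdu, "servertech_sentry3") for pdu in pdus]
for url in urls:
    print(url)
```

In a Prometheus job the same effect is usually obtained with relabeling: the PDU address is copied into the target URL parameter and the scrape address is replaced with the exporter's own address.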

TODO:

Integrate servertech4 MIB too, for newer PDUs (all of ulsfo, part of eqiad as of Jul 2019)
Namespace snmp_exporter metrics with e.g. snmp or pdu instead of the bare OID name (e.g. infeed)
(TBD how hard/complex it is to do) join infeed IDs with line IDs to have XYZ in metric labels instead of numeric IDs
Aggregate said metrics into the Prometheus global instance
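
The namespacing item could look roughly like the sketch below, assuming pdu_ is the chosen prefix. The helper name is hypothetical; in a real deployment this would more likely be a rename rule in the Prometheus or snmp_exporter configuration than Python code, and how the actual patches implement it is not shown here.

```python
PREFIX = "pdu_"  # assumed prefix; "snmp_" would work the same way


def namespaced(metric: str, prefix: str = PREFIX) -> str:
    """Return the metric name with the prefix applied exactly once."""
    return metric if metric.startswith(prefix) else prefix + metric


# Bare MIB object names in, namespaced metric names out.
names = [namespaced(n) for n in ["infeedPower", "pdu_infeedPower", "sysUpTime"]]
print(names)
```

Applying the prefix only when it is missing keeps the rule idempotent, so a metric that passes through the rename twice does not end up as pdu_pdu_infeedPower.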

Details

Subject	Repo	Branch	Lines +/-
prometheus: aggregate Sentry 4 metrics	operations/puppet	production	+6 -0
prometheus: fetch active netmon server from hiera	operations/puppet	production	+3 -2
prometheus: don't poll st4OutletCapabilities	operations/puppet	production	+1 -10
hieradata: let Prometheus on PoPs talk to snmp_exporter	operations/puppet	production	+5 -2
prometheus: generate targets for single phase PDUs	operations/puppet	production	+10 -1
facilities: introduce monitor_pdu_phase for ulsfo PDUs	operations/puppet	production	+32 -7
prometheus: add sentry4 outlet OIDs	operations/puppet	production	+274 -0
prometheus: bump timeout for pdu jobs	operations/puppet	production	+4 -0
prometheus: update snmp_exporter config	operations/puppet	production	+1 K -35
prometheus: don't snmp-poll st4InputCordNotifications	operations/puppet	production	+0 -8
prometheus: skip duplicates when generating pdu configuration	operations/puppet	production	+3 -1
prometheus: generate targets for sentry4 PDUs too	operations/puppet	production	+7 -1
prometheus: query pdu resources based on model	operations/puppet	production	+2 -1
facilities: add model to pdu monitoring	operations/puppet	production	+23 -17
prometheus: fix pdu_ metrics prefixing	operations/puppet	production	+4 -4
prometheus: add sentry4 PDUs support	operations/puppet	production	+36 -5
prometheus: prefix pdu metrics	operations/puppet	production	+25 -12
prometheus: add sentry4 snmp_exporter config	operations/puppet	production	+375 -0
prometheus: add aggregated PDU stats	operations/puppet	production	+8 -0
hieradata: allow codfw prometheus to talk to netmon eqiad	operations/puppet	production	+7 -0
prometheus: fix PDU detection and snmp_exporter config	operations/puppet	production	+2 -20
add PDUs jobs to prometheus	operations/puppet	production	+87 -1
prometheus: fix file permissions and servertech template	operations/puppet	production	+2 -1 K
Add network::monitor role	operations/puppet	production	+7 -1
prometheus: add snmp_exporter module and profile	operations/puppet	production	+2 K -0
facilities: add codfw PDUs	operations/puppet	production	+182 -2
facilities: add row and site parameters for pdus	operations/puppet	production	+98 -33

Related Objects

Mentioned In: T226778: Install new PDUs in rows A/B (Top level tracking task)
T229101: Phase monitoring for new PDUs
T214183: Setup graphs for power usage readings in Grafana
T87840: Retire Torrus
Mentioned Here: T87840: Retire Torrus
T214183: Setup graphs for power usage readings in Grafana
T171167: Evaluate LibreNMS' Graphite backend
P4995 Masterwork From Distant Lands

Event Timeline


gerritbot added a project: Patch-For-Review. Mar 3 2017, 4:19 PM

Change 341533 had a related patch set uploaded (by filippo):
[operations/puppet] facilities: add row and site parameters for pdus

https://gerrit.wikimedia.org/r/341533

Change 341534 had a related patch set uploaded (by filippo):
[operations/puppet] facilities: add codfw PDUs

https://gerrit.wikimedia.org/r/341534

Change 341535 had a related patch set uploaded (by filippo):
[operations/puppet] [WIP] add PDUs jobs to prometheus

https://gerrit.wikimedia.org/r/341535

Change 341533 merged by Filippo Giunchedi:
[operations/puppet] facilities: add row and site parameters for pdus

https://gerrit.wikimedia.org/r/341533

Change 341534 merged by Filippo Giunchedi:
[operations/puppet] facilities: add codfw PDUs

https://gerrit.wikimedia.org/r/341534

Change 342648 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet] Add network::monitor role

https://gerrit.wikimedia.org/r/342648

fgiunchedi added a project: User-fgiunchedi. Mar 15 2017, 8:26 AM

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board.

Change 341005 merged by Filippo Giunchedi:
[operations/puppet] prometheus: add snmp_exporter module and profile

https://gerrit.wikimedia.org/r/341005

Change 342648 merged by Filippo Giunchedi:
[operations/puppet] Add network::monitor role

https://gerrit.wikimedia.org/r/342648

Change 342862 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet] prometheus: fix file permissions and servertech template

https://gerrit.wikimedia.org/r/342862

Change 342862 merged by Filippo Giunchedi:
[operations/puppet] prometheus: fix file permissions and servertech template

https://gerrit.wikimedia.org/r/342862

Change 341535 merged by Filippo Giunchedi:
[operations/puppet@production] add PDUs jobs to prometheus

https://gerrit.wikimedia.org/r/341535

Change 344953 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet@production] prometheus: fix PDU detection and snmp_exporter config

https://gerrit.wikimedia.org/r/344953

Change 344953 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: fix PDU detection and snmp_exporter config

https://gerrit.wikimedia.org/r/344953

Most pieces are in place now, left to do:

allow prometheus in codfw to talk to netmon1001
report non-accepted values for "capabilities" OIDs to snmp_exporter upstream
aggregate and collect metrics globally too
grafana dashboards

Change 347622 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet@production] hieradata: allow codfw prometheus to talk to netmon eqiad

https://gerrit.wikimedia.org/r/347622

Change 347622 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: allow codfw prometheus to talk to netmon eqiad

https://gerrit.wikimedia.org/r/347622

fgiunchedi renamed this task from Evaluate prometheus snmp_exporter for Torrus PDUs metrics use case to Replace Torrus with Prometheus snmp_exporter for PDUs monitoring. May 2 2017, 8:30 AM

fgiunchedi claimed this task.

Change 352800 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add aggregated PDU stats

https://gerrit.wikimedia.org/r/352800

fgiunchedi mentioned this in T87840: Retire Torrus. May 9 2017, 11:34 AM

Change 352800 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add aggregated PDU stats

https://gerrit.wikimedia.org/r/352800

faidon moved this task from Inbox to In progress on the observability board. Jul 10 2017, 12:38 PM

faidon moved this task from In progress to Up next on the observability board. Jul 24 2017, 3:10 PM

We've ultimately gone with pushing librenms data into graphite in T171167: Evaluate LibreNMS' Graphite backend

@fgiunchedi: Could you elaborate why the SNMP exporter to prometheus didn't work for this in the end?

Right now I can only find a single graph with eqiad/codfw total (aggregated) power usage, but proper per-rack power usage data is still entirely missing. This makes it currently very hard to determine the total amount of power used per rack (across all phases) and to monitor things like phase imbalance.

I see LibreNMS seems to be exporting its PDU data (from SNMP) into Graphite, so these graphs can probably be created in Graphite as well. Please set up graphs in Graphite that present power usage readings in a way that is useful and meaningful for managing data center operations.

mark reopened this task as Open. Jan 11 2019, 3:13 PM

@fgiunchedi so could you describe in a bit more detail what is needed here and what were the challenges you faced with prometheus-snmp-exporter last time you attempted this?

fgiunchedi mentioned this in T214183: Setup graphs for power usage readings in Grafana. Jan 18 2019, 5:13 PM

In T148541#4892455, @faidon wrote:

@fgiunchedi so could you describe in a bit more detail what is needed here and what were the challenges you faced with prometheus-snmp-exporter last time you attempted this?

For sure: IIRC (and by re-reading T87840 for context too) the main challenge was around retention, hence the choice to go with librenms -> graphite instead. However snmp_exporter at the moment is working as expected in the sense that we do have metrics from it in Prometheus. I'll update the task description with more details on next steps to complete snmp_exporter deployment.

In T148541#4873008, @mark wrote:

Right now I can only find a single graph with eqiad/codfw total (aggregated) power usage, but proper per-rack power usage data is still entirely missing. This makes it currently very hard to determine the total amount of power used per rack (across all phases) and to monitor things like phase imbalance.

I see LibreNMS seems to be exporting its PDU data (from SNMP) into Graphite, so these graphs can probably be created in Graphite as well. Please set up graphs in Graphite that present power usage readings in a way that is useful and meaningful for managing data center operations.

Agreed. Given that this task is about snmp_exporter and we'll now be using librenms data in graphite, I've opened a new task specifically for this: T214183: Setup graphs for power usage readings in Grafana. We'll likely need more visualizations than what I've put in the task description.

fgiunchedi updated the task description. Jan 18 2019, 5:19 PM

fgiunchedi updated the task description. Jan 21 2019, 4:46 PM

fgiunchedi moved this task from Doing to Up next on the User-fgiunchedi board. Jan 24 2019, 10:06 AM

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:52 PM

fgiunchedi updated the task description. Jul 25 2019, 3:21 PM

fgiunchedi moved this task from Up next to Doing on the User-fgiunchedi board. Jul 26 2019, 8:44 AM

fgiunchedi mentioned this in T229101: Phase monitoring for new PDUs. Jul 26 2019, 11:22 AM

Change 526615 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add sentry4 snmp_exporter config

https://gerrit.wikimedia.org/r/526615

Change 526616 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: prefix pdu metrics

https://gerrit.wikimedia.org/r/526616

Change 526615 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add sentry4 snmp_exporter config

https://gerrit.wikimedia.org/r/526615

Change 526616 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: prefix pdu metrics

https://gerrit.wikimedia.org/r/526616

Change 526619 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add sentry4 PDUs support

https://gerrit.wikimedia.org/r/526619

Change 526625 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: fix pdu_ metrics prefixing

https://gerrit.wikimedia.org/r/526625

Change 526619 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add sentry4 PDUs support

https://gerrit.wikimedia.org/r/526619

Change 526625 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: fix pdu_ metrics prefixing

https://gerrit.wikimedia.org/r/526625

Change 526633 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] facilities: add model to pdu monitoring

https://gerrit.wikimedia.org/r/526633

Change 526634 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: query pdu resources based on model

https://gerrit.wikimedia.org/r/526634

Change 526640 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: generate targets for sentry4 PDUs too

https://gerrit.wikimedia.org/r/526640

fgiunchedi updated the task description. Jul 31 2019, 1:24 PM

Change 526633 merged by Filippo Giunchedi:
[operations/puppet@production] facilities: add model to pdu monitoring

https://gerrit.wikimedia.org/r/526633

Change 526634 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: query pdu resources based on model

https://gerrit.wikimedia.org/r/526634

Change 526640 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: generate targets for sentry4 PDUs too

https://gerrit.wikimedia.org/r/526640

Change 527498 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: skip duplicates when generating pdu configuration

https://gerrit.wikimedia.org/r/527498

Change 527498 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: skip duplicates when generating pdu configuration

https://gerrit.wikimedia.org/r/527498

Change 527548 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: don't snmp-poll st4InputCordNotifications

https://gerrit.wikimedia.org/r/527548

Change 527548 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: don't snmp-poll st4InputCordNotifications

https://gerrit.wikimedia.org/r/527548

Change 528805 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: update snmp_exporter config

https://gerrit.wikimedia.org/r/528805

Change 528805 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: update snmp_exporter config

https://gerrit.wikimedia.org/r/528805

Change 528856 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: bump timeout for pdu jobs

https://gerrit.wikimedia.org/r/528856

Change 528857 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add sentry4 outlet OIDs

https://gerrit.wikimedia.org/r/528857

Change 528856 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: bump timeout for pdu jobs

https://gerrit.wikimedia.org/r/528856

Change 528857 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add sentry4 outlet OIDs

https://gerrit.wikimedia.org/r/528857

fgiunchedi updated the task description. Aug 12 2019, 9:12 AM

Change 529790 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] facilities: introduce monitor_pdu_phase for ulsfo PDUs

https://gerrit.wikimedia.org/r/529790

Change 529791 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: generate targets for single phase PDUs

https://gerrit.wikimedia.org/r/529791

Change 529790 merged by Filippo Giunchedi:
[operations/puppet@production] facilities: introduce monitor_pdu_phase for ulsfo PDUs

https://gerrit.wikimedia.org/r/529790

Change 529791 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: generate targets for single phase PDUs

https://gerrit.wikimedia.org/r/529791

Change 529797 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: let Prometheus on PoPs talk to snmp_exporter

https://gerrit.wikimedia.org/r/529797

Change 529797 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: let Prometheus on PoPs talk to snmp_exporter

https://gerrit.wikimedia.org/r/529797

Change 529800 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: don't poll st4OutletCapabilities

https://gerrit.wikimedia.org/r/529800

Change 529800 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: don't poll st4OutletCapabilities

https://gerrit.wikimedia.org/r/529800

Change 529914 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: fetch active netmon server from hiera

https://gerrit.wikimedia.org/r/529914

Change 529914 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: fetch active netmon server from hiera

https://gerrit.wikimedia.org/r/529914

We're now collecting metrics from all managed PDUs into Prometheus, including environmental sensors. The names reflect what's in the SNMP MIB, modulo the pdu_ prefix we're using to namespace the metrics in Prometheus.

Sentry3

pdu_envMonContactClosureCount
pdu_envMonID
pdu_envMonName
pdu_envMonStatus
pdu_envMonTempHumidSensorCount
pdu_infeedApparentPower
pdu_infeedCapacityUsed
pdu_infeedCapacity
pdu_infeedCrestFactor
pdu_infeedEnergy
pdu_infeedID
pdu_infeedLineID
pdu_infeedLineToLineID
pdu_infeedLoadHighThresh
pdu_infeedLoadStatus
pdu_infeedLoadValue
pdu_infeedName
pdu_infeedOutletCount
pdu_infeedPhaseCurrent
pdu_infeedPhaseID
pdu_infeedPhaseVoltage
pdu_infeedPowerFactor
pdu_infeedPower
pdu_infeedReactance
pdu_infeedStatus
pdu_infeedVoltage
pdu_outletApparentPower
pdu_outletCapacity
pdu_outletControlAction
pdu_outletControlState
pdu_outletCrestFactor
pdu_outletEnergy
pdu_outletID
pdu_outletLoadHighThresh
pdu_outletLoadLowThresh
pdu_outletLoadStatus
pdu_outletLoadValue
pdu_outletName
pdu_outletPostOnDelay
pdu_outletPowerFactor
pdu_outletPower
pdu_outletStatus
pdu_outletVoltage
pdu_outletWakeupState
pdu_sysUpTime
pdu_tempHumidSensorHumidHighThresh
pdu_tempHumidSensorHumidLowThresh
pdu_tempHumidSensorHumidRecDelta
pdu_tempHumidSensorHumidStatus
pdu_tempHumidSensorHumidValue
pdu_tempHumidSensorID
pdu_tempHumidSensorName
pdu_tempHumidSensorStatus
pdu_tempHumidSensorTempHighThresh
pdu_tempHumidSensorTempLowThresh
pdu_tempHumidSensorTempRecDelta
pdu_tempHumidSensorTempScale
pdu_tempHumidSensorTempStatus
pdu_tempHumidSensorTempValue
pdu_towerActivePower
pdu_towerApparentPower
pdu_towerEnergy
pdu_towerID
pdu_towerInfeedCount
pdu_towerLineFrequency
pdu_towerModelNumber
pdu_towerName
pdu_towerPowerFactor
pdu_towerProductSN
pdu_towerStatus
pdu_towerVACapacityUsed
pdu_towerVACapacity

And Sentry4:

pdu_st4BranchCurrentCapacity
pdu_st4BranchCurrentStatus
pdu_st4BranchCurrentUtilized
pdu_st4BranchCurrent
pdu_st4BranchID
pdu_st4BranchLabel
pdu_st4BranchOcpID
pdu_st4BranchOutletCount
pdu_st4BranchPhaseID
pdu_st4BranchState
pdu_st4BranchStatus
pdu_st4HumidSensorID
pdu_st4HumidSensorName
pdu_st4HumidSensorStatus
pdu_st4HumidSensorValue
pdu_st4InputCordActivePowerStatus
pdu_st4InputCordActivePower
pdu_st4InputCordApparentPowerStatus
pdu_st4InputCordApparentPower
pdu_st4InputCordBranchCount
pdu_st4InputCordCurrentCapacityMax
pdu_st4InputCordCurrentCapacity
pdu_st4InputCordEnergy
pdu_st4InputCordFrequency
pdu_st4InputCordID
pdu_st4InputCordInletType
pdu_st4InputCordLineCount
pdu_st4InputCordName
pdu_st4InputCordNominalVoltageMax
pdu_st4InputCordNominalVoltageMin
pdu_st4InputCordNominalVoltage
pdu_st4InputCordOcpCount
pdu_st4InputCordOutOfBalanceStatus
pdu_st4InputCordOutOfBalance
pdu_st4InputCordOutletCount
pdu_st4InputCordPhaseCount
pdu_st4InputCordPowerCapacity
pdu_st4InputCordPowerFactorStatus
pdu_st4InputCordPowerFactor
pdu_st4InputCordPowerUtilized
pdu_st4InputCordState
pdu_st4InputCordStatus
pdu_st4LineCurrentCapacity
pdu_st4LineCurrentStatus
pdu_st4LineCurrentUtilized
pdu_st4LineCurrent
pdu_st4LineID
pdu_st4LineLabel
pdu_st4LineState
pdu_st4LineStatus
pdu_st4OcpBranchCount
pdu_st4OcpCurrentCapacityMax
pdu_st4OcpCurrentCapacity
pdu_st4OcpID
pdu_st4OcpLabel
pdu_st4OcpOutletCount
pdu_st4OcpStatus
pdu_st4OcpType
pdu_st4OutletBranchID
pdu_st4OutletCurrentCapacity
pdu_st4OutletID
pdu_st4OutletName
pdu_st4OutletOcpID
pdu_st4OutletPhaseID
pdu_st4OutletPowerCapacity
pdu_st4OutletSocketType
pdu_st4OutletState
pdu_st4OutletStatus
pdu_st4PhaseActivePower
pdu_st4PhaseApparentPower
pdu_st4PhaseBranchCount
pdu_st4PhaseCurrentCrestFactor
pdu_st4PhaseCurrent
pdu_st4PhaseEnergy
pdu_st4PhaseID
pdu_st4PhaseLabel
pdu_st4PhaseNominalVoltage
pdu_st4PhaseOutletCount
pdu_st4PhasePowerFactorStatus
pdu_st4PhasePowerFactor
pdu_st4PhaseReactance
pdu_st4PhaseState
pdu_st4PhaseStatus
pdu_st4PhaseVoltageDeviation
pdu_st4PhaseVoltageStatus
pdu_st4PhaseVoltage
pdu_st4TempSensorID
pdu_st4TempSensorName
pdu_st4TempSensorStatus
pdu_st4TempSensorValueMax
pdu_st4TempSensorValueMin
pdu_st4TempSensorValue
pdu_sysUpTime

From a chat with @faidon it emerged that we have at least three main use cases for PDU metrics:

Checking overload / availability of rack infeeds (e.g. for redundant power, if we're using over 50% of available power that means that going non-redundant will trip the breaker)
Power consumption for general site monitoring (per row/rack/site)
Capacity planning (e.g. for footprint expansion or shrinkage as needed) (per row/rack/site)

I'd like to get some input / review on which of the above infeed metrics we should be looking at to get the right numbers out, cc DC-Ops @wiki_willy
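
The first use case (redundant power and the 50% rule) can be illustrated with a small sketch. It assumes two infeeds of equal capacity per rack and ignores any breaker derating; the function names and the example readings are hypothetical.

```python
def redundant_utilization(load_a: float, load_b: float, capacity_per_feed: float) -> float:
    """Fraction of the total (both feeds) capacity currently in use."""
    return (load_a + load_b) / (2 * capacity_per_feed)


def safe_to_lose_a_feed(load_a: float, load_b: float, capacity_per_feed: float) -> bool:
    """At or below 50% of total capacity, one feed alone can carry the whole load."""
    return redundant_utilization(load_a, load_b, capacity_per_feed) <= 0.5


# Hypothetical readings in amps for two racks fed by 24 A infeeds.
print(safe_to_lose_a_feed(6.0, 7.0, 24.0))    # combined 13 A fits on one 24 A feed
print(safe_to_lose_a_feed(14.0, 12.0, 24.0))  # combined 26 A does not
```

The 50% threshold follows from the arithmetic: if both feeds together carry more than half of the combined capacity, the surviving feed would have to carry more than its own rating.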

In T148541#5413482, @fgiunchedi wrote:

From a chat with @faidon it emerged that we have at least three main use cases for PDU metrics:

Checking overload / availability of rack infeeds (e.g. for redundant power, if we're using over 50% of available power that means that going non-redundant will trip the breaker)

Power consumption for general site monitoring (per row/rack/site)

Capacity planning (e.g. for footprint expansion or shrinkage as needed) (per row/rack/site)

I'd like to get some input / review on which of the above infeed metrics we should be looking at to get the right numbers out, cc DC-Ops @wiki_willy

So we also would love it if the metrics showed the phase loads on the XYZ phases for our 3 phase power. Those three phases need to stay closely balanced to prevent issues like loss of power efficiency and heat buildup, or the overload of one of the 3 phases before the others causing the PDU to improperly be at capacity. Seeing all of this in an easy metric review would be excellent.

So we likely need the following metrics for each PDU tower:

input voltage/amps for each tower (to show we're getting proper power delivery from the provider)
load/amps/voltage for the overall PDU utilization (to ensure no PDU is going over 50% capacity)
- load/amps/voltage of the overall Rack (combine ps1+ps2 total power utilization)
- load/amps/voltage utilization for each phase in a 3 phase PDU (to ensure no phase is over 50% capacity & to keep them in sync)
  - Error reporting if these are ever more than X% out of sync. (We need to investigate what that % should be via best practices, right now we just try to get them as close as possible.)

This will allow us to do the things you outline, being:

checking overload/available power overhead in each rack.
overall power consumption on rack/site for reporting
capacity planning

I'll go over the above list in more detail and pick out the specific line items, but I wanted to output what my overall use of metrics is for PDUs right away.
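
As an illustration of the "X% out of sync" check in the list above, here is a minimal sketch. It assumes one common definition of imbalance (largest deviation from the mean phase current, as a percentage of the mean); the thread does not settle on a definition or a threshold, so the function names and the 15% value in the example are hypothetical.

```python
def phase_imbalance_pct(currents: list[float]) -> float:
    """Largest deviation from the mean phase current, as a percentage of the mean."""
    avg = sum(currents) / len(currents)
    if avg == 0:
        return 0.0  # no load on any phase, nothing to balance
    max_dev = max(abs(c - avg) for c in currents)
    return 100 * max_dev / avg


def out_of_sync(currents: list[float], threshold_pct: float) -> bool:
    """True if the phases differ by more than the chosen threshold."""
    return phase_imbalance_pct(currents) > threshold_pct


# Hypothetical X/Y/Z phase currents in amps.
balanced = phase_imbalance_pct([10.0, 10.0, 10.0])
skewed = phase_imbalance_pct([8.0, 10.0, 12.0])
print(balanced, skewed, out_of_sync([8.0, 10.0, 12.0], 15.0))
```

Single phase PDUs have no phases to compare, so a check like this would only apply to the three phase models.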

Thanks a lot @RobH for the explanation! Please let me know if I can help with progressing this further.

Note that the data is in LibreNMS as well, but with some limitations:

5min granularity
Not possible to stack or sum graphs (each power graph is independent)

On the plus side we do have threshold alerting.

Following up from IRC with @RobH: what's needed is the list of metrics from above to process, and how to process them (e.g. do they need to be combined, depending on the model?), so that we can do alerting on e.g. phase imbalance (for three phase; single phase will need to alert differently), capacity planning, and power usage reporting.

fgiunchedi mentioned this in T226778: Install new PDUs in rows A/B (Top level tracking task). Sep 18 2019, 8:20 AM

In T148541#5432212, @RobH wrote:

So we likely need the following metrics for each PDU tower:

input voltage/amps for each tower (to show we're getting proper power delivery from the provider)

I took a stab at expressing these metrics as Prometheus queries below, at least for three phase PDUs. Does that seem reasonable?

Input voltage per tower (average across phases, in volts). Voltage is reported in tenths of volts.
sentry4: avg by (instance, st4UnitIndex) (pdu_st4PhaseVoltage / 10)
sentry3: avg by (instance, towerIndex) (pdu_infeedVoltage / 10)

Input current per tower (average across phases, in amps). Apparent power / voltage (tenths of volts).
sentry4: avg by (instance, st4UnitIndex) (pdu_st4PhaseApparentPower / (pdu_st4PhaseVoltage/10))
sentry3: avg by (instance, towerIndex) (pdu_infeedApparentPower / (pdu_infeedVoltage/10))
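
The arithmetic behind the two sets of queries above can be worked through on sample numbers. This is a sketch only: the readings are made up, and it assumes apparent power is reported in volt-amperes.

```python
from statistics import mean

# Hypothetical raw SNMP readings for one three phase tower.
phase_voltage_tenths = [2081, 2079, 2083]    # reported in tenths of volts
phase_apparent_power_va = [1040, 832, 1248]  # assumed to be volt-amperes

# Same steps as the queries: scale voltage to volts, then divide per phase.
volts = [v / 10 for v in phase_voltage_tenths]
amps = [va / v for va, v in zip(phase_apparent_power_va, volts)]

# "avg by (instance, ...)" corresponds to averaging across the phases of one tower.
avg_volts = mean(volts)
avg_amps = mean(amps)
print(f"{avg_volts:.1f} V, {avg_amps:.2f} A")
```

Note that the division happens per phase before averaging, which matches how the queries divide the two series element-wise and then apply avg.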

load/amps/voltage for the overall PDU utilization (to ensure no PDU is going over 50% capacity)

load/amps/voltage of the overall Rack (combine ps1+ps2 total power utilization)

load/amps/voltage utilization for each phase in a 3 phase PDU (to ensure no phase is over 50% capacity & to keep them in sync)

Error reporting if these are ever more than X% out of sync. (We need to investigate what that % should be via best practices, right now we just try to get them as close as possible.)

In this case, what's considered the load? I'm guessing energy consumption in kWh?

Change 552267 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: aggregate Sentry 4 metrics

https://gerrit.wikimedia.org/r/552267

Change 552267 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: aggregate Sentry 4 metrics

https://gerrit.wikimedia.org/r/552267

fgiunchedi moved this task from Up next to In progress on the observability board. Nov 25 2019, 1:46 PM

RobH unsubscribed. Mar 3 2020, 6:17 PM

Everything is in place now: we're collecting SNMP data from the PDUs via snmp_exporter for both PDU models, and offer aggregated plus drilldown views at https://grafana.wikimedia.org/d/f64mmDzMz/power-usage

Boldly resolving
