Various netbox alerts running for days
Open, Medium, Public

Description

The following netbox alerts have been firing for quite a few days on netbox1001. I have ack'ed them in Icinga to clean things up, and I am creating this task for follow-up.

Captura de pantalla 2021-05-24 a las 9.56.52.png (216×1 px, 91 KB)

Event Timeline

This was the error shown in a few of the alerts:

An exception occurred: SSLError: HTTPSConnectionPool(host='puppetdb1002.eqiad.wmnet', port=8090): Max retries exceeded with url: ///v1/facts/is_virtual (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)')))
<pre>
Traceback (most recent call last):
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/urllib3/connection.py", line 421, in connect
    tls_in_tls=tls_in_tls,
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 429, in ssl_wrap_socket
    sock, context, tls_in_tls, server_hostname=server_hostname
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 472, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/lib/python3.7/ssl.py", line 412, in wrap_socket
    session=session
  File "/usr/lib/python3.7/ssl.py", line 853, in _create
    self.do_handshake()
  File "/usr/lib/python3.7/ssl.py", line 1117, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/urllib3/util/retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='puppetdb1002.eqiad.wmnet', port=8090): Max retries exceeded with url: ///v1/facts/is_virtual (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/deployment/netbox/deploy-cache/revs/dabbf5ed421a1f0d1a0c086726f646e2d29becd2/src/netbox/extras/reports.py", line 232, in run
    test_method()
  File "/srv/deployment/netbox-extras//reports/puppetdb.py", line 125, in test_netbox_in_puppetdb
    puppetdb_devices = self._get_puppetdb_fact("is_virtual")
  File "/srv/deployment/netbox-extras//reports/puppetdb.py", line 48, in _get_puppetdb_fact
    response = requests.get(url, verify=config["puppetdb"]["ca_cert"])
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='puppetdb1002.eqiad.wmnet', port=8090): Max retries exceeded with url: ///v1/facts/is_virtual (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)')))
</pre>
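
For anyone debugging this: "unable to get local issuer certificate" usually means the trust store handed to the TLS layer does not contain the CA that issued the server's certificate. A rough way to check what a given bundle actually trusts is sketched below. The helper name `list_trusted_cas` is made up for illustration and is not part of netbox-extras; only standard-library `ssl` calls are used.

```python
import ssl


def list_trusted_cas(cafile=None):
    """Return the commonName of every CA certificate loaded from a bundle.

    Hypothetical helper for illustration only. With cafile=None the
    interpreter's default trust store is loaded instead of a specific file.
    """
    # When cafile is given, only that file is loaded; otherwise the
    # platform defaults are used.
    context = ssl.create_default_context(cafile=cafile)
    names = []
    for cert in context.get_ca_certs():
        # 'subject' is a tuple of RDNs, each a tuple of (key, value) pairs.
        for rdn in cert.get("subject", ()):
            for key, value in rdn:
                if key == "commonName":
                    names.append(value)
    return names


if __name__ == "__main__":
    for name in list_trusted_cas():
        print(name)
```

Passing the path stored in `config["puppetdb"]["ca_cert"]` (the value seen in the traceback) as `cafile` would show whether the issuer of the puppetdb1002 certificate is in that bundle at all.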

Change 693863 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] C:netbox: update SSL config to work with pki.discovery.wmnet

https://gerrit.wikimedia.org/r/693863

Change 693864 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/software/netbox-extras@master] reports/puppetdb: use system certificate store for tls verification

https://gerrit.wikimedia.org/r/693864

Change 693863 merged by Jbond:

[operations/puppet@production] C:netbox: update SSL config to work with pki.discovery.wmnet

https://gerrit.wikimedia.org/r/693863

Change 693864 merged by Jbond:

[operations/software/netbox-extras@master] reports/puppetdb: use system certificate store for tls verification

https://gerrit.wikimedia.org/r/693864
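
For context, a minimal sketch of what "use system certificate store" can look like with `requests`. This is not the merged patch, just an illustration under assumptions: the helper name `system_ca_bundle` is hypothetical, and the Debian bundle path is assumed for the host. Note that inside a virtualenv `requests` defaults to the certifi bundle rather than the system one, so the path has to be passed explicitly.

```python
import os
import ssl

# Assumption: Debian-style location of the system CA bundle.
DEBIAN_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"


def system_ca_bundle():
    """Return a value usable as the `verify=` argument of requests.get().

    Hypothetical helper: prefers the CA file OpenSSL reports as its
    default, then the Debian bundle path, and finally falls back to True
    (which makes requests use its own default bundle).
    """
    paths = ssl.get_default_verify_paths()
    for candidate in (paths.cafile, DEBIAN_CA_BUNDLE):
        if candidate and os.path.isfile(candidate):
            return candidate
    return True


# Usage sketch (requires the third-party `requests` package):
#   response = requests.get(url, verify=system_ca_bundle())
```

With something like this the report no longer depends on a dedicated CA file in its config, so a CA change only needs the system store to be updated.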

@Volans I was just looking through your reverts/patches and checked on netbox-next, and I think this is fixed now.

Yes indeed, the specific TLS alert is resolved. As for the other failing reports AFAICT they require DC-Ops to chime in and have a look.

Yup, we have T283483 opened to address some of the alerts. There are also some LibreNMS alerts related to some sample PDUs we're testing out at codfw, so those will probably continue alerting over the next month or so. Thanks, Willy

FYI, I cleaned up the LibreNMS alert, so only the accounting one is still alerting.

Current status: 5 out of 9 Netbox reports are failing.