mwgate-node12-docker gate-and-submit builds failing (ENOENT _cacache errors resulting in corrupted tarballs)
Closed, Resolved (Public)

Description

mwgate-node12-docker builds seem to be consistently failing during the npm ci step since build #51491, 9:37 UTC today. First there’s a ton of ENOENT errors:

npm WARN old lockfile FetchError: Invalid response body while trying to fetch https://registry.npmjs.org/@babel%2fhighlight: ENOENT: no such file or directory, lstat '/cache/_cacache/content-v2/sha512/06/71/cb2a2bdfbdbf1bd35a24420468477d7f941ace00f42faa2fc38a2739e705faaad4d1be129a4514dd06948bac3db7598939323e2ed30957d33cc6a63caa6e'
npm WARN old lockfile     at /srv/npm/node_modules/minipass-fetch/lib/body.js:162:15
npm WARN old lockfile     at async Array.<anonymous> (/srv/npm/node_modules/@npmcli/arborist/lib/arborist/build-ideal-tree.js:691:9)
npm WARN old lockfile  Could not fetch metadata for @babel/highlight@7.13.10 FetchError: Invalid response body while trying to fetch https://registry.npmjs.org/@babel%2fhighlight: ENOENT: no such file or directory, lstat '/cache/_cacache/content-v2/sha512/06/71/cb2a2bdfbdbf1bd35a24420468477d7f941ace00f42faa2fc38a2739e705faaad4d1be129a4514dd06948bac3db7598939323e2ed30957d33cc6a63caa6e'
npm WARN old lockfile     at /srv/npm/node_modules/minipass-fetch/lib/body.js:162:15
npm WARN old lockfile     at async Array.<anonymous> (/srv/npm/node_modules/@npmcli/arborist/lib/arborist/build-ideal-tree.js:691:9) {
npm WARN old lockfile   code: 'ENOENT',
npm WARN old lockfile   errno: 'ENOENT',
npm WARN old lockfile   syscall: 'lstat',
npm WARN old lockfile   path: '/cache/_cacache/content-v2/sha512/06/71/cb2a2bdfbdbf1bd35a24420468477d7f941ace00f42faa2fc38a2739e705faaad4d1be129a4514dd06948bac3db7598939323e2ed30957d33cc6a63caa6e',
npm WARN old lockfile   type: 'system'
npm WARN old lockfile }
npm WARN old lockfile FetchError: Invalid response body while trying to fetch https://registry.npmjs.org/acorn-jsx: ENOENT: no such file or directory, lstat '/cache/_cacache/content-v2/sha512/e1/81/dcf5868737b247acf75fb69818c6998bdf9dda09e826bb2cdd0458e4c15d7dc5ff5fb5ac401c8ed24c81f20df1763d684891cd435abb7ef3c117aa3b2459'
npm WARN old lockfile     at /srv/npm/node_modules/minipass-fetch/lib/body.js:162:15
npm WARN old lockfile     at async Array.<anonymous> (/srv/npm/node_modules/@npmcli/arborist/lib/arborist/build-ideal-tree.js:691:9)
npm WARN old lockfile  Could not fetch metadata for acorn-jsx@5.3.1 FetchError: Invalid response body while trying to fetch https://registry.npmjs.org/acorn-jsx: ENOENT: no such file or directory, lstat '/cache/_cacache/content-v2/sha512/e1/81/dcf5868737b247acf75fb69818c6998bdf9dda09e826bb2cdd0458e4c15d7dc5ff5fb5ac401c8ed24c81f20df1763d684891cd435abb7ef3c117aa3b2459'
npm WARN old lockfile     at /srv/npm/node_modules/minipass-fetch/lib/body.js:162:15
npm WARN old lockfile     at async Array.<anonymous> (/srv/npm/node_modules/@npmcli/arborist/lib/arborist/build-ideal-tree.js:691:9) {
npm WARN old lockfile   code: 'ENOENT',
npm WARN old lockfile   errno: 'ENOENT',
npm WARN old lockfile   syscall: 'lstat',
npm WARN old lockfile   path: '/cache/_cacache/content-v2/sha512/e1/81/dcf5868737b247acf75fb69818c6998bdf9dda09e826bb2cdd0458e4c15d7dc5ff5fb5ac401c8ed24c81f20df1763d684891cd435abb7ef3c117aa3b2459',
npm WARN old lockfile   type: 'system'
npm WARN old lockfile }
npm WARN old lockfile FetchError: Invalid response body while trying to fetch https://registry.npmjs.org/ansi-regex: ENOENT: no such file or directory, lstat '/cache/_cacache/content-v2/sha512/98/4e/b32b490be0a12d75c86e002a64b1caf5f5c3d785392d2d01e42adbdf17516bb3b7bbf2c23819bff2b29927338c1f0d1d2da3a9f2c445c9073380503cddd1'
npm WARN old lockfile     at /srv/npm/node_modules/minipass-fetch/lib/body.js:162:15
npm WARN old lockfile     at async Array.<anonymous> (/srv/npm/node_modules/@npmcli/arborist/lib/arborist/build-ideal-tree.js:691:9)
npm WARN old lockfile  Could not fetch metadata for ansi-regex@5.0.1 FetchError: Invalid response body while trying to fetch https://registry.npmjs.org/ansi-regex: ENOENT: no such file or directory, lstat '/cache/_cacache/content-v2/sha512/98/4e/b32b490be0a12d75c86e002a64b1caf5f5c3d785392d2d01e42adbdf17516bb3b7bbf2c23819bff2b29927338c1f0d1d2da3a9f2c445c9073380503cddd1'
npm WARN old lockfile     at /srv/npm/node_modules/minipass-fetch/lib/body.js:162:15
npm WARN old lockfile     at async Array.<anonymous> (/srv/npm/node_modules/@npmcli/arborist/lib/arborist/build-ideal-tree.js:691:9) {
npm WARN old lockfile   code: 'ENOENT',
npm WARN old lockfile   errno: 'ENOENT',
npm WARN old lockfile   syscall: 'lstat',
npm WARN old lockfile   path: '/cache/_cacache/content-v2/sha512/98/4e/b32b490be0a12d75c86e002a64b1caf5f5c3d785392d2d01e42adbdf17516bb3b7bbf2c23819bff2b29927338c1f0d1d2da3a9f2c445c9073380503cddd1',
npm WARN old lockfile   type: 'system'
npm WARN old lockfile }
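
For triage, the warnings above can be reduced to a list of affected packages and the cache files they expected. The following is a minimal Python sketch, not part of the CI tooling; the function name `missing_cache_entries` and the regular expression are illustrative assumptions based only on the log format shown here.

```python
import re

# Matches the "Could not fetch metadata for <name>@<version>" lines shown above
# and captures the package spec plus the cache path that lstat could not find.
MISSING = re.compile(
    r"Could not fetch metadata for (?P<spec>\S+) FetchError: .*?"
    r"ENOENT: no such file or directory, lstat '(?P<path>[^']+)'"
)


def missing_cache_entries(log_text):
    """Return {package spec: missing cache path} for every matching warning."""
    found = {}
    for line in log_text.splitlines():
        match = MISSING.search(line)
        if match:
            found[match.group("spec")] = match.group("path")
    return found
```

Feeding it the console output of a failed build would give one entry per package whose metadata lookup hit a missing cache file, which makes it easier to compare failures across hosts.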

And then at the end a bunch of corrupted tarballs:

npm WARN tarball tarball data for which@https://registry.npmjs.org/which/-/which-2.0.2.tgz (sha512-BLI3Tl1TW3Pvl70l3yq3Y64i+awpwXqsGBYWkkqMtnbXgrMD+yj7rhW0kuEDxzJaYXGjEW5ogapKNMEKNMjibA==) seems to be corrupted. Trying again.
npm WARN tarball tarball data for unc-path-regex@https://registry.npmjs.org/unc-path-regex/-/unc-path-regex-0.1.2.tgz (sha1-5z3T17DXxe2G+6xrCufYxqadUPo=) seems to be corrupted. Trying again.
npm WARN tarball tarball data for spdx-exceptions@https://registry.npmjs.org/spdx-exceptions/-/spdx-exceptions-2.3.0.tgz (sha512-/tTrYOC7PPI1nUAgx34hUpqXuyJG+DTHJTnIULG4rDygi4xu/tfgmq1e1cIRwRzwZgo4NLySi+ricLkZkw4i5A==) seems to be corrupted. Trying again.
npm WARN tarball tarball data for spdx-license-ids@https://registry.npmjs.org/spdx-license-ids/-/spdx-license-ids-3.0.7.tgz (sha512-U+MTEOO0AiDzxwFvoa4JVnMV6mZlJKk2sBLt90s7G0Gd0Mlknc7kxEn3nuDPNZRta7O2uy8oLcZLVT+4sqNZHQ==) seems to be corrupted. Trying again.
npm WARN tarball tarball data for type-check@https://registry.npmjs.org/type-check/-/type-check-0.4.0.tgz (sha512-XleUoc9uwGXqjWwXaUTZAmzMcFZ5858QA2vvx1Ur5xIcixXIP+8LnFDgRplU30us6teqdlskFfu+ae4K79Ooew==) seems to be corrupted. Trying again.
npm WARN tarball tarball data for require-from-string@https://registry.npmjs.org/require-from-string/-/require-from-string-2.0.2.tgz (sha512-Xf0nWe6RseziFMu+Ap9biiUbmplq6S9/p+7w7YXP/JBHhrUDDUhwa+vANyubuqfZWTveU//DYVGsDG7RKL/vEw==) seems to be corrupted. Trying again.
npm WARN tarball tarball data for rimraf@https://registry.npmjs.org/rimraf/-/rimraf-3.0.2.tgz (sha512-JZkJMZkAGFFPP2YqXZXPbMlMBgsxzE8ILs4lMIX/2o0L9UBw9O/Y3o6wFw/i9YLapcUJWwqbi3kdxIPdC62TIA==) seems to be corrupted. Trying again.
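
The integrity values in these warnings are Subresource Integrity strings of the form `<algorithm>-<base64 digest>`, and the `_cacache` paths in the ENOENT errors look like hex digests split into directory levels. The sketch below, under that assumption, shows how a digest could be checked against downloaded bytes and mapped to a cache path. The helper names are hypothetical and the path layout is inferred from the log rather than taken from npm's documentation.

```python
import base64
import hashlib


def parse_sri(integrity):
    """Split an SRI string such as 'sha512-<base64>' into (algorithm, raw digest)."""
    algorithm, _, b64 = integrity.partition("-")
    return algorithm, base64.b64decode(b64)


def cache_content_path(integrity, cache_root="/cache/_cacache"):
    """Map an SRI string to a content path.

    Assumed layout, inferred from the paths in the log:
    content-v2/<algorithm>/<hex[0:2]>/<hex[2:4]>/<hex[4:]>
    """
    algorithm, digest = parse_sri(integrity)
    hexdigest = digest.hex()
    return (
        f"{cache_root}/content-v2/{algorithm}/"
        f"{hexdigest[:2]}/{hexdigest[2:4]}/{hexdigest[4:]}"
    )


def matches_integrity(data, integrity):
    """Return True if the bytes hash to the digest recorded in the SRI string."""
    algorithm, digest = parse_sri(integrity)
    return hashlib.new(algorithm, data).digest() == digest
```

A tarball that npm reports as "corrupted" is one where a check of this kind fails, which is consistent with a cache entry that is present but incomplete, or missing altogether.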

Event Timeline

Shared build failure (blocking merges in several extensions) ⇒ UBN!

Yes, this is blocking CI from even verifying patches, not just merges. Thanks for filing, Lucas!

This currently affects two patches I'm working on.

Seems to happen on several integration-agent-docker-* hosts (e.g. 1003, 1010, 1016), so it doesn’t look like it’s an issue with the existing cache on one of those hosts (which gets bind-mounted into the Docker container).

NPM isn’t reporting any issues on their status page yet.

I can’t reproduce the error locally with fresh-node, but that has different Node/npm versions than CI (Fresh: Node.js v12.21.0, npm 7.5.2; CI: Node.js v12.22.5, npm 7.21.0), so that might not mean much.

Change 734946 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/core@master] DNM: empty change to test CI

https://gerrit.wikimedia.org/r/734946

Hm, the mwgate-node12-docker build succeeded here (#51522). Let me recheck another change to see if the issue doesn’t affect MediaWiki core or if it randomly fixed itself.

Change 734948 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] WIP: Update package-lock.json to lockfileVersion 2

https://gerrit.wikimedia.org/r/734948

Updating all three of Wikibase’s package-lock.json files (#51528) seems to get rid of the ENOENT errors; the corrupted tarball messages are still there, but apparently they don’t block the npm ci from succeeding. The overall build still fails and I don’t yet see why (lints and tests seem to run more or less successfully).

Ah, found the error in that build.

Files dist//tainted-ref.common.js and /tmp/tainted-refs-build/tainted-ref.common.js differ
ERROR: "test:distnodiff" exited with 123.

That could conceivably be due to the package-lock.json changes, I’ll try rebuilding tainted-refs.

The same was reported a few days ago in T293937#7450744, which was on mediawiki/skins/Vector and had an issue with:

ENOENT: no such file or directory, lstat '/cache/_cacache/content-v2/sha512/a0/b6/650d44d3252a72eb5e0c00a8d83f11789b2bded5c9debcbcbcfbc805ff3b82baaebd2088308dc7ee724bdf29cc26112d30e5b52f5161e4d5c11a8309b022'

What I suspect is a race condition. When a build starts, it rsyncs the npm cache from the integration-castor instance and then runs npm, which gains some speed by hitting the local cache. If, while the cache is being retrieved, another job that has completed in postmerge/gate-and-submit is in the process of saving the cache, one of the files might end up being deleted, which would corrupt the cache for builds that retrieve it.

We save with rsync --delete-delay --delay-updates and load the cache with rsync --delay-updates

The cache is shared across MediaWiki repositories and namespaced by branch and job name. I thought some cache entries might be deleted because different repositories have different dependencies, but a dependency that is not used by a repository has still been populated when loading the cache and should thus be saved back.
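
The suspected interleaving can be illustrated with a toy model. This Python sketch does not call rsync and does not reproduce the behaviour of `--delay-updates` or `--delete-delay`; the directory names, file names, and the single-file index are hypothetical stand-ins. It only shows how a load that overlaps with a deleting save could end up with an index that points at missing content.

```python
import shutil
import tempfile
from pathlib import Path


def simulate_interleaved_load_and_save():
    """Return the index entries whose content is missing in the loaded copy."""
    with tempfile.TemporaryDirectory() as tmp:
        central = Path(tmp, "central")  # stands in for the cache on the castor host
        local = Path(tmp, "local")      # stands in for the cache a build loads
        (central / "content").mkdir(parents=True)
        (local / "content").mkdir(parents=True)

        # Central cache: an index naming two content files, both present.
        for name in ("aaa", "bbb"):
            (central / "content" / name).write_text("data")
        (central / "index").write_text("aaa\nbbb\n")

        # Step 1: the loading build copies the index first.
        shutil.copy(central / "index", local / "index")

        # Step 2: a concurrent save with deletion enabled removes a content
        # file that the saving job's own cache did not contain.
        (central / "content" / "bbb").unlink()

        # Step 3: the loading build copies whatever content is left.
        for item in (central / "content").iterdir():
            shutil.copy(item, local / "content" / item.name)

        wanted = (local / "index").read_text().split()
        return [name for name in wanted if not (local / "content" / name).exists()]
```

In this model the loaded copy still lists `bbb` in its index but has no file for it, which is the same shape of failure as an `lstat` ENOENT on a `content-v2` path.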

I guess I can nuke all caches and start fresh :(

Mentioned in SAL (#wikimedia-releng) [2021-10-27T12:22:39Z] <hashar> integration-castor03: sudo rm -fR /srv/jenkins-workspace/caches/castor-mw-ext-and-skins/master/mwgate-node12-docker # T294426 T293937

Change 734948 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] WIP: Update package-lock.json to lockfileVersion 2

https://gerrit.wikimedia.org/r/734948

Main test build succeeded now, so it looks like upgrading the lock file works, though it might change the installed package versions slightly (otherwise the Bridge and Tainted Refs builds wouldn’t have changed).

Change 734948 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Update package-lock.json to lockfileVersion 2

https://gerrit.wikimedia.org/r/734948

Lucas_Werkmeister_WMDE claimed this task.

I think we can close this as resolved. @hashar fixed it for all repos with the cache wipe, but upgrading the lockfile also worked even before the cache was wiped – so it might be a good idea for other extensions to also upgrade to lockfileVersion 2 (might eventually be automated, see T273785 and subtasks).

Change 734946 abandoned by Lucas Werkmeister (WMDE):

[mediawiki/core@master] DNM: empty change to test CI

Reason:

https://gerrit.wikimedia.org/r/734946