Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes
Closed, Resolved (Public)

Description

Description: T176312: Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions)
Timeline: <A desired timeline>
Diagram: <Link to an architectural diagram>
Technologies: PHP, bash
Point persons: @Legoktm, Wikidata team (Initial deployment: @Ladsgroup @Addshore @Lucas_Werkmeister_WMDE)

This is being run in an instance separate from the Score Shellbox, both for better isolation and because the performance characteristics will be rather different.


  • helm chart: already done
  • shellbox-constraints namespaces in k8s
  • shellbox-constraints accounts in k8s. ??
  • shellbox-constraints puppet private tokens.
  • Generate TLS certificates
  • Review helmfile.d files:
  • LVS setup
  • DNS for LVS records
  • Discovery DNS
  • envoy proxy
  • Monitoring dashboard
  • Integration and Acceptance tests

Event Timeline

Restricted Application added a subscriber: Aklapper.
jbond triaged this task as Medium priority. Jun 21 2021, 2:46 PM

Any idea on a timeline for being able to get this ticket moving?
It's blocking T176312 which is in turn blocking T204031!

The main person working on this is Kunal, and he was busy last week with deploying Shellbox for Score and the DC switchover. Next week is a WMF holiday, so I guess we can start looking at it the week after that (/me secretly wishes we had several clones of Kunal).

If I'm correct that's this week? :D

Yeah, we're looking at it :D It will take a bit of time; there are some open questions, like LVS or ingress, that we need to address.

Is there an update on this? Anything we (WMDE) can do to help this move forward?

I think there are two options, depending on the level of security we want to achieve and the urgency of bringing this to production:

  1. We just point to the current installation and it should just work(TM). But we'd need to perform a small migration afterwards.
  2. We wait for Serviceops to set up an ingress properly, which is next in line for us, and we do a proper separate shellbox deployment for this. It will take more time but will in the end cost less.

Depending on:

  • When will this be needed?
  • How stringent are the isolation needs?

we can pick the best option.

I think there are two options, depending on the level of security we want to achieve and the urgency of bringing this to production:

  1. We just point to the current installation and it should just work(TM). But we'd need to perform a small migration afterwards.
  2. We wait for Serviceops to set up an ingress properly, which is next in line for us, and we do a proper separate shellbox deployment for this. It will take more time but will in the end cost less.

Can you give us a very rough idea of what "more time" means here? Are we talking about a small number of weeks, or about 6 months?

Depending on:

  • When will this be needed?

It would be nice to unblock those tasks, but we have lived with the current situation for so long that a bit more time won't make a huge difference, I think.

  • How stringent are the isolation needs?

I'm not entirely sure, tbh. My limited understanding was that all this effort is made in order to guard against malicious regex attacks in a more efficient way. So in that sense, more isolation might be preferable? But I'm really not the expert here, @Ladsgroup or @Lucas_Werkmeister_WMDE can probably give a more authoritative answer and correct the rubbish that I'm talking.
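To make the concern about malicious regexes concrete, here is a minimal Python sketch of one generic way to bound the cost of a user-provided pattern: run the match in a child process and kill it after a timeout. This is not how Shellbox or WikibaseQualityConstraints is implemented; the function name `safe_fullmatch`, the exit-code convention, the timeout values, and the choice of a full match are all assumptions made for illustration.

```python
import subprocess
import sys

# Child script: exit code 0 on match, 1 on no match, 2 on an invalid pattern.
_CHILD = (
    "import re, sys\n"
    "try:\n"
    "    pattern = re.compile(sys.argv[1])\n"
    "except re.error:\n"
    "    sys.exit(2)\n"
    "sys.exit(0 if pattern.fullmatch(sys.argv[2]) else 1)\n"
)


def safe_fullmatch(pattern, text, timeout=5.0):
    """Return True or False for match or no match, None on timeout or bad pattern.

    Hypothetical helper: the regex runs in a separate Python process so that
    a catastrophic pattern can be killed without blocking the caller.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", _CHILD, pattern, text],
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    return None
```

A pattern such as `(a+)+$` against a long run of `a` followed by `!` backtracks exponentially in Python's `re` module, so a call like `safe_fullmatch("(a+)+$", "a" * 40 + "!", timeout=1.0)` gives up after the timeout instead of hanging the caller.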

I think there are two options, depending on the level of security we want to achieve and the urgency of bringing this to production:

  1. We just point to the current installation and it should just work(TM). But we'd need to perform a small migration afterwards.

To make things even more complicated, I want to say that the config is a ratio, so we can start by sending only 1% of these regexes to shellbox to make sure everything works fine and nothing explodes majestically, and then slowly increase it. If we hit shellbox limits in terms of resources, we can tune the ratio down.

I'd like to do that to at least make sure the setup works. We can, for example, send 10% to the main shellbox (and 90% to wdqs), wait until ingress is set up, and then simply move to that path and increase it to 100%. Would that work?
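The ratio-based rollout described above can be sketched as follows. This is only an illustration of the idea, not the actual extension code: the function name `pick_backend` and the backend labels are hypothetical, and the real config key is not named in this task.

```python
import random


def pick_backend(shellbox_ratio, rng=random):
    """Return "shellbox" with probability shellbox_ratio, otherwise "wdqs".

    Hypothetical helper: shellbox_ratio is a float between 0 and 1,
    for example 0.01 for the initial 1% rollout.
    """
    if not 0.0 <= shellbox_ratio <= 1.0:
        raise ValueError("shellbox_ratio must be between 0 and 1")
    # rng.random() is uniform in [0, 1), so a ratio of 0 never picks
    # shellbox and a ratio of 1 always does.
    return "shellbox" if rng.random() < shellbox_ratio else "wdqs"


if __name__ == "__main__":
    # Count how many of 1000 simulated regex checks go to each backend at 10%.
    counts = {"shellbox": 0, "wdqs": 0}
    for _ in range(1000):
        counts[pick_backend(0.10)] += 1
    print(counts)
```

Because every request is decided independently, tuning the ratio down takes effect immediately for new requests, which is what makes this kind of switch convenient for backing off when resource limits are hit.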

  • How stringent are the isolation needs?

I'm not entirely sure, tbh. My limited understanding was that all this effort is made in order to guard against malicious regex attacks in a more efficient way. So in that sense, more isolation might be preferable? But I'm really not the expert here, @Ladsgroup or @Lucas_Werkmeister_WMDE can probably give a more authoritative answer and correct the rubbish that I'm talking.

I don't think this service has strong isolation requirements, but Shellbox for Score/lilypond does. I would much rather have this be a separate shellbox install mostly so lilypond/ghostscript/lame are isolated.

I think there are two options, depending on the level of security we want to achieve and the urgency of bringing this to production:

  1. We just point to the current installation and it should just work(TM). But we'd need to perform a small migration afterwards.
  2. We wait for Serviceops to set up an ingress properly, which is next in line for us, and we do a proper separate shellbox deployment for this. It will take more time but will in the end cost less.

There is also option #3: set up another LVS endpoint for this separate Shellbox instance. That is not technically nice, but nothing is blocking us from doing it right now.

Change 709093 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/libs/Shellbox@master] pipeline: Publish plain Shellbox image for RPC usage

https://gerrit.wikimedia.org/r/709093

Change 709097 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] Add tokens and users for shellbox-constraints service

https://gerrit.wikimedia.org/r/709097

Change 709093 merged by jenkins-bot:

[mediawiki/libs/Shellbox@master] pipeline: Publish plain Shellbox image for RPC usage

https://gerrit.wikimedia.org/r/709093

Change 709104 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/deployment-charts@master] Add shellbox-constraints namespace

https://gerrit.wikimedia.org/r/709104

Change 709106 had a related patch set uploaded (by Legoktm; author: Legoktm):

[labs/private@master] Add tokens for shellbox-constraints service

https://gerrit.wikimedia.org/r/709106

Change 709106 merged by Legoktm:

[labs/private@master] Add tokens for shellbox-constraints service

https://gerrit.wikimedia.org/r/709106

Change 709108 had a related patch set uploaded (by Legoktm; author: Legoktm):

[labs/private@master] Add k8s shellbox and shellbox-constraints users

https://gerrit.wikimedia.org/r/709108

Change 709108 merged by Legoktm:

[labs/private@master] Add k8s shellbox and shellbox-constraints users

https://gerrit.wikimedia.org/r/709108

Change 709097 merged by Legoktm:

[operations/puppet@production] Add tokens and users for shellbox-constraints service

https://gerrit.wikimedia.org/r/709097

Change 709104 merged by jenkins-bot:

[operations/deployment-charts@master] Add shellbox-constraints namespace

https://gerrit.wikimedia.org/r/709104

Change 709114 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/deployment-charts@master] Add helmfile.d for shellbox-constraints

https://gerrit.wikimedia.org/r/709114

Change 709114 merged by jenkins-bot:

[operations/deployment-charts@master] Add helmfile.d for shellbox-constraints

https://gerrit.wikimedia.org/r/709114

$ curl https://staging.svc.eqiad.wmnet:4010/healthz
{
    "__": "Shellbox running",
    "pid": 9
}
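For scripted checks, the response above could be validated with something like the following Python sketch. The function name `is_healthy` is hypothetical, and treating the `"__"` field as the health signal is an assumption based only on the output shown here, not on a documented Shellbox contract.

```python
import json


def is_healthy(body):
    """Return True if a /healthz response body reports a running Shellbox."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    # Assumption: the "__" field carries the status message shown above.
    return isinstance(data, dict) and data.get("__") == "Shellbox running"


# Sample body copied from the curl output above.
sample = '{"__": "Shellbox running", "pid": 9}'
print(is_healthy(sample))
```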

Next steps are to generate TLS certs, deploy to eqiad/codfw clusters, and then set up LVS.

Change 709566 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] Add shellbox-constraints to LVS

https://gerrit.wikimedia.org/r/709566

Change 709567 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] service: Switch shellbox-constraints to lvs_setup

https://gerrit.wikimedia.org/r/709567

Change 709568 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] service: Switch shellbox-constraints to monitoring_setup

https://gerrit.wikimedia.org/r/709568

Change 709569 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] service: Switch shellbox-constraints to production

https://gerrit.wikimedia.org/r/709569

Change 709571 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/dns@master] Add shellbox-constraints.svc.{codfw,eqiad}.wmnet

https://gerrit.wikimedia.org/r/709571

Change 709572 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/dns@master] Add shellbox-constraints to discovery

https://gerrit.wikimedia.org/r/709572

Change 709566 merged by Legoktm:

[operations/puppet@production] Add shellbox-constraints to LVS

https://gerrit.wikimedia.org/r/709566

Change 709567 merged by Legoktm:

[operations/puppet@production] service: Switch shellbox-constraints to lvs_setup

https://gerrit.wikimedia.org/r/709567

Change 709571 merged by Legoktm:

[operations/dns@master] Add shellbox-constraints.svc.{codfw,eqiad}.wmnet

https://gerrit.wikimedia.org/r/709571

Change 709568 merged by Legoktm:

[operations/puppet@production] service: Switch shellbox-constraints to monitoring_setup

https://gerrit.wikimedia.org/r/709568

Change 709569 merged by Legoktm:

[operations/puppet@production] service: Switch shellbox-constraints to production

https://gerrit.wikimedia.org/r/709569

Change 709572 merged by Legoktm:

[operations/dns@master] Add shellbox-constraints to discovery

https://gerrit.wikimedia.org/r/709572

Change 709960 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] services_proxy: Add envoyproxy for shellbox-constraints

https://gerrit.wikimedia.org/r/709960

Change 710109 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/deployment-charts@master] mwdebug: Add shellbox-constraints envoyproxy

https://gerrit.wikimedia.org/r/710109

Change 709960 merged by Legoktm:

[operations/puppet@production] services_proxy: Add envoyproxy for shellbox-constraints

https://gerrit.wikimedia.org/r/709960

Change 710109 merged by jenkins-bot:

[operations/deployment-charts@master] mwdebug: Add shellbox-constraints envoyproxy

https://gerrit.wikimedia.org/r/710109

Legoktm claimed this task.
Legoktm updated the task description.

I'm going to close this as resolved as I believe everything is now set up on the k8s/SRE side of things, though there will be additional tuning needed as the rollout progresses, which is happening on T176312: Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions).

Change 711245 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] configmaster: Add shellbox-constraints to disc_desired_state

https://gerrit.wikimedia.org/r/711245

Change 711245 merged by Legoktm:

[operations/puppet@production] configmaster: Add shellbox-constraints to disc_desired_state

https://gerrit.wikimedia.org/r/711245