
Toolforge ingress: decide on final layout of north-south proxy setup
Closed, Resolved · Public

Description

In T228500: Toolforge: evaluate ingress mechanism we discussed several setups for the north-south traffic and proxy setup. By north-south I mean traffic between end users (the internet) and the pod containing the tool webservice.

Related things to decide:

  • Do we want to introduce $tool.$domain.org or not? My feeling is yes. Also, if we introduce this pattern, should we do it only for toolforge.org?
  • Do we want to introduce toolforge.org or not? My feeling is yes.
  • Will the legacy k8s be aware of the 2 things above? i.e., would we introduce either $tool.$domain.org or toolforge.org/$tool in the old k8s deployment? My feeling is that we don't want this, as it would be a lot of work that would only be valid for the compat/migration period between k8s deployments.
  • Will the web grid be aware of the things above? i.e., would we introduce either $tool.$domain.org or toolforge.org/$tool in the web grid? My feeling is that this can be done later, after the new k8s is already in place.
  • SSL termination

Will try to summarize here the different options:

image.png (diagram of the different options, 578 KB)

Diagram 0: the current setup. Dynamicproxy routes tools.wmflabs.org/$tool to the right backend (be it the web grid or the legacy k8s).
Diagram 1: we introduce a new proxy in front of both the current setup and the new k8s. This proxy knows how to route *.toolforge.org to the new k8s and tools.wmflabs.org/$tool to dynamicproxy.
Diagram 2: the new k8s acts as a proxy for the current setup, by means of the ingress. We can create an ingress rule to forward all tools.wmflabs.org/$tool traffic to dynamicproxy.
Diagram 3: proposed by @bd808, we update dynamicproxy to be in front of both the legacy setup and the new k8s.
Diagram 4: split setup. The current setup and the new k8s are totally separated. This is perhaps the simplest setup.

Event Timeline

aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

For the record, @bd808 mentioned another option: having dynamicproxy understand how to forward to the new k8s cluster.

I've been playing with option 2, and here are my tests:

Create a service and ingress object like the following:

root@toolsbeta-test-k8s-master-1:~# cat toolforge-legacy.yaml 
apiVersion: v1
kind: Service
metadata:
  name: toolforge-legacy
spec:
  type: ExternalName
  externalName: tools.wmflabs.org
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: toolforge-legacy
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: tools.wmflabs.org
    http:
      paths:
      - path: /
        backend:
          serviceName: toolforge-legacy
          servicePort: 80

Load it and try!
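Loading it is just a kubectl apply of the file shown above (a quick sketch; this assumes kubectl already points at the new cluster):

kubectl apply -f toolforge-legacy.yaml
kubectl get service/toolforge-legacy ingress/toolforge-legacy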

aborrero@toolsbeta-test-k8s-lb-01:~ $ curl localhost/openstack-browser -H "Host:tools.wmflabs.org" -L
<!DOCTYPE HTML>
[..]
          <a class="navbar-brand" href="/openstack-browser/">OpenStack browser</a>
        </div>
[..]

The nginx-ingress pod seems very happy processing this:

192.168.44.192 - [192.168.44.192] - - [04/Oct/2019:10:47:22 +0000] "GET /openstack-browser HTTP/1.1" 301 185 "-" "curl/7.52.1" 97 0.004 [default-toolforge-legacy-80] [] 172.16.6.39:80 185 0.004 301 3e16cd459c76f2542fb4d4409b4b0203
192.168.44.192 - [192.168.44.192] - - [04/Oct/2019:10:48:12 +0000] "GET /openstack-browser/project HTTP/1.1" 301 185 "-" "curl/7.52.1" 105 0.000 [default-toolforge-legacy-80] [] 172.16.6.39:80 185 0.000 301 f4d4fd5b98934828e74fd9ded97b6c10

This option seems pretty straightforward. The client receives a 301 redirect to HTTPS from dynamicproxy:

< HTTP/1.1 301 Moved Permanently
< Server: openresty/1.15.8.1
< Location: https://tools.wmflabs.org/openstack-browser

So in this setup we may not even care about handling SSL for the legacy toolforge (web grid + legacy k8s).

Conclusions: this option seems simple and straightforward, unless I'm overlooking something else.

My first inclination is that #3 is the most straightforward and supportable, but I know I am biased a bit because I am most accustomed to supporting that sort of setup in Toolforge. It also can simply be a matter of teaching software that is already in our laps how to route to "new" k8s vs old and doesn't require a new domain name scheme that may be unpopular and damaging to many tools (we don't know yet, but I remember the trouble I caused when I changed schemas on the wiki replicas--I know for many the change will be very exciting and good).

If we want technology that other people develop for and support (like Kubernetes) to be our future infrastructure foundation, option 2 makes more sense because then our custom stuff can be more easily deprecated since it is behind that. It also would allow us to start thinking of Kubernetes technologies as more of the "Toolforge platform", using things like CRDs and operators as customizations (even ones other people develop like Open Policy Agent), etc.--or at least our glue hacks can run inside k8s, which keeps them live for us, lol.

If we did option 2 and introduced the new domain, but we also allowed path-based routing for those who needed it (I know that's trickier) it might be a good balance. I think the noisiest voices on the topic want to switch to subdomains, but I'm thinking of things like CORS rules and wiki restrictions that may bite tool authors and be easier to handle in paths than subdomains. A lot of that is broken by changing domain ANYWAY, so maybe that doesn't matter. I'm just trying to get my thoughts down somewhere.

Mentioned in SAL (#wikimedia-cloud) [2019-10-08T12:27:56Z] <arturo> created VM toolsbeta-test-proxy-01 for testing stuff related to T234037

Mentioned in SAL (#wikimedia-cloud) [2019-10-08T14:14:54Z] <arturo> created puppet prefix toolsbeta-test-proxy for testing stuff related to T234037

We had a meeting yesterday 2019-10-10 and we decided to try option 3 first, with fallback to option 2.

The general front proxy will be dynamicproxy, which will keep more or less the same setup but include a fall-through route to the new k8s deployment.
Also, we will try introducing the toolforge.org domain (T234617) if we manage to address T235252: Toolforge: SSL support for new domain toolforge.org in time.

Mentioned in SAL (#wikimedia-cloud) [2019-10-14T12:26:04Z] <arturo> created security group arturo-test-dynamicproxy-backend to tests stuff related to T234037

Ok, I've been playing with the dynamicproxy nginx+lua components and I have a working setup. I disabled SSL/https in my tests until we handle T235252: Toolforge: SSL support for new domain toolforge.org.

This is more or less the diagram of the setup:

image.png (diagram of the setup, 91 KB)

Right now, the Lua code has a fall-through mechanism to direct requests by default to the admin tool, which gracefully handles the "Tool not found" situation.
In the setup we agreed on to accommodate the new cluster, this mechanism needs to change, because the fall-through target is now the new k8s cluster. This is probably something to handle in T234032: Toolforge ingress: create a default landing page for unknown/default URLs.
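Just to sketch the idea (the real work belongs to T234032, and the service name and port below are hypothetical): the replacement could be a catch-all Ingress with no host rule, whose default backend points at a friendly "tool not found" service.

# Hypothetical sketch only; not the actual T234032 implementation.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: default-fallback
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  # default backend: any request not matched by a tool's own Ingress
  # gets sent to this (hypothetical) landing-page service
  backend:
    serviceName: fourohfour
    servicePort: 8000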

Anyway, the changes in the Lua code mostly amount to no longer handling the fall-through there:

--- 1.lua	2019-10-14 17:59:16.429212877 +0200
+++ 2.lua	2019-10-14 17:59:29.221265917 +0200
@@ -40,37 +40,13 @@
 end
 
 if not route then
-    -- No routes defined for this uri, try the default (admin) prefix instead
-    rest = ngx.var.uri
-    routes_arr = red:hgetall('prefix:admin')
-    if routes_arr then
-        local routes = red:array_to_hash(routes_arr)
-        for pattern, backend in pairs(routes) do
-            if ngx.re.match(rest, pattern) then
-                route = backend
-                break
-            end
-        end
-    end
+    -- No routes defined for this uri, hope nginx can handle this! (new k8s cluster?)
+    ngx.exit(ngx.OK)
 end
 
 -- Use a connection pool of 256 connections with a 32s idle timeout
 -- This also closes the current redis connection.
 red:set_keepalive(1000 * 32, 256)
 
-if route then
-    ngx.var.backend = route
-    ngx.exit(ngx.OK)
-else
-    -- Oh noes!  Even the admin prefix is dead!
-    -- Fall back to the static site
-    if rest then
-        -- the URI had a slash, so the user clearly expected /something/
-        -- there.  Fail because there is no registered webservice.
-        ngx.exit(503)
-    else
-        ngx.var.backend = ''
-        ngx.exit(ngx.OK)
-    end
-end
-
+ngx.var.backend = route
+ngx.exit(ngx.OK)

The change on the nginx side is very small: we simply set a backend if Lua couldn't find one. This backend is the haproxy of the new k8s cluster.

[..]
        set $backend '';

        access_by_lua_file /etc/nginx/lua/urlproxy.lua;

        if ($backend = '') {
            # no backend was found in redis, send this to the new k8s cluster
            set $backend 'http://toolsbeta-k8s-master.toolsbeta.wmflabs.org:80';
        }

        proxy_pass $backend;
[..]

I decided to target haproxy instead of a worker node directly for a few reasons:

  • the list of backend servers used by haproxy is maintained in Hiera.
  • we need some way to know which worker nodes we have, and to live-probe them. I think haproxy works fine for this.
  • we are using haproxy for the new k8s apiservers anyway, so this reuses a piece of infra we already have.
  • I considered storing the info about the worker nodes in Redis (or in nginx somehow), but I don't think that would be very elegant.

Results: the same nginx handles both domains and URI schemes:

aborrero@tools-test-proxy-01:~$ curl -L localhost:80 -H "Host:hello.toolforge.org" 2>/dev/null ; echo | head
Hello World!
aborrero@tools-test-proxy-01:~$ curl -L localhost:80/openstack-browser -H "Host:tools.wmflabs.org" 2>/dev/null | head
<!DOCTYPE HTML>
<html lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta http-equiv="Content-Language" content="en-us">
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="initial-scale=1.0, user-scalable=yes, width=device-width">
    <meta http-equiv="imagetoolbar" content="no">
    <meta name="robots" content="noindex">

Worth noting that all my tests were conducted on a tools-proxy server running Debian Buster (T235059).

TL;DR: this works just fine. I will prepare patches, documentation and a follow-up plan, since this seems to be reaching a reasonable shape.

Change 543135 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: proxy: adjust setup for the new k8s cluster

https://gerrit.wikimedia.org/r/543135

Change 543137 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: adjust ports in the ingress setup

https://gerrit.wikimedia.org/r/543137

Change 543137 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: adjust ports in the ingress setup

https://gerrit.wikimedia.org/r/543137

Change 544191 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: rename k8s::apilb role/profile to k8s::haproxy

https://gerrit.wikimedia.org/r/544191

aborrero claimed this task.

I consider this done.

Change 544191 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: rename k8s::apilb role/profile to k8s::haproxy

https://gerrit.wikimedia.org/r/544191

@aborrero I have lots of local hacks on the toolsbeta bastion right now, so please don't enable puppet, but!
I have a good example of a tool running via webservice in the new setup and it isn't working with the ingress.

The request below should work, but it doesn't:

toolsbeta.test@toolsbeta-sgebastion-04:~$ curl http://toolsbeta.wmflabs.org/test/
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.14.2</center>
</body>
</html>

My ingress object is:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
  creationTimestamp: "2019-11-07T22:20:14Z"
  generation: 2
  labels:
    name: test
    toolforge: tool
    tools.wmflabs.org/webservice: "true"
    tools.wmflabs.org/webservice-version: "1"
  name: test
  namespace: tool-test
  resourceVersion: "2678283"
  selfLink: /apis/extensions/v1beta1/namespaces/tool-test/ingresses/test
  uid: 9b030ed5-fe66-4592-8d13-bb36e6d3dbe4
spec:
  rules:
  - host: toolsbeta.wmflabs.org
    http:
      paths:
      - backend:
          serviceName: test
          servicePort: 8000
        path: /test
status:
  loadBalancer: {}

So this is very cool. We have a proper test case on the new cluster with a proxy to check it. Now we just need to figure out why that didn't work.
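For anyone following along, a couple of quick checks on the new cluster that would help narrow this down (a sketch; the names come from the Ingress object above):

kubectl -n tool-test get endpoints test
kubectl -n tool-test describe ingress test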

The 502 message seems to be produced by the front proxy (dynamicproxy). I'm taking a look.

Confirmed by tcpdump: no packets reach haproxy, so the traffic isn't reaching the k8s ingress at all. I'll keep investigating.
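Something along these lines on the haproxy node would show whether traffic arrives (the interface name and port filter here are assumptions, not the exact command used):

tcpdump -n -i eth0 'tcp port 80'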

I see this tool is deployed in the new k8s cluster, but somehow the front proxy (dynamicproxy) is trying to deliver the connection to the old k8s setup:

2019/11/08 11:05:02 [error] 31239#31239: *2502 connect() failed (113: No route to host) while connecting to upstream, client: 172.16.3.240, server: , request: "GET /test/ HTTP/1.1", upstream: "http://192.168.29.227:8000/test/", host: "toolsbeta.wmflabs.org"

However I see the test tool running in the new k8s cluster:

root@toolsbeta-test-k8s-control-1:~# kubectl get pods -n tool-test
NAME                   READY   STATUS    RESTARTS   AGE
test-5d5f87b66-2hfvv   1/1     Running   0          12h

Confirmed that Redis is storing data for this:

127.0.0.1:6379> HGETALL prefix:test
1) ".*"
2) "http://192.168.29.227:8000"

However this tool is not running in the old k8s cluster:

root@toolsbeta-k8s-master-01:~# kubectl get pods --all-namespaces
NAMESPACE   NAME                     READY     STATUS    RESTARTS   AGE
admin       admin-1850377006-u6foo   1/1       Running   3          1y

Why does Redis have this information? Could the tool be running on the grid?

Also, another question: if a tool is running in one of the legacy systems (old k8s, grid) and is therefore in Redis, and then the tool moves to the new k8s, do we have a process that removes the Redis information?

Something is weird; the test tool in toolsbeta seems to be running somewhere:

aborrero@toolsbeta-sgebastion-04:~ 15s $ sudo become test
toolsbeta.test@toolsbeta-sgebastion-04:~$ webservice status
Your webservice of type python is running
toolsbeta.test@toolsbeta-sgebastion-04:~$ cat service.manifest
# This file is used by toollabs infrastructure.
# Please do not edit manually at this time.
backend: kubernetes
distribution: debian
version: 2
web: python
toolsbeta.test@toolsbeta-sgebastion-04:~$ webservice start
Your job is already running
toolsbeta.test@toolsbeta-sgebastion-04:~$ webservice stop
Stopping webservice...............
toolsbeta.test@toolsbeta-sgebastion-04:~$ webservice status
Your webservice is not running

After stopping the webservice, Redis still thinks it should store info about it:

127.0.0.1:6379> HGETALL prefix:test
1) ".*"
2) "http://192.168.29.227:8000"

Let's delete that information by hand!

127.0.0.1:6379> HDEL prefix:test .*
(integer) 1
127.0.0.1:6379> HGETALL prefix:test
(empty list or set)

And try again:

toolsbeta.test@toolsbeta-sgebastion-04:~$ curl toolsbeta.wmflabs.org/test
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>openresty/1.15.8.1</center>
</body>
</html>

This error message is produced by nginx-ingress in the new k8s cluster! I confirmed there are now packets flowing for this tool between dynamicproxy and haproxy.

Apparently webservice failed to delete the Service object in the old cluster (which I'll look into today). So that explains the first error at least.

BTW, if you are set to the new cluster, use /usr/bin/kubectl, which you probably already noticed, but just in case.

It works now!

toolsbeta.test@toolsbeta-sgebastion-04:~$ curl toolsbeta.wmflabs.org/test/
Hello World, from Toolsbeta!

Ingress logs for our reference. In this version of webservice, I have to edit the ingress after it is launched so the host is "toolsbeta.wmflabs.org" (the UPDATE below). That will not be true after https://gerrit.wikimedia.org/r/c/operations/software/tools-webservice/+/549613. I see I should also switch things up in the code so the ingress is created last and deleted first. The leaking Service objects in the old cluster are interesting. I will try to figure that out, if possible. It may be a problem with the API versions used in pykube.

I1108 15:39:05.350076       6 event.go:258] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"tool-test", Name:"test", UID:"81f75096-a8f7-469b-b9d0-244981433249", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"2803724", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress tool-test/test
W1108 15:39:08.689924       6 controller.go:878] Service "tool-test/test" does not have any active Endpoint.
I1108 15:39:08.690033       6 controller.go:133] Configuration changes detected, backend reload required.
I1108 15:39:09.000514       6 controller.go:149] Backend successfully reloaded.
I1108 15:39:52.141743       6 controller.go:133] Configuration changes detected, backend reload required.
I1108 15:39:52.141813       6 event.go:258] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"tool-test", Name:"test", UID:"81f75096-a8f7-469b-b9d0-244981433249", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"2803832", FieldPath:""}): type: 'Normal' reason: 'UPDATE' Ingress tool-test/test
I1108 15:39:52.325561       6 controller.go:149] Backend successfully reloaded
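For reference, one way to make the host edit mentioned above without an interactive editor (just a sketch; the actual workflow here may have used kubectl edit):

kubectl -n tool-test patch ingress test --type=json \
  -p '[{"op": "replace", "path": "/spec/rules/0/host", "value": "toolsbeta.wmflabs.org"}]'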

Change 543135 merged by Bstorm:
[operations/puppet@production] toolforge: proxy: adjust setup for the new k8s cluster

https://gerrit.wikimedia.org/r/543135

Now that I've deployed this, it seems that the toolforge "main page" is now the fourohfour tool. I'm wondering if that's "wrong"...

It seems like we forgot about the issue of the front page.

Mentioned in SAL (#wikimedia-cloud) [2019-12-17T19:20:54Z] <bstorm_> deployed the changes to the live proxy to enable the new kubernetes cluster T234037

All fixed. @bd808 set a redirect of / to /admin for now, and I adjusted the Shinken monitor.
We will probably need a bit more nuance in the solution once we start using subdomains.
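For completeness, the redirect is conceptually just a tiny nginx rule along these lines (a sketch of the idea, not necessarily the exact deployed change; the status code is an assumption):

# send the bare front page to the admin tool
location = / {
    return 302 /admin;
}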