Create DataHub containers with deployment pipeline
Closed, Resolved · Public

Description

We will need to use the deployment pipeline to create suitable containers for use in Kubernetes for the DataHub MVP

The four containers are:

  • Metadata Server (GMS)
  • DataHub Frontend
  • MCE Consumer Job
  • MAE Consumer Job

They all come from the same codebase: https://github.com/linkedin/datahub/

There are some example Dockerfiles for all of these services, which we can use as inspiration

Event Timeline

There are a very large number of changes, so older changes are hidden.
BTullis moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.

I am beginning to look at this task now.

akosiaris added subscribers: dduvall, thcipriani, jeena.

Adding the Release Engineering team for their awareness and help with the integration/config repo that will be required to enable the pipeline on the repo. @BTullis the first step is to clone the repo to gerrit and add the .pipeline config files.

Thanks @akosiaris - I've done several things to start kicking this off:

  • I've requested a new Gerrit repository analytics/datahub with a fork of https://github.com/linkedin/datahub/ - via the mechanism here: https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests
  • I've also bumped my request for group access to analytics in gerrit, so that I can push to it.
  • I've also requested that a long-lived branch named wmf be created in this repository, where we can track our changes to the codebase and rebase against master as upstream changes are brought in.

Once we get to this point, I'll be able to create the .pipeline/blubber.yaml and .pipeline/config.yaml files in this branch.
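For the record, here is a rough sketch of the kind of blubber.yaml I have in mind for one of the services (the Metadata Server). The base image, JDK packages, gradle task and artifact path are placeholders at this stage, not the final values:

```
version: v4
base: docker-registry.wikimedia.org/bullseye   # placeholder base image
lives:
  in: /srv/datahub
variants:
  build:
    apt:
      packages: [openjdk-11-jdk-headless]      # placeholder JDK package
    copies: [local]
    builder:
      command: ["./gradlew", ":metadata-service:war:build", "-x", "test"]  # illustrative gradle task
  production:
    apt:
      packages: [openjdk-11-jre-headless]
    copies:
      - from: build
        source: /srv/datahub/metadata-service/war/build/libs/war.war      # illustrative artifact path
        destination: war.war
    entrypoint: ["java", "-jar", "war.war"]
```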

Linking this tutorial for the deployment pipeline in case you find it helpful: https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Migration/Tutorial

The repository has now been created and I have permission to push to it. I have created the wmf branch and will begin work on the pipeline configuration now:
https://gerrit.wikimedia.org/g/analytics/datahub/+/refs/heads/wmf
Many thanks for everyone's help so far.

@BTullis I saw the task passing by; what is the goal of forking LinkedIn's datahub repo? Are you only interested in https://github.com/linkedin/datahub/search?l=Dockerfile&q=docker ?

If so, we could use the https://gerrit.wikimedia.org/r/admin/repos/operations/docker-images/production-images repository and build those images directly from there. I used it for Istio/Knative/KServe etc.; it works really well.

@elukey - Oh right, yes that looks really useful. So this would be the intermediate step from here, right? https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Production_images
Then I specify this new production image in my Blubberfile.

Simpler than that - you can add your Dockerfiles (including one usually called build, which pulls the GitHub repo and builds the Go code etc.) and push your Docker images directly to the registry without any Blubber config or extra repo. Then, in the deployment-charts repo, where you'll define the helmfile/helm config to deploy, you'll reference those images.

Blubber is useful, in my understanding, if you have to create new Docker images for services like eventgate, api-gateway etc., where you define the code and want to wrap everything in a container/image. In this case, and correct me if I am wrong, you just need to:

  1. build the go code (or whatever) for the services outlined in the description.
  2. put the code in dedicated docker images and build them.

You can check the production-images repo for an example; if you want to see Go code, check the kserve config. There is a build Docker image that pulls the GitHub repo and builds the binaries, and then those are copied to other dedicated images.

@akosiaris is there any problem with avoiding the pipeline/blubber config and going through production-images instead?

It was built to enforce good practices regarding Dockerfiles for services that will be running in production (it's extremely easy to write a bad Dockerfile; the plethora of guides out there on how to write a good Dockerfile makes that abundantly clear). It abstracts away things like correct file and directory ownerships, correct POSIX users, proper layering to avoid cache-busting situations and much more. [1]

It's also a requirement for deploying anything via the Deployment Pipeline, and since DataHub is going to be deployed (even if temporarily) via the pipeline, it needs to use it.

[1] https://wikitech.wikimedia.org/wiki/Blubber/Idea

@akosiaris I am a little confused then; I didn't use any of it for Istio/Knative/KServe etc. - those were all services already wrapped in Docker images (which of course were vetted and modified when added to production-images).

In this case, what is the difference from a service like Istio (deployment-wise, I mean)? I am asking to understand what to do; I am completely fine with changing the current way of doing things, but I am a little confused :)

Fair enough. Let me try and answer that (and we should probably document the answer in Wikitech).

We can distinguish 3 different planes[1] of components as far as the wikikube and staging Kubernetes clusters go. The ml-serve cluster is heavily influenced by those clusters, so the categorization still stands, though I do expect that at some point we might have to re-evaluate some things.

  • Control-plane. That's kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, kube-proxy and lastly etcd. In our clusters those are never deployed via containers (let alone inside Kubernetes itself). For those, even discussing the Deployment Pipeline doesn't make sense.
  • Workload/services plane. This is your everyday service. It's deployed by the Deployment Pipeline (and thus uses OCI containers and, of course, Blubber). Typical examples would be in-house built services like echostore, sessionstore, cxserver, similar-users etc. However, nothing says that they ABSOLUTELY have to be written in-house. We also have e.g. apertium or zotero, which aren't written or maintained (primarily at least) in house. In zotero's case we just cloned the repo to gerrit and added the proper patches to enable the pipeline. When updates are in order, they are just merged on top of the branch. Deployment of these services is done by each respective deployer. In fact, that is the entire point of the pipeline: every deployer is unblocked and unencumbered and can deploy new versions of the software without relying on other teams (they do rely on other teams for the initialization, but that's a one-off cost).
  • Cluster level components. Those are components whose role is clearly distinct from the control plane, but which are also crucial to a (well) functioning cluster. They offer important functions like TLS demarcation, that is both termination and initiation (envoy), DNS (CoreDNS), event logging (eventrouter), networking (Calico), logging (rsyslog) or metrics (statsd-exporter). Those might or might not be deployed via containers; the latter is the exception right now (rsyslog).

Deployment of these cluster level components might also be per cluster (e.g. Calico) or per workload/service (e.g. envoy for the services proxy, or statsd-exporter).

In the former case, the deployment MUST be done by an SRE. In this case, the pipeline's benefits are far fewer, as deployment rights aren't given to non-SREs. There are also some other inherent dangers, which are the reasons we chose not to couple those components to the pipeline, e.g. the deployment pipeline requires that the wikikube Kubernetes cluster is functional (as Blubberoid is hosted on it).

For the latter case, that is cluster level components that end up being deployed per service (and aren't truly cluster level after all)... well, see my [1] note. While strictly speaking every deployer can deploy those, in essence SREs are the only ones doing so, and up to now we've been using the same approach as the former case for simplicity's sake. We could re-evaluate that practice, but that's irrelevant to the current discussion, I think.

Istio clearly belongs in the cluster level components plane.

DataHub, on the other hand, clearly belongs in the workload/services plane.

[1] By planes I don't mean airplanes, but more like planes of existence. So, for example, in DnD mythology we would have the material plane, but also the ethereal plane and the shadow plane; in Norse mythology, Asgard/Midgard etc. I am mostly joking, but this is just to point out that there is separation between the "planes", but also subtle ways in which the "barriers" might break down every now and then.

Change 762454 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@master] Add work in progress blubber files for deployment pipeline

https://gerrit.wikimedia.org/r/762454

Thanks both. This is really helpful.
I have begun the deployment pipeline work and started with 4 blubber.yaml files.
Here's the work in progress, although it's still at a relatively early stage.
https://gerrit.wikimedia.org/r/c/analytics/datahub/+/762454

As I understand it I'll need these 4 blubber files, because each pipeline only pushes one production container per blubber run.

However, I can use several pipelines within the same codebase in config.yaml, as sketched below: https://wikitech.wikimedia.org/wiki/PipelineLib/Reference#Pipelines
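For illustration, here is a rough sketch of how those pipelines might look in .pipeline/config.yaml. The per-service blubberfile layout, pipeline names and stage names are placeholders rather than the final configuration:

```
pipelines:
  gms:
    blubberfile: gms/blubber.yaml        # assumed layout: one blubber.yaml per service
    stages:
      - name: production
        build: production
        publish:
          image: true
  frontend:
    blubberfile: frontend/blubber.yaml
    stages:
      - name: production
        build: production
        publish:
          image: true
  # ...and likewise for the mce-consumer and mae-consumer pipelines.
```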

Change 762950 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Add blubber files for deployment pipeline

https://gerrit.wikimedia.org/r/762950

Change 763207 had a related patch set uploaded (by Btullis; author: Btullis):

[integration/config@master] Define pipelines for datahub

https://gerrit.wikimedia.org/r/763207

I'm pretty happy with the progress on the blubber files and the pipeline config, but I'm awaiting review.

There are a few things that I still have to work out:

  • Will I have to use the web proxy as part of the build? If so, how will this be configured (gradle, wget, yarn etc.)? One possible approach is sketched after this list.
  • I am not sure how to restrict pipeline activation to the wmf branch.
  • Or maybe we should run the pipeline on all branches, but only publish the images from the wmf branch.
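For the web proxy question, one possible approach would be to set the standard proxy environment variables on the build variant in Blubber. The proxy address below is an assumption, and gradle may need the equivalent JVM system properties (-Dhttp.proxyHost etc.) rather than these environment variables:

```
variants:
  build:
    runs:
      environment:
        # Assumed proxy address; wget and yarn honour these variables, while
        # gradle may need matching -Dhttp.proxyHost/-Dhttp.proxyPort settings.
        http_proxy: "http://webproxy.eqiad.wmnet:8080"
        https_proxy: "http://webproxy.eqiad.wmnet:8080"
        no_proxy: "wikimedia.org,wmnet"
```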

Change 763207 merged by jenkins-bot:

[integration/config@master] Define pipelines for datahub

https://gerrit.wikimedia.org/r/763207

Change 763724 had a related patch set uploaded (by Btullis; author: Btullis):

[integration/config@master] Improve the datahub pipelines

https://gerrit.wikimedia.org/r/763724

Change 763724 merged by jenkins-bot:

[integration/config@master] Improve the datahub pipelines

https://gerrit.wikimedia.org/r/763724

Change 762454 abandoned by Btullis:

[analytics/datahub@master] Add work in progress blubber files for deployment pipeline

Reason:

Superseded

https://gerrit.wikimedia.org/r/762454

Change 762950 merged by jenkins-bot:

[analytics/datahub@wmf] Add configuration for deployment pipeline

https://gerrit.wikimedia.org/r/762950

I'm happy with the deployment pipeline part now, so I'm calling this task done.

The containers are built as a test on every push to analytics/datahub, but they are only published to the registry with a latest tag when changes are merged to the wmf branch.

There is still one file that is downloaded from GitHub during the image preparation stage, but I will look at how to improve this in a later iteration.

I could probably reduce the size of the images by copying the build artifacts out of the container during the build step, then copying them back in during a prepare step, but I don't think that this is essential. A rough sketch of the idea is below.
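One way to get a similar size reduction within Blubber itself would be to have the production variant copy only the finished artifacts from the build variant, rather than everything the build produced. The build command and artifact paths here are illustrative, not the actual values from the blubber files:

```
variants:
  build:
    copies: [local]
    builder:
      command: ["./gradlew", "build", "-x", "test"]   # illustrative build command
  production:
    # Copy only the built artifact, so gradle caches, sources and other
    # build-time files are not carried into the published image.
    copies:
      - from: build
        source: /srv/datahub/datahub-frontend/build/distributions/datahub-frontend.zip  # illustrative path
        destination: datahub-frontend.zip
```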

Actually, it's not quite done. The containers are all built with the same name, so I'll need to modify that.

Change 763795 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Update image names and the tag

https://gerrit.wikimedia.org/r/763795

Change 763795 merged by jenkins-bot:

[analytics/datahub@wmf] Update image names and the tag

https://gerrit.wikimedia.org/r/763795

Change 767507 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Correct the location of the MAE and MCE consumer jars

https://gerrit.wikimedia.org/r/767507

Change 767507 merged by Btullis:

[analytics/datahub@wmf] Correct the location of the MAE and MCE consumer jars

https://gerrit.wikimedia.org/r/767507

Change 767725 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Change the CWD of the datahub-frontend

https://gerrit.wikimedia.org/r/767725

Change 767725 abandoned by Btullis:

[analytics/datahub@wmf] Change the directory structure of datahub-frontend

Reason:

Blubber doesn't like extracting to the root directory

https://gerrit.wikimedia.org/r/767725

Change 767743 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Update the path of the user.props file

https://gerrit.wikimedia.org/r/767743

Change 767743 merged by Btullis:

[analytics/datahub@wmf] Update the path of the user.props file

https://gerrit.wikimedia.org/r/767743

Change 767783 had a related patch set uploaded (by Btullis; author: Btullis):

[integration/config@master] Add three more container build pipelines to datahub

https://gerrit.wikimedia.org/r/767783

Change 767783 merged by jenkins-bot:

[integration/config@master] Add three more container build pipelines to datahub

https://gerrit.wikimedia.org/r/767783

Change 767837 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Fix the entrypoint

https://gerrit.wikimedia.org/r/767837

Change 767837 merged by Btullis:

[analytics/datahub@wmf] Fix the entrypoint

https://gerrit.wikimedia.org/r/767837

Change 767847 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Edit the entrypoint and setup scripts

https://gerrit.wikimedia.org/r/767847

Change 767847 merged by Btullis:

[analytics/datahub@wmf] Edit the entrypoint and setup scripts

https://gerrit.wikimedia.org/r/767847

Change 773479 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Tag using date-stage as well as latest

https://gerrit.wikimedia.org/r/773479

Change 773479 merged by Btullis:

[analytics/datahub@wmf] Tag using date-stage as well as latest

https://gerrit.wikimedia.org/r/773479

Change 773500 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Use git commit SHA for image label

https://gerrit.wikimedia.org/r/773500

Change 773500 merged by Btullis:

[analytics/datahub@wmf] Use git commit SHA for image label

https://gerrit.wikimedia.org/r/773500

Change 778224 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Add wmf-certificates to each datahub container

https://gerrit.wikimedia.org/r/778224

Change 778224 merged by jenkins-bot:

[analytics/datahub@wmf] Add wmf-certificates to each datahub container

https://gerrit.wikimedia.org/r/778224

Change 779888 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Add WMF customization of datahub

https://gerrit.wikimedia.org/r/779888

Change 779888 merged by jenkins-bot:

[analytics/datahub@wmf] Add WMF customization of datahub

https://gerrit.wikimedia.org/r/779888

Change 900310 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Experimental refactor of the datahub container build process

https://gerrit.wikimedia.org/r/900310

Change 900310 abandoned by Btullis:

[analytics/datahub@wmf] Experimental refactor of the datahub container build process

Reason:

Experiment concluded

https://gerrit.wikimedia.org/r/900310