Create DataHub containers with deployment pipeline
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	BTullis
	Feb 10 2022, 11:36 AM

Description

We will need to use the deployment pipeline to create suitable containers for use in Kubernetes for the DataHub MVP

The four containers are:

Metadata Server (GMS)
DataHub Frontend
MCE Consumer Job
MAE Consumer Job

They all come from the same codebase: https://github.com/linkedin/datahub/

There are some example Dockerfiles for all of these services, which we can use as inspiration

Details

Subject	Repo	Branch	Lines +/-
Experimental refactor of the datahub container build process	analytics/datahub	wmf	+66 -878 K
Add WMF customization of datahub	analytics/datahub	wmf	+504 -23
Add wmf-certificates to each datahub container	analytics/datahub	wmf	+38 -4
Use git commit SHA for image label	analytics/datahub	wmf	+7 -0
Tag using date-stage as well as latest	analytics/datahub	wmf	+7 -7
Edit the entrypoint and setup scripts	analytics/datahub	wmf	+7 -3
Fix the entrypoint	analytics/datahub	wmf	+1 -1
Add three more container build pipelines to datahub	integration/config	master	+15 -0
Update the path of the user.props file	analytics/datahub	wmf	+1 -1
Change the directory structure of datahub-frontend	analytics/datahub	wmf	+5 -5
Correct the location of the MAE and MCE consumer jars	analytics/datahub	wmf	+2 -2
Update image names and the tag	analytics/datahub	wmf	+8 -4
Add configuration for deployment pipeline	analytics/datahub	wmf	+274 -6
Add work in progress blubber files for deployment pipeline	analytics/datahub	master	+87 -0
Improve the datahub pipelines	integration/config	master	+24 -13
Define pipelines for datahub	integration/config	master	+23 -0

Related Objects
Search...

Status	Assigned	Task
Resolved	BTullis	T299910 Data Catalog MVP
Resolved	BTullis	T301453 Create DataHub containers with deployment pipeline
Resolved	BTullis	T301454 Define the Helm charts and helmfile deployments for Datahub
Resolved	JMeybohm	T303049 New Service Request: DataHub
Resolved	BTullis	T301460 Update DNS for the DataHub MVP services
Resolved	BTullis	T301886 Upload required datahub dependencies to Archiva

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

BTullis added a project: Data-Catalog.Feb 10 2022, 11:51 AM

I am beginning to look at this task now.

BTullis triaged this task as High priority.Feb 11 2022, 10:22 AM

BTullis added a subtask: T301454: Define the Helm charts and helmfile deployments for Datahub.

Adding Release Engineering team for their awareness and help with the integration/config repo that will be required to enable the pipeline on the repo. @BTullis first step is to clone the repo to gerrit and add .pipeline config files.

BTullis moved this task from Backlog to In Progress on the Data-Catalog board.Feb 11 2022, 2:31 PM

Thanks @akosiaris - I've done several things to starting kicking this off:

I've requested a new Gerrit repository analytics/datahub with a fork of https://github.com/linkedin/datahub/ - via the mechanism here: https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests
I've also bumped my request for group access to analytics in gerrit, so that I can push to it.
I've also requested that a long-lived branch named wmf branch be created in this repository, where we can track our changes to the codebase and rebase against master as the upstream changes are brought in.

Once we get to this point, I'll be able to create the .pipeline/blubber.yaml and .pipeline/config.yaml files in this branch.

BTullis edited parent tasks, added: T299910: Data Catalog MVP; removed: T301385: Deploy DataHub in MVP phase.Feb 11 2022, 2:51 PM

Linking this tutorial for the deployment pipeline in case you find it helpful: https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Migration/Tutorial

The repository has now been created and I have permission to push to it:
I have created the wmf branch I will begin work on the pipeline configuration now.
https://gerrit.wikimedia.org/g/analytics/datahub/+/refs/heads/wmf
Many thanks for everyone's help so far.

@BTullis I saw the task passing by, what is the goal of forking the linkedin's datahub repo? Are you interested only in https://github.com/linkedin/datahub/search?l=Dockerfile&q=docker ?

If so we could use the https://gerrit.wikimedia.org/r/admin/repos/operations/docker-images/production-images repository, and build those images directly from there. I used it for Istio/Knative/Kserve/etc.., it works really well.

@elukey - Oh right, yes that looks really useful. So this would be the intermediate step from here, right? https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Production_images
Then I specify this new production image in my Blubberfile.

In T301453#7707085, @BTullis wrote:

@elukey - Oh right, yes that looks really useful. So this would be the intermediate step from here, right? https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Production_images
Then I specify this new production image in my Blubberfile.

Simpler than that - you can add your docker files (including creating one called usually build, that pulls the github repo and build go code etc..) and push your docker images directly to the registry without any blubber config or extra repo. Then in the deployment-charts repo, where you'll define the helmfile/helm config to deploy, you'll reference those images.

Blubber is useful, in my understanding, if you have to create new Docker images for services like eventgate api-gateway etc.., where you define the code and you want to wrap everything in a container/image. In this case, and correct me if I am wrong, you just need to:

build the go code (or whatever) for the services outlined in the description.
put the code in dedicated docker images and build them.

You can check the production-images repo for example, if you want to see go code check the kserve config. There is a build docker image that pulls the github repo and that builds the binaries, and then those are copied to other dedicated images.

@akosiaris is there any problem in avoiding the pipeline/blubber config and going through production-images instead?

In T301453#7707100, @elukey wrote:

In T301453#7707085, @BTullis wrote:

@elukey - Oh right, yes that looks really useful. So this would be the intermediate step from here, right? https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Production_images
Then I specify this new production image in my Blubberfile.

Simpler than that - you can add your docker files (including creating one called usually build, that pulls the github repo and build go code etc..) and push your docker images directly to the registry without any blubber config or extra repo. Then in the deployment-charts repo, where you'll define the helmfile/helm config to deploy, you'll reference those images.

Blubber is useful, in my understanding, if you have to create new Docker images for services like eventgate api-gateway etc.., where you define the code and you want to wrap everything in a container/image. In this case, and correct me if I am wrong, you just need to:

It was built to enforce good practices regarding Dockerfiles for services that will be running in production (it's extremely easy to write a bad Dockerfile, the plethora of guides out there in how to write a good Dockerfile makes that abundantly clear). It abstracts away things like correct file and directory owerships, correct posix users, proper layering to avoid cache busting situations and much more. [1]

It's also a requirement for deploying anything in the Deployment Pipeline and since Datahub is going to be (even temporarily) deployed in the pipeline, it needs to use it.

[1] https://wikitech.wikimedia.org/wiki/Blubber/Idea

@akosiaris I am a little confused then, I didn't use any of it for istio/knative/kserve/etc.., those were all services already wrapped in Docker images (that of course were vetted and modified when added to production-images).

In this case, what is the difference with a service like Istio? (deployment wise I mean). I am asking to understand what to do, I am completely fine to change the current way of doing things, but I am a little confused :)

In T301453#7707174, @elukey wrote:

@akosiaris I am a little confused then, I didn't use any of it for istio/knative/kserve/etc.., those were all services already wrapped in Docker images (that of course were vetted and modified when added to production-images).

In this case, what is the difference with a service like Istio? (deployment wise I mean). I am asking to understand what to do, I am completely fine to change the current way of doing things, but I am a little confused :)

Fair enough. Let me try and answer that (and we should probably document the answer in Wikitech).

We can distinguish 3 different planes[1] of components as far as the wikikube and staging kubernetes clusters go. The ml-serve cluster is heavily influenced by those clusters so the categorization still stands. That being said, I do expect that at some point we might have to re-evaluate some things though.

Control-plane. That's kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, kube-proxy and lastly etcd. In our clusters those are never deployed via containers (or even more inside kubernetes itself). For those, even discussing the Deployment Pipeline doesn't make sense.
Workload/services plane. This is your everyday service. It's deployed by the Deployment Pipeline (and thus uses OCI containers and of course Blubber). Typical examples would be in-house built services like echostore, sessionstore, cxserver, similar-users etc. However, nothing says that they ABSOLUTEY HAVE to be written in-house. We also have e.g. apertium or zotero, which aren't written or maintained (primarily at least) in house. In zotero's case we just cloned the repo to gerrit and added the proper patches to enable the pipeline. When updates are in order, they are just merged on top of the branch. Deployment of these services is done by each respective deployer. The entire point of the pipeline is exactly that in fact. That every deployer is unblocked and unencumbered and can deploy new versions of the software without relying on other teams (they do rely on other teams for the initialization but that's a once off cost).

Cluster level components. Those are components whose role is clearly distinct from the control plane, but also are crucial to a (well) functioning cluster. They offer important functions like TLS demarcation, that is both termination and initiation, (envoy), DNS (CoreDNS), event logging (eventrouter), networking (Calico) or logging (rsyslog) or metrics (statsd-exporter). Those might or might not be deployed via containers, however the latter is an exception right now (rsyslog).

Deployment of these cluster level components might also be per cluster (e.g. Calico) or per workload/service (e.g. envoy for the services proxy, or statsd-exporter).

In the former case, the deployment MUST happen by an SRE. In this case, the pipeline's benefits are way fewer as deployment rights aren't given to non SREs. There's also some other inherent dangers which are the reasons we chose to not couple those components to the pipeline, e.g. the deployment pipeline requires that the wikikube kubernetes cluster is functional (as Blubberoid is hosted on it).

For the latter case, that is cluster level components that well end being deployed per service (and aren't truly cluster level after all)... well see my [1] note. While strictly speaking every deployer can deploy those, in essense SREs are the only ones doing so and up to now we 've been using the same approach as the former case for simplicity's sake. We could re-evaluate that practice, but that's irrelevant to the current discussion I think.

Istio clearly belongs in the cluster level components plane

Datahub on the other hand, clearly belongs on the workload/services plane.

[1] By planes I don’t mean airplanes but more like planes of existence. So for example on the DnD mythologies we would have the material plane, but also the etherial plane and th shadow plane. In norse mythology, asgard/midgard etc. I am mostly joking, but this is just to point out there's separation between the "planes" but also subtle ways where the "barriers" might be breaking down every now and then

Change 762454 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@master] Add work in progress blubber files for deployment pipeline

https://gerrit.wikimedia.org/r/762454

gerritbot added a project: Patch-For-Review.Feb 14 2022, 3:07 PM

Thanks both. This is really helpful.
I have begun work on the deployment pipeline work and started with 4 blubber.yaml files.
Here's the work in progress, although it's still at a relatively early stage.
https://gerrit.wikimedia.org/r/c/analytics/datahub/+/762454

As I understand it I'll need these 4 blubber files, because each pipeline only pushes one production container per blubber run/

However, I can use several pipelines within the same codebas within config.yaml. https://wikitech.wikimedia.org/wiki/PipelineLib/Reference#Pipelines

odimitrijevic moved this task from Incoming (new tickets) to Security & Governance on the Data-Engineering board.Feb 14 2022, 4:54 PM

Change 762950 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Add blubber files for deployment pipeline

https://gerrit.wikimedia.org/r/762950

Change 763207 had a related patch set uploaded (by Btullis; author: Btullis):

[integration/config@master] Define pipelines for datahub

https://gerrit.wikimedia.org/r/763207

I'm pretty happy with the progress on the blubber files and the pipeline config, but I'm awaiting review.

There are a few things that I still have to work out:

Will I have to use the web proxy as part of the build. If so, how will this be configured. (gradle, wget, yarn etc)
I am not sure how to restrict the pipeline activation to the wmf branch.
Or maybe we should be running the pipeline on all branches, but only publishing the images on the wmf branch.

BTullis closed subtask T301886: Upload required datahub dependencies to Archiva as Resolved.Feb 17 2022, 10:24 AM

Change 763207 merged by jenkins-bot:

[integration/config@master] Define pipelines for datahub

https://gerrit.wikimedia.org/r/763207

The first container build is happening now: https://integration.wikimedia.org/ci/job/datahub-pipeline-datahub-gms/2/console

Change 763724 had a related patch set uploaded (by Btullis; author: Btullis):

[integration/config@master] Improve the datahub pipelines

https://gerrit.wikimedia.org/r/763724

Change 763724 merged by jenkins-bot:

[integration/config@master] Improve the datahub pipelines

https://gerrit.wikimedia.org/r/763724

Mentioned in SAL (#wikimedia-releng) [2022-02-18T13:44:18Z] <hashar> Reloading Zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/763724 T301453

Change 762454 abandoned by Btullis:

[analytics/datahub@master] Add work in progress blubber files for deployment pipeline

Reason:

Superseded

https://gerrit.wikimedia.org/r/762454

Change 762950 merged by jenkins-bot:

[analytics/datahub@wmf] Add configuration for deployment pipeline

https://gerrit.wikimedia.org/r/762950

BTullis mentioned this in R3101:ec2888d19548: Add configuration for deployment pipeline.Feb 18 2022, 2:31 PM

I'm happy with the deployment pipeline part now, so I'm calling this task done.

The containers are built as a test on every push to analytics/datahub but only when merged to the wmf branch are they published to the registry with a latest tag.

There is still one file that is downloaded from GitHub during the image preparation stage, but I will look at how to improve this in a later iteration.

I could probably reduce the size of the images by copying the bulid artifacts out of the container during the build step, then copying them back in during a prepare step, but I don't think that this is essential.

BTullis moved this task from In Progress to Done on the Data-Engineering-Kanban board.Feb 18 2022, 2:45 PM

Actually, it's not quite done. The containers are all built with the same name, so I'll need to modify that.

Change 763795 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Update image names and the tag

https://gerrit.wikimedia.org/r/763795

Change 763795 merged by jenkins-bot:

[analytics/datahub@wmf] Update image names and the tag

https://gerrit.wikimedia.org/r/763795

BTullis mentioned this in R3101:01e266c1bc3e: Update image names and the tag.Feb 18 2022, 6:09 PM

BTullis moved this task from In Progress to Done on the Data-Engineering-Kanban board.Feb 22 2022, 12:20 PM

BTullis closed this task as Resolved.Feb 23 2022, 5:32 PM

Change 767507 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Correct the location of the MAE and MCE consumer jars

https://gerrit.wikimedia.org/r/767507

Change 767507 merged by Btullis:

[analytics/datahub@wmf] Correct the location of the MAE and MCE consumer jars

https://gerrit.wikimedia.org/r/767507

BTullis mentioned this in R3101:9966d6eb4d0a: Correct the location of the MAE and MCE consumer jars.Mar 2 2022, 2:20 PM

Change 767725 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Change the CWD of the datahub-frontend

https://gerrit.wikimedia.org/r/767725

Change 767725 abandoned by Btullis:

[analytics/datahub@wmf] Change the directory structure of datahub-frontend

Reason:

Blubber doesn't like extracting to the root directory

https://gerrit.wikimedia.org/r/767725

Change 767743 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Update the path of the user.props file

https://gerrit.wikimedia.org/r/767743

Change 767743 merged by Btullis:

[analytics/datahub@wmf] Update the path of the user.props file

https://gerrit.wikimedia.org/r/767743

BTullis mentioned this in R3101:56bd3972d800: Update the path of the user.props file.Mar 3 2022, 12:20 PM

Change 767783 had a related patch set uploaded (by Btullis; author: Btullis):

[integration/config@master] Add three more container build pipelines to datahub

https://gerrit.wikimedia.org/r/767783

Change 767783 merged by jenkins-bot:

[integration/config@master] Add three more container build pipelines to datahub

https://gerrit.wikimedia.org/r/767783

Change 767837 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Fix the entrypoint

https://gerrit.wikimedia.org/r/767837

Change 767837 merged by Btullis:

[analytics/datahub@wmf] Fix the entrypoint

https://gerrit.wikimedia.org/r/767837

BTullis mentioned this in R3101:87010f39e569: Fix the entrypoint.Mar 3 2022, 4:47 PM

Change 767847 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Edit the entrypoint and setup scripts

https://gerrit.wikimedia.org/r/767847

Change 767847 merged by Btullis:

[analytics/datahub@wmf] Edit the entrypoint and setup scripts

https://gerrit.wikimedia.org/r/767847

BTullis mentioned this in R3101:8f1991b594eb: Edit the entrypoint and setup scripts.Mar 3 2022, 6:03 PM

Change 773479 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Tag using date-stage as well as latest

https://gerrit.wikimedia.org/r/773479

Change 773479 merged by Btullis:

[analytics/datahub@wmf] Tag using date-stage as well as latest

https://gerrit.wikimedia.org/r/773479

BTullis mentioned this in R3101:b50516ceff39: Tag using date-stage as well as latest.Mar 24 2022, 12:31 PM

Change 773500 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Use git commit SHA for image label

https://gerrit.wikimedia.org/r/773500

Change 773500 merged by Btullis:

[analytics/datahub@wmf] Use git commit SHA for image label

https://gerrit.wikimedia.org/r/773500

BTullis mentioned this in R3101:59b615a2b6db: Use git commit SHA for image label.Mar 24 2022, 1:57 PM

Change 778224 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Add wmf-certificates to each datahub container

https://gerrit.wikimedia.org/r/778224

Change 778224 merged by jenkins-bot:

[analytics/datahub@wmf] Add wmf-certificates to each datahub container

https://gerrit.wikimedia.org/r/778224

BTullis mentioned this in R3101:b0e4ef798e03: Add wmf-certificates to each datahub container.Apr 7 2022, 9:25 AM

BTullis closed subtask T301454: Define the Helm charts and helmfile deployments for Datahub as Resolved.Apr 13 2022, 10:36 AM

BTullis mentioned this in R3101:9f9311d21a8d: Add WMF customization of datahub.Apr 13 2022, 11:30 AM

BTullis mentioned this in R3101:415bf1dc729c: Add WMF customization of datahub.Apr 13 2022, 2:34 PM

Change 779888 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Add WMF customization of datahub

https://gerrit.wikimedia.org/r/779888

Change 779888 merged by jenkins-bot:

[analytics/datahub@wmf] Add WMF customization of datahub

https://gerrit.wikimedia.org/r/779888

BTullis mentioned this in R3101:db8b5c547e17: Add WMF customization of datahub.Apr 13 2022, 3:26 PM

BTullis mentioned this in R3101:18d3e4c23c19: Add WMF customization of datahub.May 17 2022, 5:06 PM

BTullis mentioned this in R3101:f8918e627a8a: Add WMF customization of datahub.Jun 10 2022, 4:11 PM

BTullis mentioned this in R3101:4fd900822bbd: Add WMF customization of datahub.Aug 26 2022, 11:55 AM

BTullis mentioned this in R3101:db8c787ed0ea: Add WMF customization of datahub.Oct 31 2022, 5:23 PM

Stevemunene mentioned this in R3101:60d666784d30: Add WMF customization of datahub.Mar 8 2023, 9:54 AM

Maintenance_bot removed a project: Patch-For-Review.Mar 8 2023, 10:10 AM

Stevemunene mentioned this in R3101:9593040a6d09: Add WMF customization of datahub.Mar 8 2023, 11:27 AM

Change 900310 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Experimental refactor of the datahub container build process

https://gerrit.wikimedia.org/r/900310

gerritbot added a project: Patch-For-Review.Mar 16 2023, 1:13 PM

Change 900310 abandoned by Btullis:

[analytics/datahub@wmf] Experimental refactor of the datahub container build process

Reason:

Experiment concluded

https://gerrit.wikimedia.org/r/900310

Maintenance_bot removed a project: Patch-For-Review.Jun 29 2023, 1:11 PM

BTullis mentioned this in R3101:742192118a22: Add WMF customization of datahub.Jul 5 2023, 11:13 AM

Create DataHub containers with deployment pipelineClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

Create DataHub containers with deployment pipeline
Closed, ResolvedPublic
Actions

Related Objects
Search...