
[DRAFT][RfC] Deployment of python applications in production
Open, Medium, Public

Description

This document is a draft. It is shared to gather some early feedback.

Introduction

Deploying applications to an environment with certain security and reproducibility requirements can be challenging. This is also true for python applications, whose "standard" deployment mechanism of using a virtualenv plus pip to download dependencies from a requirements file is both insecure and prone to multiple kinds of failure. Those are the reasons why that deployment method is forbidden on the Wikimedia cluster and the gold standard for deploying python applications is creating debian packages. There are situations, though, in which the use of a virtualenv is unavoidable, e.g. because the application requires a more recent library than what the distribution offers, or because it uses libraries that would conflict with what is on the system. For such cases there is no standard set of rules, although some have been worked out along the way.
This document sets out a series of guidelines on how to deploy python applications to the Wikimedia Foundation servers. It focuses on combining ease of build for the developer, ease of deployment for the deployer, reproducibility, and security.

Which deployment method to choose

By default, any python application is expected to be packaged as a debian package and to use distribution-provided libraries. Any exception needs to be thoroughly justified, as it places a burden on both the ops team and the deployers of said software. The case can be made for a different deployment strategy when a piece of software meets all, or most, of the following criteria:

  1. Has a proven need for features in a library version that is not available for the system, and whose API has changed enough that backporting the library as a debian package could break other things.
  2. Has more than N libraries that are not currently available as debian packages and that we would need to package ourselves (TODO: figure out a sensible value for N - I would say 3 is the magic number)
  3. Needs to be deployed as a service (so depool - deploy - restart - test - pool, canary deployments, etc.)

Application deployment and build process

If a debian package is not an option, the following prescriptions apply:

  1. It must be distributed via scap3, using a <package-name>/deploy repository
  2. It must provide wheels for the software and any external library that needs to be installed
  3. It must store the wheels on an artifact repository, and serve them in production via git-fat or git-lfs.
  4. It should build said wheels from a frozen-requirements.txt file provided in the deployment repository, via a standardized CI job (a minimal sketch of the build and install steps follows this list)
  5. It should sanitize said wheels, making them reproducible, by using a tool like strip-nondeterminism
  6. It should be installed to a dedicated virtual environment that will be located at /srv/deployment/<package-name>/venv
  7. If deployment via scap3 is used just to have service coordination/depool/restart, the virtual environment should make use of system libraries as much as possible.
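
To make these prescriptions concrete, here is a minimal sketch of the two main steps, assuming a hypothetical package called example-app and the paths described above; the real glue would live in the standardized CI job and in the scap checks, not in the application itself.

  # Hypothetical helper; the package name, the paths and the split between a CI
  # build step and an on-host install step are assumptions, not requirements.
  import subprocess
  import sys
  import venv

  PACKAGE = "example-app"  # hypothetical package name
  VENV_DIR = "/srv/deployment/%s/venv" % PACKAGE
  ARTIFACTS = "/srv/deployment/%s/deploy/artifacts/stretch" % PACKAGE


  def build_wheels():
      # CI step: build a wheel for every dependency pinned in
      # frozen-requirements.txt and drop them into the artifacts directory
      # (which is then stored via git-fat or git-lfs).
      subprocess.run(
          [sys.executable, "-m", "pip", "wheel",
           "--wheel-dir", ARTIFACTS,
           "--requirement", "frozen-requirements.txt"],
          check=True,
      )


  def install():
      # Promote-stage step: (re)create the virtualenv and install only the
      # pre-built wheels, never reaching out to PyPI.
      venv.create(VENV_DIR, with_pip=True)
      subprocess.run(
          [VENV_DIR + "/bin/pip", "install",
           "--no-index", "--find-links", ARTIFACTS,
           "--requirement", "frozen-requirements.txt"],
          check=True,
      )

Installing with --no-index and --find-links pointed at the artifacts directory is what prevents production hosts from ever downloading anything from PyPI.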

Structure of the deploy repository

In order to avoid reinventing the wheel (pun intended) every time, the deploy repositories should adhere to a strict structure, so that all the glue needed for building and deploying the application can be standardized and scaffolding scripts can be created. This removes the burden on the developer of learning the details of how the build process works, or even of implementing their own flavour of it.

The repository should look as follows (a hypothetical layout is sketched after the lists below):

  1. It must include a frozen-requirements.txt file, where all dependencies are listed, with frozen versions.
  2. It must include an artifacts directory where wheels should be stored via git-fat or git-lfs in subdirectories indicating which distribution they were built for.
  3. It must include a scap check to run in the promote stage that will refresh the virtualenv if needed, and will install all the wheels to it.
  4. It should not include a build directory; in fact, it should ignore any build directory that it might contain.

If the source code is developed or modified by us:

  1. It must include a reference to the git tag or tree-ish we want to release - either in a src_version file or as a git submodule
  2. The source code must either be cloned into or located in the src subdirectory.
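
For concreteness, a deploy repository following the rules above could look roughly like this; the package name, the per-distribution wheel subdirectories and the exact scap file layout are illustrative assumptions rather than requirements.

  example-app/deploy
  ├── frozen-requirements.txt   # all (non system-provided) dependencies, frozen
  ├── src_version               # tag/tree-ish to release (or src/ as a git submodule)
  ├── src/                      # source code, when developed or modified by us
  ├── artifacts/                # wheels, stored via git-fat or git-lfs
  │   ├── jessie/
  │   └── stretch/
  ├── scap/                     # scap3 configuration, including the promote-stage check
  └── .gitignore                # ignores any build directory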

Example implementation

Most of the concepts expressed here (but definitely not all!) are implemented in the operations/docker-images/docker-pkg-deploy repository; in particular, its Makefile.build can be used as a basis for setting up a standardized CI job for building and uploading the artifacts.

Event Timeline

I would argue that including the source of the software as a submodule should be optional. The specific use case I have in mind is the deployment of mapzen, where all sources are external and all modules are downloaded.

Which deployment method to choose

I would also mention cases in which the upstream package or its dependencies release quite often, such as web apps.

Structure of the deploy repository

  1. It must include a frozen-requirements.txt file, where all dependencies *not* provided by the system are listed, with frozen versions.
  • How can we, at build time, know which dependencies and which version of them are available in the final systems?
  • What about multi-OS deployments, that will surely happen when migrating to a new OS like the jessie->stretch migration? The system versions will be surely different.
  • If the frozen-requirements.txt file has a pinned version (==1.2.3) for each dependency, as I expect it to have, it will be very unlikely that the specific version will be available as a system package, even for the more common ones like pyyaml.

So it seems to me that either we lean on system packages as much as possible, by not pinning to specific versions in the frozen-requirements.txt file and using only min/max version constraints, or we go with a strictly pinned set of requirements and do not use system packages.
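
To make the two options concrete, the same entry in frozen-requirements.txt would look quite different in the two styles (the versions below are purely illustrative):

  # strictly pinned, ignoring whatever the system provides
  pyyaml==3.12

  # min/max pinned, leaving room to match a system-provided version
  pyyaml>=3.11,<4.0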

Structure of the deploy repository

  1. It must include the source code for the software in a src subdirectory, which should be a git submodule

We also need to consider cases in which we deploy an external application that is not developed or patched by us. I don't think we really need to import the whole repository in that case; we could consider the src directory optional and support a dependency-only case (e.g. netbox).

Example implementation

I think we should mention that CI must run the tests on the built environment with all the dependencies installed. Here again, if we go in the direction of system packages, there is the problem of ensuring that we test with the same set of system packages.

General comments

The main problem that this approach doesn't solve, unfortunately, and for which I don't think there is an easy solution, is security.
Neither PyPI nor GitHub releases give any guarantee that the code was not tampered with, or that no spurious commit was included in the release.

For the first part we could implement some validation when any of these are available:

  • md5 of the PyPI package (always available)
  • the PyPI package has been released with a GPG signature
  • the GitHub release also includes a GPG signature of the tar.gz package
  • the git tag of the release was signed with a GPG key

I'm doing all of the above for Cumin, but I know that not many projects are currently doing it.
This also assumes, of course, that we keep an up-to-date list of valid GPG keys that can sign that specific package; just verifying that there is a valid signature is probably not enough.
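
As a rough sketch, the checksum part of such validation could look like the snippet below; the expected digest would have to be recorded when the requirements are frozen, and GPG verification would be a separate step (for example by shelling out to gpg --verify).

  # Hypothetical integrity check for a downloaded artifact; the file name and
  # the source of the expected digest are assumptions, not part of the RfC.
  import hashlib


  def verify_digest(path, expected_hex, algorithm="md5"):
      # md5 is what PyPI always publishes; a stronger algorithm can be used
      # whenever upstream provides one.
      digest = hashlib.new(algorithm)
      with open(path, "rb") as artifact:
          for chunk in iter(lambda: artifact.read(8192), b""):
              digest.update(chunk)
      return digest.hexdigest() == expected_hex


  if not verify_digest("artifacts/PyYAML-3.12.tar.gz",
                       "digest-recorded-at-freeze-time"):
      raise RuntimeError("artifact does not match the recorded checksum")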

The second part is much trickier and could hardly be done without manually checking the commit history and content. One could also argue that even Debian packages might not protect against this, because it depends on how thoroughly the debdiff is checked, which also depends on its size.

I would argue that including the source of the software as a submodule should be optional. The specific use case I have in mind is the deployment of mapzen, where all sources are external and all modules are downloaded.

Well if we don't include a submodule with the sources, I would expect us to clone a repository at a certain revision at the very least, or to create a safe way to download artifacts.

What did you have in mind for mapzen?

Which deployment method to choose

I would also mention cases in which the upstream package or its dependencies release quite often, such as web apps.

Uhm, I would rather say that things that need coordinated deployment would need to use scap3 *and* system python packages, which is, practically speaking, the same thing as a debian package.

  • How can we, at build time, know which dependencies and which version of them are available in the final systems?

packages.debian.org? CI? I would expect us to be able to manage that.

  • What about multi-OS deployments, that will surely happen when migrating to a new OS like the jessie->stretch migration? The system versions will be surely different.

If versions are different and incompatible, we can assume we *don't* have that library available on all systems.

  • If the frozen-requirements.txt file has a pinned version (==1.2.3) for each dependency, as I expect it to have, it will be very unlikely that the specific version will be available as a system package, even for the more common ones like pyyaml.

So it seems to me that either we lean on system packages as much as possible, by not pinning to specific versions in the frozen-requirements.txt file and using only min/max version constraints, or we go with a strictly pinned set of requirements and do not use system packages.

You know very well what advantages using system libraries gives us, security first of all; but this is indeed a good point. I would say we should change the wording to be less stringent and add a note about pinned requirements.

Structure of the deploy repository

  1. It must include the source code for the software in a src subdirectory, which should be a git submodule

We also need to consider cases in which we deploy an external application that is not developed or patched by us. I don't think we really need to import the whole repository in that case; we could consider the src directory optional and support a dependency-only case (e.g. netbox).

You mean you want to download the main application from pypi too? I don't really see that being very useful - I expect us to have to do tailor-made changes to the sources.

Example implementation

I think we should mention that CI must run the tests on the built environment with all the dependencies installed. Here again, if we go in the direction of system packages, there is the problem of ensuring that we test with the same set of system packages.

I agree about CI, although it could be a tricky goal. We won't be able to use tox if we want to run tests in the built environment (or maybe we can with some magic), so it's not clear to me how we can do it properly and in a general way.
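
One option, assuming the tests ship with the sources, could be to skip tox and invoke the test runner directly with the interpreter of the built virtualenv, along these lines:

  # Hypothetical CI step: run the suite with the venv's own interpreter, so the
  # tests see exactly the wheels (and possibly system packages) that production
  # will use. The path and the use of pytest are assumptions.
  import subprocess

  subprocess.run(
      ["/srv/deployment/example-app/venv/bin/python", "-m", "pytest", "src/"],
      check=True,
  )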

General comments

The main problem that this approach doesn't solve, unfortunately, and for which I don't think there is an easy solution, is security.

Yeah, there is no real solution there. Debian guarantees the identity and integrity of its packages properly; PyPI does not, and there is no way around that.

For the first part we could implement some validation when any of these are available:

  • md5 of the PyPI package (always available)
  • the PyPI package has been released with a GPG signature
  • the GitHub release also includes a GPG signature of the tar.gz package
  • the git tag of the release was signed with a GPG key

I would say these are implementation details for the common Makefile; I'm not sure we should add these details about GitHub/PyPI to the RfC.

Also, I think you're missing the main security benefit we're going to lose: automatic security patches for all libraries. That's the real short- and long-term security threat we don't manage. I am particularly worried about abandonware lying around on our servers without anyone fixing it.

There are solutions to this problem, but all of them are painful and still require someone to follow up on the service/software. I still think these considerations could go into a "security considerations" section.

Well if we don't include a submodule with the sources, I would expect us to clone a repository at a certain revision at the very least, or to create a safe way to download artifacts.

What did you have in mind for mapzen?

I would expect us to only have a deploy repo, which we can clone to our deployment server at a specific revision. But since all mapzen code is available on pypi and we have no reason to modify it, I would expect to download all of it from there. We might actually be talking about the same thing here...

You mean you want to download the main application from pypi too? I don't really see that being very useful - I expect us to have to do tailor-made changes to the sources.

For what we're looking at for maps, I'm not expecting any local-only code changes. The only time we'd be likely to want to install from a different source is testing stuff.

Ottomata triaged this task as Medium priority. Jan 16 2018, 8:10 PM