
To build/update this deploy repo, first use pip to install all dependencies as wheels into the artifacts/ directory; this is done by the build_wheels.sh script. Then make a commit to include the new files in artifacts/.

./build_wheels.sh
git add artifacts
git commit -m 'Updating wheels for jupyterhub and dependencies'
git review
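
Internally, the core of build_wheels.sh amounts to a single pip invocation along these lines (a sketch; the exact requirements file name is an assumption):

pip3 wheel --wheel-dir artifacts/ --requirement frozen-requirements.txt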

create_virtualenv.sh is run by Puppet during first installation, and should be run manually whenever the built dependencies are updated. On puppetized notebook servers, run it as:

./create_virtualenv.sh /srv/jupyterhub/venv

create_virtualenv.sh will also install any global Jupyter kernels that all users should have access to, configured in the same way. These kernels are installed from the pre-made kernels in the kernels/ directory. See kernels/README.md for more information.
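
In outline, such a script boils down to something like the following (a sketch, assuming the wheels live in artifacts/ and the pre-made kernels in kernels/; the real script may differ):

venv=${1:?usage: create_virtualenv.sh <venv_path>}
python3 -m venv "$venv"
# Install jupyterhub and its dependencies offline, from the wheels committed to this repo
"$venv"/bin/pip install --no-index --find-links artifacts/ jupyterhub
# Install the pre-made global kernels so every user gets the same set
for kernel in kernels/*/; do
    "$venv"/bin/jupyter kernelspec install --sys-prefix "$kernel"
done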

NOTE: We need Apache Toree 0.2.0+ for Spark 2 support, but this was not available via pip when the repository was created. The file https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0-incubating-rc5/toree-pip/toree-0.2.0.tar.gz was downloaded, and a frozen requirement wheel was built from it. This has not changed for the Stretch kernels; for Buster we currently use Toree 0.3.0, provided via pip wheels.
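
The frozen wheel can be (re)built from that tarball with pip, e.g. (a sketch):

pip3 wheel --wheel-dir artifacts/ \
    https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0-incubating-rc5/toree-pip/toree-0.2.0.tar.gz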

The spark_*_{scala,sparkr,sql} kernels are all Apache Toree kernels. They were originally created and then edited with the following commands. We chose not to use the Apache Toree Python Spark kernel, as it does not work as well as the regular IPython one.

The spark_*_pyspark kernels are regular IPython kernels that run a PySpark shell.
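
Such a kernel is just a kernel.json that launches ipykernel with the PySpark environment set, roughly like this (a minimal sketch; the display name, memory settings, and exact paths, in particular the py4j zip name, are illustrative):

{
  "display_name": "PySpark - YARN",
  "language": "python",
  "argv": ["/usr/bin/python3", "-m", "ipykernel", "-f", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/usr/lib/spark2",
    "PYSPARK_PYTHON": "/usr/bin/python3",
    "PYTHONPATH": "/usr/lib/spark2/python:/usr/lib/spark2/python/lib/py4j-src.zip",
    "PYTHONSTARTUP": "/usr/lib/spark2/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn --driver-memory 2g pyspark-shell"
  }
}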

NOTE-2: The following commands were run, but the resulting kernel.json files were edited afterwards to make the display_name values more readable. Some logo-*.png files were also added.

NOTE-3: When you run the following commands, keep two things in mind:

  1. the kernel's target directory name is derived from "kernel_name", so try to keep it in line with the existing directories under kernels/.
  2. the kernel's display name (namely what people see in the Jupyter UI's dropdown) needs to follow the convention used for the other kernels, which means editing the generated kernel.json file (see the sketch after this list).
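
For example, to fix up the display name after installation, edit the generated kernel.json in place. A sketch using jq (not bundled with this repo), where the kernel directory name and the new display name are illustrative:

kernel_json=$venv/share/jupyter/kernels/jupyter_spark_local_scala/kernel.json
jq '.display_name = "Spark Local (Scala)"' "$kernel_json" > "$kernel_json.tmp" \
    && mv "$kernel_json.tmp" "$kernel_json"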
# profile.sh provides shared settings, including base_dir
source ./profile.sh
# Target virtualenv: first argument, or the default alongside this repo
venv=${1:-$(realpath ${base_dir}/../jupyterhub-venv)}


# Spark Local kernels
$venv/bin/jupyter toree install --replace \
--python_exec /usr/bin/python3 \
--spark_home="/usr/lib/spark2" \
--log-level=INFO \
--kernel_name "Jupyter Spark Local" \
--spark_opts="--name='Jupyter Spark Local' --master local[4] --driver-memory 6g --jars hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-job.jar,hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar" \
--interpreters="Scala"

$venv/bin/jupyter toree install --replace \
--python_exec /usr/bin/python3 \
--spark_home="/usr/lib/spark2" \
--log-level=INFO \
--kernel_name "Jupyter SparkSQL Local" \
--spark_opts="--name='Jupyter SparkSQL Local' --master local[4] --driver-memory 6g --jars hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-job.jar,hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar" \
--interpreters="SQL"

$venv/bin/jupyter toree install --replace \
--python_exec /usr/bin/python3 \
--spark_home="/usr/lib/spark2" \
--log-level=INFO \
--kernel_name "Jupyter SparkR Local" \
--spark_opts="Jupyter SparkR Local' --master local[4] --driver-memory 6g" \
--interpreters="SparkR"

# Spark YARN kernels
$venv/bin/jupyter toree install --replace \
--python_exec /usr/bin/python3 \
--spark_home="/usr/lib/spark2" \
--log-level=INFO \
--kernel_name "Jupyter Spark" \
--spark_opts="--name='Jupyter Spark' --master=yarn --driver-memory 2g --executor-memory 8g --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=64 --jars hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-job.jar,hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar" \
--interpreters="Scala"

$venv/bin/jupyter toree install --replace \
--python_exec /usr/bin/python3 \
--spark_home="/usr/lib/spark2" \
--log-level=INFO \
--kernel_name "Jupyter Spark Large" \
--spark_opts="--name='Jupyter Spark Large' --master=yarn --driver-memory 4g --executor-memory 8g --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=128 --jars hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-job.jar,hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar" \
--interpreters="Scala"

$venv/bin/jupyter toree install --replace \
--python_exec /usr/bin/python3 \
--spark_home="/usr/lib/spark2" \
--log-level=INFO \
--kernel_name "Spark YARN" \
--spark_opts="--name='Jupyter SparkSQL' --master=yarn --driver-memory 2g --executor-memory 8g --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=64 --jars hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-job.jar,hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar" \
--interpreters="SQL"

$venv/bin/jupyter toree install --replace \
--python_exec /usr/bin/python3 \
--spark_home="/usr/lib/spark2" \
--log-level=INFO \
--kernel_name "Jupyter SparkR" \
--spark_opts="--name='Jupyter SparkR' --master=yarn --driver-memory 2g --executor-memory 8g --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=64" \
--interpreters="SparkR"