To build/update this deploy repo, first use pip to install all dependencies as wheels into the wheels/ directory. This is done by the build_wheels.sh script. Then make a commit to include new files in artifacts/.
./build_wheels.sh
git add artifacts
git commit -m 'Updating wheels for jupyterhub and dependencies'
git review
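For reference, here is a minimal sketch of what a script like build_wheels.sh typically does, assuming dependencies are pinned in a frozen requirements file. The file name and pip options are illustrative, not taken from the actual script; the sketch is written to a temp file and syntax-checked only.

```shell
# Hypothetical sketch of build_wheels.sh (illustrative, not the real script).
cat > /tmp/build_wheels_sketch.sh <<'EOF'
#!/bin/bash
set -e
# Build every pinned dependency as a wheel into wheels/, so deploy hosts
# never need to reach PyPI at install time.
pip3 wheel --wheel-dir wheels/ --requirement frozen-requirements.txt
EOF
bash -n /tmp/build_wheels_sketch.sh && echo "syntax OK"
```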
create_virtualenv.sh will be run either by Puppet (during first installation) or manually whenever build dependencies are updated. On puppetized notebook servers, run it as:
./create_virtualenv.sh /srv/jupyterhub/venv
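A rough sketch of what create_virtualenv.sh might do with that path argument (the real script also installs the global kernels described below). The pip options and package name here are illustrative assumptions; the sketch is written to a temp file and syntax-checked only.

```shell
# Hypothetical sketch of create_virtualenv.sh (illustrative, not the real script).
cat > /tmp/create_virtualenv_sketch.sh <<'EOF'
#!/bin/bash
set -e
# Target path (e.g. /srv/jupyterhub/venv) is the first argument.
venv=${1:?usage: create_virtualenv.sh /path/to/venv}
python3 -m venv "$venv"
# Install only from the locally committed wheels, never from PyPI.
"$venv/bin/pip" install --no-index --find-links wheels/ jupyterhub
EOF
bash -n /tmp/create_virtualenv_sketch.sh && echo "syntax OK"
```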
create_virtualenv.sh will also install any global Jupyter kernels that all users should have access to, configured identically for everyone. These kernels are installed from the pre-made kernels in the kernels/ directory. See kernels/README.md for more information.
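The mechanics of that kernel installation can be sketched as follows. Jupyter discovers kernels by directory under the venv's share/jupyter/kernels/ path; everything here (the demo venv path, the fake kernelspec) is an illustrative stand-in, not the actual script.

```shell
# Illustrative only: copy each pre-made kernelspec directory into the venv's
# kernel search path. The demo venv and demo kernel are fabricated.
set -e
venv=/tmp/demo-jupyterhub-venv
mkdir -p /tmp/demo-kernels/demo_kernel
echo '{"display_name": "Demo Kernel", "argv": ["true"]}' \
    > /tmp/demo-kernels/demo_kernel/kernel.json
for kernel in /tmp/demo-kernels/*/; do
    name=$(basename "$kernel")
    dest="$venv/share/jupyter/kernels/$name"
    mkdir -p "$dest"
    cp "$kernel"/* "$dest"/
done
ls "$venv/share/jupyter/kernels"
```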
NOTE: We need Apache Toree 0.2.0+ for Spark 2 support, but this was not available via pip when the repository was created. The file https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0-incubating-rc5/toree-pip/toree-0.2.0.tar.gz was downloaded, and a frozen requirement wheel was built from it. This has not changed for the Stretch kernels; for Buster we currently use Toree 0.3.0 from pip wheels.
The spark_*_{scala,sparkr,sql} kernels are all Apache Toree kernels. They were originally created and then edited with the following commands. We chose not to use the Apache Toree Python Spark kernel, as it does not work as well as the regular IPython one.
The spark_*_pyspark kernels are regular IPython kernels that run a pyspark shell.
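A sketch of what such a pyspark kernelspec might look like: a plain IPython kernel whose environment points it at Spark. Every path, name, and env value below is an illustrative assumption, not copied from the real specs.

```shell
# Write an illustrative kernel.json for a hypothetical spark_yarn_pyspark
# kernel, then validate that it is well-formed JSON.
mkdir -p /tmp/demo-kernels/spark_yarn_pyspark
cat > /tmp/demo-kernels/spark_yarn_pyspark/kernel.json <<'EOF'
{
  "display_name": "PySpark - YARN (illustrative)",
  "language": "python",
  "argv": ["python3", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/usr/lib/spark2",
    "PYSPARK_PYTHON": "/usr/bin/python3",
    "PYSPARK_SUBMIT_ARGS": "--master yarn --driver-memory 2g pyspark-shell"
  }
}
EOF
python3 -m json.tool /tmp/demo-kernels/spark_yarn_pyspark/kernel.json > /dev/null \
    && echo "valid JSON"
```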
NOTE-2: The following commands were run, but the resulting kernel.json files were then edited to make the display_names more readable. Some logo-*.png files were also added.
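The display_name edit described above can be done by hand, or scripted like this hypothetical example; the file path and names are made up for illustration.

```shell
# Create a throwaway kernel.json, then rewrite its display_name the way the
# real kernel.json files were edited after `jupyter toree install`.
mkdir -p /tmp/demo-kernels/spark_yarn_scala
echo '{"display_name": "apache_toree_scala", "argv": ["true"]}' \
    > /tmp/demo-kernels/spark_yarn_scala/kernel.json
python3 - <<'EOF'
import json

path = "/tmp/demo-kernels/spark_yarn_scala/kernel.json"
with open(path) as f:
    spec = json.load(f)
spec["display_name"] = "Spark YARN (Scala)"   # the more readable name
with open(path, "w") as f:
    json.dump(spec, f, indent=2)
EOF
grep display_name /tmp/demo-kernels/spark_yarn_scala/kernel.json
```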
NOTE-3: When you run the following commands, keep two things in mind:
source ./profile.sh
venv=${1:-$(realpath ${base_dir}/../jupyterhub-venv)}

# Spark Local kernels
$venv/bin/jupyter toree install --replace \
    --python_exec /usr/bin/python3 \
    --spark_home="/usr/lib/spark2" \
    --log-level=INFO \
    --kernel_name "Jupyter Spark Local" \
    --spark_opts="--name='Jupyter Spark Local' --master local[4] --driver-memory 6g --jars hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-job.jar,hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar" \
    --interpreters="Scala"

$venv/bin/jupyter toree install --replace \
    --python_exec /usr/bin/python3 \
    --spark_home="/usr/lib/spark2" \
    --log-level=INFO \
    --kernel_name "Jupyter SparkSQL Local" \
    --spark_opts="--name='Jupyter SparkSQL Local' --master local[4] --driver-memory 6g --jars hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-job.jar,hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar" \
    --interpreters="SQL"

$venv/bin/jupyter toree install --replace \
    --python_exec /usr/bin/python3 \
    --spark_home="/usr/lib/spark2" \
    --log-level=INFO \
    --kernel_name "Jupyter SparkR Local" \
    --spark_opts="--name='Jupyter SparkR Local' --master local[4] --driver-memory 6g" \
    --interpreters="SparkR"

# Spark YARN kernels
$venv/bin/jupyter toree install --replace \
    --python_exec /usr/bin/python3 \
    --spark_home="/usr/lib/spark2" \
    --log-level=INFO \
    --kernel_name "Jupyter Spark" \
    --spark_opts="--name='Jupyter Spark' --master=yarn --driver-memory 2g --executor-memory 8g --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=64 --jars hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-job.jar,hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar" \
    --interpreters="Scala"

$venv/bin/jupyter toree install --replace \
    --python_exec /usr/bin/python3 \
    --spark_home="/usr/lib/spark2" \
    --log-level=INFO \
    --kernel_name "Jupyter Spark Large" \
    --spark_opts="--name='Jupyter Spark Large' --master=yarn --driver-memory 4g --executor-memory 8g --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=128 --jars hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-job.jar,hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar" \
    --interpreters="Scala"

$venv/bin/jupyter toree install --replace \
    --python_exec /usr/bin/python3 \
    --spark_home="/usr/lib/spark2" \
    --log-level=INFO \
    --kernel_name "Jupyter SparkSQL" \
    --spark_opts="--name='Jupyter SparkSQL' --master=yarn --driver-memory 2g --executor-memory 8g --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=64 --jars hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-job.jar,hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar" \
    --interpreters="SQL"

$venv/bin/jupyter toree install --replace \
    --python_exec /usr/bin/python3 \
    --spark_home="/usr/lib/spark2" \
    --log-level=INFO \
    --kernel_name "Jupyter SparkR" \
    --spark_opts="--name='Jupyter SparkR' --master=yarn --driver-memory 2g --executor-memory 8g --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=64" \
    --interpreters="SparkR"