To save myself an hour next time I need to install JupyterLab (the 2020-era successor to the classic Jupyter Notebook), here’s a step-by-step install from scratch on self-hosted AWS, warts and all. Key points:
- We want to use jupyter-lab instead of the legacy jupyter-notebook
- The notebooks/labs must be private; it’s absurd to think otherwise
- We want online / cloud access to our work
- We want the minimum of effort to install and get working with Jupyter
- We want the minimum complexity to maintain the installation
Core investigations + conclusions
Hosting: Use AWS
I researched many sources of 3rd-party hosting for Jupyter and they all … well, sucked. There was a great article on free hosting – but all of those forced your private data to be made public, for anyone and everyone to take.
I looked at the ones that had private “upgrades” available from their free/public tier, but they came with intensely complicated migration procedures (e.g. you lose access to all your existing data) and confusing, difficult pricing (mostly designed for ML programmers, and irrelevant to people not doing ML).
Finally I went through the most widely-referenced “Jupyter hosting” companies I found via Google, forum posts, reddit, etc. Some of these looked good, but nearly all of them were running either legacy Jupyter (which was superseded 2 years ago!) or their own custom “this isn’t jupyter, it’s a thing-that-is-a-bit-like-jupyter, missing core features, with our own proprietary changes”, which is undesirable. Poster-child there was Google who – yet again – created a pointlessly proprietary system that you’re locked into, and which has a high probability Google will shut down and delete all your data with no upgrade option (as they are currently doing multiple times a year :)).
Eventually I came back to self-hosting: how hard would this be? There are multiple guides on this, both from 3rd parties (some of them broken/incorrect with key steps missing) and from the official Jupyter website. Custom instructions were provided for AWS, which was a good sign (along with GCE, Microsoft Azure, etc.). Reading through, the instructions were essentially:
“Steps 1-20: Set up a new default AWS cloud instance (identical to any AWS hosting).
Steps 21-25: Do a couple of Jupyter-specific post-install steps.”
…with the vast majority of the setup work being AWS (not a great install process, but if you’ve used AWS you’ve already done this many many times and are comfortable with it), self-hosting seemed the best way forwards.
AWS: tiers and options
Confusingly, there are two branches of Jupyter self-hosting: Jupyter and JupyterHub. The latter is a multi-users-working-on-one-machine setup, with each user having their own login username and personal password, private or semi-private notebooks, etc. For myself as a single user that was overkill – although, ironically, the install instructions were even simpler than for the main Jupyter.
None of the install guides were detailed (or useful) in their advice on picking an AWS instance, apart from the JupyterHub one. With some digging on reddit it turns out that 100MB or so of RAM should be more than enough for running notebooks, with “as much more as the size of data you intend to hold in RAM”. If you’re coding very lazily you might try loading massive source data into RAM but … we’re sensible programmers, we don’t do that.
Summary: The smallest AWS instance - t2.nano - works fine with 0.5GB RAM, and the (2020) minimum EBS disk of 8GB SSD.
(If you hit the RAM limit: spin down your AWS instance, replace it with a t2.micro, attach the EBS to the new AWS, then spin it up again. This is exactly what cloud hosting was invented for! It’s easy, so no need to worry about it here)
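If you ever do the swap, it can be scripted with the AWS CLI – and in practice you don’t even need a replacement instance, since the type can be changed in place while the instance is stopped (the EBS root volume stays attached). A sketch, assuming the AWS CLI is configured; the instance ID is hypothetical:

```shell
# Sketch: resize a stopped EC2 instance in place (instance ID is hypothetical)
ID=i-0123456789abcdef0

aws ec2 stop-instances --instance-ids "$ID"
aws ec2 wait instance-stopped --instance-ids "$ID"

# Change the instance type; the EBS root volume stays attached throughout
aws ec2 modify-instance-attribute --instance-id "$ID" \
    --instance-type '{"Value": "t2.micro"}'

aws ec2 start-instances --instance-ids "$ID"
```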
UPDATE:
Jupyter doesn’t need much RAM.
But node.js – which a lot of UI-related Jupyter plugins rely on – requires gigabytes of RAM to work at all, and crashes with useless error messages rather than handling its own errors.
It’s quite eye-opening how bad node.js code is. (I believe this isn’t node.js itself; rather it’s the ecosystem and habits of people who write node.js apps – there’s nothing wrong with their choices as such, but they’ve prioritised embedding other people’s code they don’t understand (and mostly don’t need) over spending a few minutes writing simple code themselves.) The net result is that if you want to install (or even reconfigure!) any of the JavaScript-related plugins – which is all of the UI ones – then you’ll need to temporarily boost your AWS instance to a larger size each time, and downgrade it again afterwards. 2GB RAM is recommended for node.js. To be clear: for actually using Jupyter, 200MB is more than enough – i.e. node.js alone requires 10x as much RAM as the entire system you’re running!
AWS install summary
If you’ve created EC2 instances before, the short version here is:
- A small EC2 instance
- Running standard ubuntu (18.04 or 20.04 as of this writing)
- A security-group which unblocks SSH (for manual install tasks) + one port (for web access)
Traditionally a lot of people unblock port 8888, but there appears to be no reason for this other than installation being done by people who don’t really know their way around web-server configs – a bit like the node.js setup: people copying random pieces of instructions without actually reading them. In months of usage I’ve found no extra ports need to be unblocked beyond the one you chose for web access (why create a security hole when you don’t need to?).
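For reference, that security group can also be set up from the AWS CLI. A sketch – the group ID is hypothetical, and 203.0.113.7 stands in for your own IP; note only SSH plus the single chosen web port are opened:

```shell
# Sketch; sg-0abc123def456 is a hypothetical security-group ID
SG=sg-0abc123def456

# SSH – ideally restricted to your own IP rather than the whole internet
aws ec2 authorize-security-group-ingress --group-id "$SG" \
    --protocol tcp --port 22 --cidr 203.0.113.7/32

# The one web port you'll point your browser at (8888 here, but any port works)
aws ec2 authorize-security-group-ingress --group-id "$SG" \
    --protocol tcp --port 8888 --cidr 0.0.0.0/0
```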
Jupyter initial install
Installing Jupyter core is straightforward … and broken in ubuntu 18.04 (the main ubuntu release when I did this in early 2020, unless you’ve upgraded to 20.04 already). By default it will fail to install – this appears to involve known bugs, but the workarounds are so quick to do that no-one is in a rush to fix it. I’m assuming that in ubuntu 20.04 it’s been fixed.
Step 1: Update Ubuntu
Do your apt updates etc. (I prefer to use aptitude for all apt management so I can see what’s happening and make more informed choices about versions etc. – for me it’s hitting ‘u’ and letting aptitude do the rest automatically).
Step 2: install jupyter
The apt package is named ‘jupyter’ and should automatically bring in all the required modules: python3, pip, etc. It should also set up a basic install of jupyter with Python notebooks enabled. The magic of debian/apt!
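Steps 1 and 2 together are just a couple of commands (aptitude is optional – plain apt does the same job):

```shell
sudo apt update && sudo apt upgrade   # step 1: update Ubuntu
sudo apt install jupyter              # step 2: pulls in python3, pip, etc. as dependencies
```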
Step 3: install the python parts of Jupyter (specifically: JupyterLab)
Here’s where it goes python-y and stops working. Unfortunately Ubuntu does not (yet) have an apt package for JupyterLab that actually works – it has to be installed over the top of a legacy Jupyter installation (which we already have thanks to steps 1 and 2).
What follows is all standard for python developers, nothing new. But if you’re not a python developer… you have to “install” jupyterlab using python’s built-in self-management tools, which aren’t as good as the OS ones. No more apt for you :(. For extra pain: python wants you to jump through hoops to keep multiple copies of itself on the system – because the packages aren’t managed well enough for the OS to use python and for you to use it at the same time.
In my case: I have a dedicated server that is doing nothing but running Jupyter, and I have no intention of upgrading python on one without upgrading the other. I was happy to use a single Python install, and avoid a lot of problems and confusion. Worst case, if one does something weird in a future release I’ll simply wipe the server and re-install – we’re in the decade of commodity cloud computing, where re-installing an OS takes seconds, not hours.
In theory (from the docs), you run:
pip3 install jupyterlab
In practice, that:
- Installs it in a hidden folder buried inside the home-directory of the current user
- Fails to install it correctly: it cannot work
- Leaves a different version of Jupyter on the system that is missing the new features
FAIL. Maybe related to the expectation that you have multiple Python installs – but I wasn’t going to mess around with that added complexity only to discover that it was broken anyway (if it doesn’t work out of the box, I don’t trust it to work out of a more complex box…).
The nearest I found to an official workaround appears to be: update your linux user’s PATH to make it preferentially run the hidden-secret-silently-mis-installed version of Jupyter. Do that and everything appears to work fine. Or … just remember to always type
/home/ubuntu/.local/bin/jupyter
everywhere that you would normally have typed:
jupyter
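If you go the PATH route, one line in your shell profile is enough. A sketch, assuming pip installed into the default per-user location (~/.local/bin):

```shell
# Put pip's per-user bin directory ahead of the system one.
# Append this line to ~/.bashrc so it applies to every future shell:
export PATH="$HOME/.local/bin:$PATH"
```

After a `source ~/.bashrc`, plain `jupyter` (and `jupyter-lab`) will resolve to the pip-installed version.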
Step 4: configure jupyter to work correctly
Out of the box Jupyter won’t work on a server: it’s been deliberately designed to fail even if you installed it correctly (this is not a bad thing: it’s designed to be idiot-proof for people who install on their personal laptop). To make it run as a server you can either waste time messing around with ssh-tunneling (why? WHY??!! Why have so many online guides told people to do this? Blind leading the blind, it seems…).
…Or: you can simply enable the server mode :), which works fine and is easier to setup (and cleaner).
But at first you cannot even do that: out of the box jupyter can’t be configured, because it’s missing its own config file. Fortunately it has a ‘feature’ where you can get it to auto-create the config file (why it doesn’t do this as part of installation I have no idea), but it’ll be placed somewhere super-annoying:
jupyter notebook --generate-config
…and helpfully it’ll immediately tell you where it created it, probably:
/home/ubuntu/.jupyter/jupyter_notebook_config.py
As per this stackoverflow answer (https://stackoverflow.com/a/43500232/153422) you only need to change two lines to enable server mode. One line allows server access, and the other says “listen on all IP addresses”. You can find the commented-out lines in the file and uncomment them + change their values, or you can just copy/paste the values from that SO answer into the bottom of the file.
While you’re there, it’s worth changing / inserting the following default values:
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888 # or whatever port you unblocked in your AWS security group
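Putting it all together, the bottom of the config file ends up looking something like this. The first two lines are the ones from that stackoverflow answer; exact option names can vary slightly between notebook versions, so treat this as a sketch:

```python
# jupyter_notebook_config.py – minimal server-mode settings (sketch)
c.NotebookApp.allow_origin = '*'      # allow access from anywhere
c.NotebookApp.ip = '0.0.0.0'          # listen on all IP addresses, not just localhost
c.NotebookApp.open_browser = False    # headless server: don't try to launch a browser
c.NotebookApp.port = 8888             # must match the port unblocked in your AWS security group
```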
Step 5: run Jupyter-lab and login for the first time
To run jupyter-lab you need something like this:
~/.local/bin/jupyter-lab
Three things now happen:
- Jupyter is up and running and spits out lots of info to the command-line. Check it for errors – there shouldn’t be any
- It specifically tells you which IP addresses it’s listening on.
- It gives you a magic, temporary, token to login with.
The IP address list is wrong, but … at least it will show you that it’s listening to more than just localhost and 127.0.0.1 (if it’s only listening to them then you failed to edit the config properly).
It tries to guess the addresses using OS lookups, but it does so naively and gets them wrong: it fetches the private addresses (unreachable from outside) instead of the public ones. But if you’ve used AWS before you know how to get the public IP and/or public DNS from your EC2 management console (in the AWS management console, select the EC2 instance, click the “Connect” button, and you get a popup telling you exactly how).
So use the correct IP/DNS, go directly to that server / port address (ignore the ?token rubbish that the jupyter commandline app wanted you to use). You’ll get a login page where you can copy/paste the token from the commandline output and immediately login.
Or, better: jupyter’s login page helpfully gives you the option here to use the temporary token to generate a password you can use in future instead.
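If you’d rather set the password up front from the command line, jupyter ships a helper that prompts for a password and stores a salted hash under ~/.jupyter/:

```shell
# Prompts twice, then writes the hashed password to
# ~/.jupyter/jupyter_notebook_config.json
~/.local/bin/jupyter notebook password
```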
Step 6: Switch to TLS/HTTPS (make the web-browser connection secure)
Don’t self-sign, self-signing is being aggressively blocked by web-browser vendors (Google, Mozilla, Apple, etc) – again: ignore the bad advice and articles written by Jupyter users who don’t know what they’re doing. Instead use the free LetsEncrypt service for industry-standard automatically renewing signed certificates that you can fire-and-forget.
Follow the jupyter main docs directly:
https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#using-lets-encrypt
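Once LetsEncrypt has issued the certificate, pointing jupyter at it is two more config lines. A sketch, with example.com standing in for your real domain:

```python
# jupyter_notebook_config.py – TLS settings (sketch; substitute your own domain)
c.NotebookApp.certfile = '/etc/letsencrypt/live/example.com/fullchain.pem'
c.NotebookApp.keyfile  = '/etc/letsencrypt/live/example.com/privkey.pem'
```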
Step 7 (final!): make Jupyter run as a service
This part shows how poorly this python app is integrated with the host OS – Jupyter doesn’t run as a service. You have to manually convert it into one.
Most people do this the brutally simple way: either run the command line with ampersand (i.e. linux’s “run-in-background but if you lose your ssh connection it might die”) or run it with ‘screen’ (linux’s more advanced multi-tasking app that makes it easy to re-access later and will survive even if you lose SSH connection or logout of ubuntu).
Much much better would be to convert it into a full ubuntu service – I googled the latest recommended instructions for this and followed them, but sadly I didn’t save the URL. It depends slightly on whether you’re using new-style ubuntu (with systemd) or legacy (init) – but creating new ubuntu services is very common and quite easy, so I’ll leave it to you to Google your preferred approach :).
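For the systemd case, a minimal unit file is only a few lines. A sketch, assuming the pip-installed jupyter-lab from step 3 and the default ‘ubuntu’ user:

```ini
# /etc/systemd/system/jupyter.service
[Unit]
Description=JupyterLab
After=network.target

[Service]
User=ubuntu
ExecStart=/home/ubuntu/.local/bin/jupyter-lab
WorkingDirectory=/home/ubuntu
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Then `sudo systemctl daemon-reload && sudo systemctl enable --now jupyter` starts it immediately and on every boot.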