Web development and deployment

Introduction

Note: This document describes the new situation after the upcoming migration in April 2023.

CLST offers various web-based services via our webservice portal page at https://webservices.cls.ru.nl. In this document we explain in detail how you can develop and deploy such web-based services. It does not apply to simple (static) websites.

All web-based services are hosted on a separate production server called lightning. We make an explicit distinction between development and production scenarios. The former is when you as a developer develop and test your service; at this point it is not accessible to the general public and you can use any ponyland server for development. The latter is when your service or application is deployed for the end-user; as mentioned, we have a dedicated server for this.

Web frameworks

We recommend use of one of the following frameworks for web applications and webservices:

  • CLAM - CLAM enables you to quickly build a RESTful webservice with the added bonus of offering a generic web user interface for human end-users. It is well suited for longer running batch-tasks. It also integrates nicely with the CLARIAH/CLARIN authentication infrastructure with little extra effort; this is explained in a later section. Many of the webservices are powered by CLAM. CLAM basically wraps itself around your underlying back-end system, which may be a command-line tool.
  • Django - A well-known Python-based web framework.
  • Flask - Another well-known Python-based web framework.

CLAM

CLAM wraps itself around your underlying back-end system, which may be a command-line tool. Building a CLAM webservice consists of:

  1. writing a service specification that specifies what the expected input, output and parameters of your system are.

  2. writing a wrapper script that invokes your underlying system.
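
As a rough illustration of step 2, a wrapper is essentially a small program that takes the uploaded input files, calls your back-end once per file, and collects the output. The sketch below is plain Python and deliberately does not use CLAM's own API (the real template is generated by clamnewproject). The function names, the .out suffix and the "tool input output" calling convention are assumptions made purely for illustration.

```python
import subprocess
from pathlib import Path


def build_command(tool, input_file, output_file):
    """Return the argument list for one invocation of the back-end tool.

    Assumption: the tool is called as `tool <input> <output>`.
    """
    return [tool, str(input_file), str(output_file)]


def run_backend(tool, input_dir, output_dir):
    """Run the back-end once per input file and return {filename: exit code}."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for input_file in sorted(Path(input_dir).iterdir()):
        if not input_file.is_file():
            continue
        # Hypothetical naming scheme: one ".out" file per input file.
        output_file = output_dir / (input_file.name + ".out")
        completed = subprocess.run(
            build_command(tool, input_file, output_file), check=False
        )
        results[input_file.name] = completed.returncode
    return results
```

A non-zero exit code in the returned dictionary signals that the back-end failed for that file; a real wrapper would report this back to the user.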

Using CLAM is not a requirement; if you prefer to write a webservice from scratch in any language of your choice, then that is of course an option too. CLAM will, however, probably save you a lot of work so you can focus on the actual underlying system instead.

To work with CLAM, we refer to the CLAM documentation. CLAM is written in Python and can be installed with a simple pip install clam, preferably in a Python virtual environment (see next section).

You can then use the clamnewproject command to start a new CLAM project, which will guide you through the process.

Development Environment

You can use any ponyland server as your development environment. For all Python-based solutions, it is strongly recommended that you work in your own Python virtual environment. You may create one from scratch as follows:

$ python3 -m venv env

The last argument is the name for the virtual environment; you can pick anything you like. A directory of that name will be created in your current working directory, containing the virtual environment. The environment is subsequently activated as follows. You will need to do this every time you open a new shell and want to work with the environment:

$ source env/bin/activate

Once in the environment (your prompt should usually change to indicate you are inside), install a framework such as CLAM (or Django or Flask) as follows:

$ pip install clam
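
If the prompt does not make it obvious whether the environment is active, you can also ask Python itself. The following is a minimal sketch that assumes the environment was created with python3 -m venv as shown above; the function name is our own.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("packages will be installed under:", sys.prefix)
```

If this reports that no environment is active, pip install would target the system or user installation rather than your project environment.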

We used to recommend LaMachine for development environments, but that solution was deprecated in 2022.

Accessing your webservice during development

Web frameworks usually come with a development webserver. This is only meant for development; it typically runs without any encryption (HTTP instead of HTTPS).

  • For CLAM: Projects by default come with a script called startserver_development.sh to start the server in development mode. This script simply installs the webservice and invokes CLAM to run it locally, as follows:

    $ python setup.py develop
    $ clamservice -d yourservice.yourservice

  • For Django:

    $ django-admin runserver

Both will inform you on what host and port the development webserver runs, depending on how you configured it. If you run it locally on your own system, you can simply point your browser to this address to test the webservice and you're done.

However, if you run the development webserver on a ponyland server, then things are complicated somewhat by the fact that our firewall does not allow direct access to it. This shouldn't bother you if you already run a VNC session on the server.

If you are outside the Radboud network, you first need to connect over VPN. Then you can connect to the ponyland server at the advertised port, in your browser, e.g. something like http://mlp02.science.ru.nl:12345.
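
When the browser cannot load the page, it helps to find out whether the port is reachable at all (VPN or firewall problem) or whether the webservice itself is misbehaving. The sketch below only tests whether a TCP connection can be opened; the function name and the default timeout are assumptions, and the host and port in the comment are the example values from the text above.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


# Example call (not executed here):
#   is_reachable("mlp02.science.ru.nl", 12345)
```

A result of False while the development server is running suggests a network-level issue, such as a missing VPN connection.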

Advanced: SSH Tunnels

Alternatively, advanced users with SSH access to lilo.science.ru.nl can create a double SSH tunnel:

$ ssh -L 8888:localhost:9999 yourusername@lilo.science.ru.nl ssh -L 9999:localhost:8888 -N mlp02

We use 9999 as an intermediate port (it can be any number as long as it's not in use). This tunneling mechanism is explained in detail here. This procedure will fail if a port is already in use.

If everything went well, you can connect to http://127.0.0.1:8888 to access your web-based service/application (the development webserver typically speaks plain HTTP).
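
To make the role of the three port numbers in the tunnel command explicit, the sketch below assembles the same command from its parts and checks beforehand whether the local port is still free. Both function names are hypothetical helpers; the check only covers your own machine, not the intermediate port on the gateway.

```python
import shlex
import socket


def local_port_is_free(port):
    """Return True if nothing on this machine is bound to 127.0.0.1:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind(("127.0.0.1", port))
        except OSError:
            return False
        return True


def double_tunnel_command(user, gateway, target, local_port, remote_port,
                          intermediate_port=9999):
    """Build the argument list for the double SSH tunnel.

    local_port        -> port you open in your browser (127.0.0.1:local_port)
    intermediate_port -> port used on the gateway to chain both tunnels
    remote_port       -> port the development webserver listens on at target
    """
    outer = ["ssh", "-L", f"{local_port}:localhost:{intermediate_port}",
             f"{user}@{gateway}"]
    inner = ["ssh", "-L", f"{intermediate_port}:localhost:{remote_port}",
             "-N", target]
    return outer + inner


command = double_tunnel_command("yourusername", "lilo.science.ru.nl", "mlp02",
                                local_port=8888, remote_port=8888)
print(shlex.join(command))
```

With the values used here, the printed command is identical to the one given above; change remote_port to whatever port your development webserver reports.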

Deployment in production

Once your application is mature enough and you have tested it in your own development environment, we can deploy it and make it available on https://webservices.cls.ru.nl.

To make this happen, you provide an OCI/Docker container image (or a Dockerfile that builds it) and we deploy it.

  • This solution adheres to the CLARIAH Software Requirements and CLARIAH Infrastructure Requirements.
  • It ensures your software can be easily deployed anywhere (at least by all CLARIAH infrastructure providers).
  • If you use CLAM (>=3.2), a Dockerfile template was generated for you already.
  • If your application uses multiple containers (container orchestration), provide a docker-compose.yml.
  • This is typically the preferred and easiest solution, especially when using CLAM and the CLARIAH authentication infrastructure.
  • Your docker container will run inside the webservices LXC/Incus container on lightning. You have no SSH access (nor should you need it).
  • Your application will be deployed as https://webservices.cls.ru.nl/your_application.
  • For this option, contact us with the following information:
    1. Your docker image (preferably via Docker Hub, Quay.io or any other container registry)
    2. URL to the source git repository where your application's source code (incl. the Dockerfile) resides. This will be used to extract metadata for the portal page.
    3. If you use CLARIAH's authentication infrastructure (recommended), pass us any OAuth2 endpoints; we will provide you with a client ID and client secret. If you use CLAM, you can skip this as we can derive this automatically for you.

Note: There used to be an option using LaMachine, but that solution was deprecated in 2022.

Authentication

Your web application likely needs an authentication layer to be secure. This means someone is responsible for keeping a database with users and their credentials and authorizations, i.e. acting as an identity provider. There are three solutions, in order of recommendation:

  1. Participate in the CLARIAH and CLARIN authentication infrastructure.
  2. Rely on other identity providers (e.g. Google, GitHub, GitLab).
  3. Use your own authentication solution and identity provider (i.e. manage a user database yourself; this is the naive approach).

Prior to April 2023, we maintained our own user database and registration procedure, shared by numerous services. This is now deprecated.

Participate in the CLARIAH and CLARIN authentication infrastructure

Most of the services on https://webservices.cls.ru.nl participate in the CLARIAH and CLARIN infrastructures, which means that they rely on external infrastructure for things like authentication. This allows researchers to log in with their institutional credentials, provided that their institute is part of the CLARIN Service Provider Federation. Radboud University and many Dutch universities and research institutes are part of this. If you have users that do not belong to an institute, they may register separately with CLARIN. This, however, is not an instantaneous process as there is a human in the loop who reviews such applications.

Authentication proceeds over OpenID Connect (OAuth2) and is described here.
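
For orientation, an OpenID Connect login starts by redirecting the user's browser to the identity provider's authorization endpoint with a handful of standard query parameters. The sketch below builds such a URL for the authorization code flow. The endpoint, client ID and redirect URI are placeholders (assumptions, not the actual CLARIAH values), and frameworks such as CLAM take care of this step for you.

```python
import secrets
from urllib.parse import urlencode


def build_authorization_url(authorization_endpoint, client_id, redirect_uri,
                            scope="openid", state=None):
    """Return (url, state) for an OpenID Connect authorization code request."""
    if state is None:
        # Random value that is checked again when the user is redirected back,
        # protecting against cross-site request forgery.
        state = secrets.token_urlsafe(16)
    query = urlencode({
        "response_type": "code",   # authorization code flow
        "client_id": client_id,    # issued when the client is registered
        "redirect_uri": redirect_uri,
        "scope": scope,            # "openid" marks this as an OIDC request
        "state": state,
    })
    return f"{authorization_endpoint}?{query}", state


# Placeholder values for illustration only.
url, state = build_authorization_url(
    "https://idp.example.org/authorize",
    "my-client-id",
    "https://webservices.example.org/your_application/callback",
)
print(url)
```

After a successful login the identity provider redirects back to the redirect URI with a code parameter, which the application exchanges for tokens using its client ID and client secret.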

Authentication using other identity providers

This typically also involves OpenID Connect (OAuth2) like the above solution. You can use established big-tech identity providers such as Google, Microsoft, GitHub, Facebook, etc. Do note that this has privacy implications: the identity provider will know who accesses your application and when they do so.

Version control: Git

It is always strongly recommended to keep your webservice's source code in version control; we typically use git. You can host your repository, for example, on GitHub or on our private GitLab instance.

Take note of the following:

  • The need for git repositories applies to both the code using a web framework like CLAM, as well as any potential back-end systems you are developing. You may combine subsystems in a single git repository, or use separate ones for each.
  • Do not ask us to deploy something just by pointing at a location on ponyland where you're developing it.
  • The contents of any Python virtual environment you may be using should not be checked in under version control.
  • If your underlying system consists of large models (say anything over 50-100MB) then those should not be added to git either, but offered separately. We offer a place on the server where you can place big archives (tar.gz, zip) for download, see downloads. Within the git repository for the underlying system, we recommend you add a small shell script that downloads these resources.
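
The download step from the last point can be as small as the following sketch, shown here in Python rather than shell. The function name is our own and the URL in the comment is a placeholder, not an actual download location. The archive is only fetched when it is not already present.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def fetch_and_unpack(url, target_dir):
    """Download an archive (tar.gz, zip) into target_dir and unpack it there."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last path component of the URL.
    archive_path = target_dir / Path(urlsplit(url).path).name
    if not archive_path.exists():
        with urllib.request.urlopen(url) as response, \
                open(archive_path, "wb") as out:
            shutil.copyfileobj(response, out)
    # The archive format is derived from the file extension.
    shutil.unpack_archive(str(archive_path), str(target_dir))
    return archive_path


# Example call with a placeholder URL (not executed here):
#   fetch_and_unpack("https://example.org/downloads/models.tar.gz", "models")
```

Keeping the download location in one small script means the git repository stays lightweight while the models remain reproducibly obtainable.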

We recommend doing explicit version releases of your software at moments in time when it is deemed stable. Each version is represented by an appropriately named tag in the git repository, for example: v0.1.2. When you use GitHub's release mechanism, these tags will be created for you automatically.

Your software should ideally be installable using a mechanism that fits the ecosystem the software is written in. For Python software for example, this means having a setup.py or pyproject.toml. Note that CLAM will create a template for one automatically when you run clamnewproject, so you need not worry about it.

Kaldi and Kaldi_NL

Various of our speech webservices make use of Kaldi. In addition, they often make use of models and scripts offered by Kaldi_NL on top of that.

Working with Kaldi can be a bit daunting as it is a complex mess of various scripts that are not always very cleanly structured. Kaldi_NL is an effort to make this a bit more accessible.

If you intend to develop your own system and contribute back to Kaldi_NL, then it is recommended to fork the Kaldi_NL project on GitHub and base your work on that (i.e. fork the project on GitHub and add an extra git remote in e.g. $LM_PREFIX/opt/kaldi_nl). You can later send a pull request to contribute your changes back and make them available for the wider research community. Please read the contributor guidelines and also check the software quality guidelines.

It may help to look at and copy from existing ASR projects that have been made available as a webservice. A good example is the Dutch ASR system, also known as "oral history", which is offered as an extra in Kaldi_NL:

  1. Webservice source code: https://github.com/opensource-spraakherkenning-nl/asr_nl
  2. ASR decoder scripts: https://github.com/opensource-spraakherkenning-nl/Kaldi_NL
  3. The models are offered separately through a download link, see the download script in contrib/oral_history in the decoder repository.

Another example is the English ASR system, where the decoder scripts are not merged back into Kaldi_NL but kept as a standalone repository:

  1. Webservice source code: https://github.com/opensource-spraakherkenning-nl/eng_ASR
  2. ASR decoder scripts: https://gitlab.science.ru.nl/clst-asr/eng_asr_decoder
  3. The models are offered separately through a download link, see the download script in the decoder repository.

Do be aware that the CLAM-based webservices are not very suitable when you have low-latency requirements. Conversely, they are well suited for long-running batch processing.

Administrator Notes

This final section is for ponyland administrators only.

The private git repository at https://gitlab.science.ru.nl/ponyland-admin/webservices.cls.ru.nl contains the infrastructure code for everything running in the webservices LXC/Incus container on lightning. Please see the README there for further information.