PeerTube is a powerful open-source platform that allows institutions to quickly set up a feature-rich portal giving users access to their video collections. For videos containing spoken audio, subtitles or closed captions make the content more accessible and easier to find:
- Making spoken audio accessible to viewers who are hearing impaired
- Making spoken audio accessible to viewers who are not native speakers of the spoken language
- Making videos easier to find, by extracting keywords and annotations from the words spoken in them
Creating subtitles manually involves a lot of human work, which makes it less feasible in the context of audiovisual archives with large collections. An increasingly popular alternative is to transcribe spoken audio using automatic speech recognition (ASR). As part of this project, we have explored the use of ASR transcription to generate subtitles for videos published via our PeerTube instance.
Have a look at this video about ‘The work of the harbour cleaning service’ for an example of a PeerTube video that is fully subtitled using the workflow described in this post.
This post walks through the process of generating ASR transcripts for Dutch-language videos using Kaldi NL, converting them into a subtitle format that PeerTube supports, and adding them to their respective videos on the PeerTube instance.
Installing Kaldi NL
An open-source solution available for ASR in Dutch is Kaldi NL, which builds on the broader Kaldi ASR software. The easiest way to install Kaldi NL is to use the bootstrapping script included in LaMachine. LaMachine is a linguistics toolkit which supports the installation of a rich selection of tools, but can also be customized to only set up a barebones installation of Kaldi NL and its requirements.
Installation of the required Kaldi packages is most easily done on commonly used Linux distributions. In our case we used Ubuntu 20.04.2 LTS. For the full installation notes please refer to the installation instructions included in the repository.
Generating ASR transcripts using Kaldi NL
The first step is to ensure Kaldi NL has access to the source videos to be transcribed. If the videos are published on the PeerTube instance, one way to download them is to use their torrent URIs with an application such as transmission-cli.
Kaldi NL can be started by running the Kaldi NL docker image, and specifying a bind mount to give it access to a directory on the host machine. In the following example, we have created a directory in the home directory called docker_share. The files to be processed should be placed inside this folder, nested in a subdirectory called ‘input’. Our full example command becomes:
docker run -it --mount type=bind,source=/home/ubuntu/docker_share,target=/docker_share proycon/lamachine:lamachine_1
Then, from within the running docker container, the following command can be used to transcribe the videos in the input directory:
./decode_OH.sh /<input-directory>/* /<output-directory>/
This will populate the output directory with a TXT file and a CTM file, containing the generated transcript.
Preparing files for ingestion
The CTM and TXT formats are not directly supported for subtitles in PeerTube, so we will convert them to the supported SRT format.
The CTM output from Kaldi NL is formatted as a list of words per timecode. As reconstructing sentences with appropriate length from this source would be non-trivial, we will work with the TXT file. The TXT file is formatted with a sentence on each line, followed by the filename, the sentence number and the starting timecode, as shown in this subtitle about the port city of Rotterdam:
Zoals elke havenstad kent ook Rotterdam. (BG_10284.00001 0.070)
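To make the line layout concrete, here is a minimal sketch of how one such line could be split into its parts. It is not the parser from our script; the regular expression and the names LINE_PATTERN and parse_kaldi_txt_line are illustrative, and the layout is assumed from the single example line above.

```python
import re

# Assumed layout, inferred from the example line only:
#   <sentence text> (<file-id>.<sentence-number> <start-seconds>)
LINE_PATTERN = re.compile(
    r"^(?P<text>.*?)\s*"
    r"\((?P<file_id>\S+)\.(?P<number>\d+)\s+(?P<start>\d+(?:\.\d+)?)\)\s*$"
)


def parse_kaldi_txt_line(line):
    """Split one transcript line into (text, file_id, sentence_number, start_seconds)."""
    match = LINE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"Unrecognised transcript line: {line!r}")
    return (
        match.group("text"),
        match.group("file_id"),
        int(match.group("number")),
        float(match.group("start")),
    )


print(parse_kaldi_txt_line(
    "Zoals elke havenstad kent ook Rotterdam. (BG_10284.00001 0.070)"
))
```

Anchoring the pattern at the end of the line means that parentheses inside the sentence itself should not confuse the match, since only the final parenthesised group is treated as metadata.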
The target SRT format uses a subtitle index, a starting and ending timecode, and the subtitle text itself. The above example would look like this in the SRT format:
2
00:00:00,070 --> 00:00:02,750
Zoals elke havenstad kent ook Rotterdam.
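The timecode conversion involved here can be sketched as follows: a time in seconds is turned into the HH:MM:SS,mmm notation and combined with the index and text into one SRT entry. The function names srt_timestamp and srt_cue are hypothetical and are not taken from our script.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timecode (HH:MM:SS,mmm)."""
    # Work in whole milliseconds to avoid floating-point drift.
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def srt_cue(index, start, end, text):
    """Render one SRT entry: index line, timecode line, text line."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


# Reproduces the entry shown above from its raw values.
print(srt_cue(2, 0.070, 2.750, "Zoals elke havenstad kent ook Rotterdam."))
```

Joining the rendered entries with a newline leaves a blank line between them, which is how consecutive entries are separated in an SRT file.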
As is visible from these examples, the SRT format requires information from two lines of the Kaldi NL TXT file: the start of a subtitle comes from its own line, and its end from the starting timecode of the next line. We’ve written a Python script to convert a Kaldi NL TXT file into an SRT file, which can be used as follows:
python3 kalditxt2srt.py -i <kaldi-txt>.txt
The script takes as input a TXT file generated by Kaldi NL and generates a new file with the same name and the SRT extension and format.
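The pairing of consecutive lines can be sketched as below. This is an illustration of the idea, not the code of kalditxt2srt.py: the function name build_cues is hypothetical, numbering from 1 is an assumption, and so is the fixed duration given to the last sentence, which has no following line to take an end time from.

```python
def build_cues(sentences, last_duration=3.0):
    """Turn (start_seconds, text) pairs into (index, start, end, text) cues.

    Each cue ends where the next sentence starts. The final sentence has no
    successor, so it is given an assumed duration of `last_duration` seconds.
    """
    cues = []
    for position, (start, text) in enumerate(sentences):
        if position + 1 < len(sentences):
            end = sentences[position + 1][0]
        else:
            end = start + last_duration  # assumption: no end time is available
        cues.append((position + 1, start, end, text))
    return cues


# The second sentence is a placeholder, not taken from the real transcript.
example = [
    (0.070, "Zoals elke havenstad kent ook Rotterdam."),
    (2.750, "<placeholder for the next sentence>"),
]
for cue in build_cues(example):
    print(cue)
```

Taking the end time from the next start keeps subtitles on screen during pauses in speech; a stricter variant could cap the duration of each cue.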
Automated ingestion of subtitles into PeerTube
Once the video has been processed by the ASR software, and its output transformed into SRT format, the final step is the ingestion of the subtitles into a PeerTube instance. Adding subtitles to existing videos on a PeerTube instance can be scripted via the PeerTube REST API. Using this part of the API requires an OAuth2 authorization token, as described in the documentation.
First, the client ID and secret need to be retrieved. This is done with a simple GET request:
curl https://host.name.nl/api/v1/oauth-clients/local
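When scripting the workflow, the same request can be made from Python using only the standard library. The sketch below assumes that the response is a JSON object with the fields client_id and client_secret; the function names are illustrative.

```python
import json
import urllib.request


def parse_oauth_client(body):
    """Extract (client_id, client_secret) from the JSON response body.

    Assumes the response contains the fields "client_id" and "client_secret".
    """
    data = json.loads(body)
    return data["client_id"], data["client_secret"]


def fetch_oauth_client(base_url):
    """GET the local OAuth client credentials from a PeerTube instance."""
    url = f"{base_url}/api/v1/oauth-clients/local"
    with urllib.request.urlopen(url) as response:
        return parse_oauth_client(response.read())
```

A call such as fetch_oauth_client("https://host.name.nl") would then return the two values needed for the token request.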
Next, an OAuth2 bearer token must be requested for a user account, using the client ID and secret from the previous step:
curl -X POST -d "client_id=<client-id>&client_secret=<client-secret>&grant_type=password&response_type=code&username=<username>&password=<password>" https://host.name.nl/api/v1/users/token >> token.json
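In a script, the token request could look roughly like this. It is a sketch under two assumptions: that the form fields are the same as in the curl command, and that the JSON response carries the token in a field called access_token. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(base_url, client_id, client_secret, username, password):
    """Build the POST request for a bearer token (form-encoded body)."""
    form = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "password",
        "response_type": "code",
        "username": username,
        "password": password,
    }).encode("ascii")
    return urllib.request.Request(
        f"{base_url}/api/v1/users/token", data=form, method="POST"
    )


def extract_access_token(body):
    """Read the token from the JSON response (assumed field: "access_token")."""
    return json.loads(body)["access_token"]
```

Building the body with urlencode escapes characters such as & or = in a password, which would otherwise break a hand-assembled query string. The request is sent with urllib.request.urlopen(request), and the response body is passed to extract_access_token.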
Ingestion of subtitles is done using a call to the captions endpoint of the API. Subtitles are added to existing videos using a PUT HTTP request. For example, the HTTP request shown below adds a subtitle file for a specified language to an existing video, using the bearer token acquired in the previous step:
curl -X PUT -F "captionfile=@<file-name>.srt" -H "Authorization:Bearer <token>" -H "Accept:application/json" -v https://host.name.nl/api/v1/videos/<your-video-identifier>/captions/<language-tag>
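For ingesting many videos in one go, the same upload can be scripted. The sketch below builds the equivalent multipart PUT request with the Python standard library. The function name build_caption_request, the default file name and the application/x-subrip content type of the file part are assumptions; only the form field captionfile, the headers and the URL layout are taken from the curl command.

```python
import uuid
import urllib.request


def build_caption_request(base_url, video_id, language, srt_bytes, token,
                          filename="captions.srt"):
    """Build a multipart/form-data PUT request that uploads one SRT file."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        f"--{boundary}".encode("ascii"),
        (f'Content-Disposition: form-data; name="captionfile"; '
         f'filename="{filename}"').encode("utf-8"),
        b"Content-Type: application/x-subrip",  # assumed type for the SRT part
        b"",
        srt_bytes,
        f"--{boundary}--".encode("ascii"),
        b"",
    ])
    request = urllib.request.Request(
        f"{base_url}/api/v1/videos/{video_id}/captions/{language}",
        data=body,
        method="PUT",
    )
    request.add_header("Authorization", f"Bearer {token}")
    request.add_header("Accept", "application/json")
    request.add_header("Content-Type",
                       f"multipart/form-data; boundary={boundary}")
    return request
```

The request is sent with urllib.request.urlopen(request), which raises an HTTPError if the instance rejects the upload. Looping over a list of video identifiers and SRT files then covers a whole collection.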