Kaggle NBME competition - our writeup

This is a summary of the techniques and tricks our team used in this Kaggle NLP competition. We were given a set of notes, each written by a doctor evaluating a patient. Our goal was to find the parts of each note that described particular features ('patient is a female', 'father died of heart attack', ...).

Backbones and ensembling

As backbones we trained various DeBERTa models, using the Hugging Face implementation. The large model, version 3 (DeBERTa-v3-large), worked particularly well.
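
A minimal sketch of how such a backbone can be loaded (the checkpoint name is the public Hugging Face one; framing the task as one logit per token is an illustrative choice here, not the only possible setup):

  from transformers import AutoTokenizer, AutoModelForTokenClassification

  model_name = "microsoft/deberta-v3-large"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  # one logit per token, interpreted as "this token is part of the feature"
  model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=1)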

For our final solution we ensembled several models, including copies of the same model trained with different seeds.
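
A sketch of the ensembling step, assuming each model yields per-character probabilities for a note (predict_chars is a hypothetical helper wrapping tokenization and inference):

  import numpy as np

  def ensemble_predict(models, text):
      # simple average of per-character probabilities over all models / seeds
      probs = [predict_chars(model, text) for model in models]
      return np.mean(probs, axis=0)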

Keep folds constant for all runs

For all models, we kept the same separation of training data into folds, to allow apples-to-apples comparison of the cross-validation scores.
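
A sketch of fixing the fold assignment once and reusing it in every run, assuming the training data sits in train.csv (a grouped split, e.g. by patient, may be preferable depending on the data):

  import pandas as pd
  from sklearn.model_selection import KFold

  train = pd.read_csv("train.csv")
  kf = KFold(n_splits=5, shuffle=True, random_state=42)
  train["fold"] = -1
  for fold, (_, val_idx) in enumerate(kf.split(train)):
      train.loc[val_idx, "fold"] = fold
  # save once; every training run loads the same assignment
  train.to_csv("folds.csv", index=False)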

Bits and Bytes

To reduce the memory footprint when training the xlarge model (and thus allow larger batches), we used bitsandbytes. Its 8-bit Adam optimizer stores the optimizer states in 8-bit precision instead of the standard 32-bit.

Usage is simple:

  import bitsandbytes as bnb
  # adam = torch.optim.Adam(...) # comment out old optimizer
  adam = bnb.optim.Adam8bit(...) # add bnb optimizer

Gradient accumulation

Together with bitsandbytes we also used gradient accumulation to increase the effective batch size.
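
A minimal sketch of the accumulation loop (accum_steps, the dataloader and compute_loss are placeholders for our actual setup):

  accum_steps = 4  # effective batch size = per-step batch size * accum_steps
  optimizer.zero_grad()
  for step, batch in enumerate(train_loader):
      loss = compute_loss(model, batch)   # hypothetical loss helper
      (loss / accum_steps).backward()     # scale so accumulated gradients average out
      if (step + 1) % accum_steps == 0:
          optimizer.step()
          optimizer.zero_grad()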

Speedup

A significant speedup of inference can be achieved by sorting the test texts by length and padding each batch only to the length of its longest text, rather than to a large predetermined maximum (like 512). Because the sorted texts within a batch have similar lengths, very little compute is wasted on padding tokens.

In principle, the same can be applied to training as well.
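
A sketch of the inference loop under these assumptions (texts is a list of note strings; the batch size is illustrative):

  import torch

  order = sorted(range(len(texts)), key=lambda i: len(texts[i]))  # similar lengths per batch
  batch_size = 32
  all_logits = [None] * len(texts)
  with torch.no_grad():
      for start in range(0, len(order), batch_size):
          idx = order[start:start + batch_size]
          enc = tokenizer([texts[i] for i in idx], padding="longest",
                          truncation=True, return_tensors="pt")
          logits = model(**enc).logits
          for j, i in enumerate(idx):
              all_logits[i] = logits[j]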

Lowercase

For some tokenizers, training can be improved if we first lowercase the text.

Tokenizers and spaces

A significant improvement could be achieved by keeping track of leading spaces. For example, "yy xx" was tokenized as "yy"-" xx" (the space is attached to the second token), while "yy:xx" was tokenized as "yy"-":"-"xx". We must be mindful of this when mapping token-level predictions back to character spans during inference.
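
A small inspection sketch; with a fast tokenizer, return_offsets_mapping shows which characters (including any leading space) each token covers, and a helper like the hypothetical strip_leading_space below can clean up the predicted character spans:

  enc = tokenizer("yy xx", return_offsets_mapping=True, add_special_tokens=False)
  print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
  print(enc["offset_mapping"])  # (start, end) character offsets per token

  def strip_leading_space(text, start, end):
      # drop a leading space that the tokenizer attached to the token
      while start < end and text[start] == " ":
          start += 1
      return start, end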

v2, v3 tokenizers different

The tokenizers of DeBERTa v2 and v3 are different, so any code that depends on the exact tokenization (such as the offset handling above) has to be checked separately for each version.

Remove newlines from predictions

Newline characters (\n, \r) were never part of the annotations, so any predictions our models made on newline characters should be set to zero in post-processing.
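
A sketch of this post-processing step, assuming char_probs holds per-character probabilities for one note:

  import numpy as np

  def zero_out_newlines(text, char_probs):
      # newlines are never annotated, so force their predictions to zero
      probs = np.asarray(char_probs, dtype=float).copy()
      for i, ch in enumerate(text):
          if ch in ("\n", "\r"):
              probs[i] = 0.0
      return probs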

Pseudolabelling

Given a large number of unlabeled texts U, we applied pseudolabelling, which boosted the cross-validation score notably: We first train a model on the labelled data L, obtaining M1. Then we use M1 to predict (pseudo-)labels L1 for the unlabelled data U. We then retrain on L + L1, obtaining a better model M2. This can be iterated several times.
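
A high-level sketch of the loop; train_model and predict_labels are hypothetical stand-ins for our actual training and inference code:

  model = train_model(labeled_data)                      # M1, trained on L
  for _ in range(num_rounds):                            # e.g. 2-3 iterations
      pseudo = predict_labels(model, unlabeled_data)     # L1 = pseudo-labels for U
      model = train_model(labeled_data + pseudo)         # retrain on L + L1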

Pretraining using MLM

Given the large number of unlabeled texts, we can also use masked language modelling (MLM) to adapt the DeBERTa language model before fine-tuning it on our annotation task. There is an example script by Hugging Face (run_mlm.py in the transformers examples).
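
A minimal Python sketch of the same idea using the Trainer API (the file name, model name and hyperparameters are assumptions, and the LM head may be freshly initialized for this checkpoint):

  from datasets import load_dataset
  from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                            DataCollatorForLanguageModeling, Trainer, TrainingArguments)

  model_name = "microsoft/deberta-v3-large"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForMaskedLM.from_pretrained(model_name)

  # unlabeled_notes.txt (assumed file): one patient note per line
  dataset = load_dataset("text", data_files={"train": "unlabeled_notes.txt"})
  tokenized = dataset.map(
      lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
      batched=True, remove_columns=["text"])

  collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
  args = TrainingArguments(output_dir="deberta_mlm", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=2e-5)
  Trainer(model=model, args=args, train_dataset=tokenized["train"],
          data_collator=collator).train()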

Label smoothing

Given the noise in the annotations, we tried label smoothing (targets of 0.1 and 0.9 instead of 0 and 1), but it did not improve the models.
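
The smoothing itself is just a transformation of the targets before the binary cross-entropy loss (logits and targets are placeholders):

  import torch
  import torch.nn.functional as F

  eps = 0.1
  smoothed = targets * (1 - 2 * eps) + eps   # maps 0 -> 0.1 and 1 -> 0.9
  loss = F.binary_cross_entropy_with_logits(logits, smoothed)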

31 May 2022