Problems getting the Transformer to run in a Docker Container

I have this problem in general, but I will explain it using the Transformer Baseline 1:
Instead of downloading the transformer model while the container is being built, as done in Baseline 1, I downloaded the model beforehand and used the COPY command to get it into the container, so that I can later upload my own transformers to TIRA the same way.

That means I have removed this Dockerfile part:

RUN apt-get update \
	&& apt-get install -y git-lfs wget \
	&& wget 'https://raw.githubusercontent.com/tira-io/tira/development/application/src/tira/templates/tira/tira_git_cmd.py' -O '/opt/conda/lib/python3.7/site-packages/tira.py' \
	&& git clone 'https://huggingface.co/webis/spoiler-type-classification' /model \
	&& cd /model \
	&& git lfs install \
	&& git fetch \
	&& git checkout --track origin/deberta-all-three-types-concat-1-checkpoint-1000-epoch-10 \
	&& rm -Rf .git

I downloaded the files beforehand and instead packed them (config.json, pytorch_model.bin, special_tokens_map.json, training_args.bin, model_args.json, scheduler.pt, tokenizer_config.json, and vocab.json) into the container with the COPY command, into a subfolder (in my case /checkpoint instead of /model).

To me, the container seems to have the same content.
I only changed the folder name in the code.

Locally, without a container, I can execute everything, but inside the container something goes wrong and I get this error when I try to run it:

Traceback (most recent call last):
  File "/transformer-baseline-task-1.py", line 63, in <module>
    run_baseline(args.input, args.output)
  File "/transformer-baseline-task-1.py", line 55, in run_baseline
    for prediction in predict(input_file):
  File "/transformer-baseline-task-1.py", line 42, in predict
    model = ClassificationModel('deberta', './checkpoint', use_cuda=False)
  File "/opt/conda/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py", line 471, in __init__
    tokenizer_name, do_lower_case=self.args.do_lower_case, **kwargs
  File "/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1784, in from_pretrained
    **kwargs,
  File "/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1930, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/deberta/tokenization_deberta.py", line 134, in __init__
    **kwargs,
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/gpt2/tokenization_gpt2.py", line 192, in __init__
    with open(merges_file, encoding="utf-8") as merges_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType

I can’t understand this, because the files should actually be there.
Using ls on the folder, I can also verify that the transformer data is in the container.

I tried loading the model with both the full path and a relative path, but I have no idea how to fix this, because the files are in the container and can be loaded both locally and in the baseline transformer container via Hugging Face.

Dear Christian,

I recommend always using absolute paths in a Docker image, not relative paths.
I can only guess from the stack trace that merges_file is None, as this would explain the TypeError.

Without a link to the container or the code, I cannot help much, but I would recommend debugging why merges_file is None.
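
For example, a minimal debugging sketch along these lines (the directory /checkpoint and the DebertaTokenizer class are only my assumptions based on the error message and the stack trace, please adapt them to your setup) would be to list the checkpoint directory and to load only the tokenizer:

import os

from transformers import DebertaTokenizer

# Absolute path inside the container, as recommended above (assumed location).
checkpoint_dir = '/checkpoint'
print(sorted(os.listdir(checkpoint_dir)))

# The stack trace goes through the GPT-2 tokenizer, which expects both a
# vocab.json and a merges.txt; if merges.txt is missing, merges_file is None.
tokenizer = DebertaTokenizer.from_pretrained(checkpoint_dir)
print(tokenizer)

If merges.txt does not show up in the listing, that would already explain why merges_file ends up as None.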

A side note: from the stack trace, I see that your model always sets use_cuda=False.
This is something I also want to improve in the baseline; I will fix it today and drop a message here. A hard-coded use_cuda=False would cause the model to not use a GPU even if one is available, so I would suggest checking whether a GPU is available and setting use_cuda=True if it is, and use_cuda=False otherwise.
I will update the baseline accordingly to reflect this there as well.

I hope this helps; please do not hesitate to ask further questions in case your container keeps failing.

Best Regards,

Maik

I have now improved the baseline so that it uses a GPU if one is available, and the CPU if not.

Basically, I added a new method use_cuda that returns True if a GPU is available and False otherwise, so that this can be passed to the ClassificationModel of the transformer baseline for task 1, e.g.:

import torch
from simpletransformers.classification import ClassificationModel

def use_cuda():
    # Use the GPU only if PyTorch actually sees at least one CUDA device.
    return torch.cuda.is_available() and torch.cuda.device_count() > 0

ClassificationModel('deberta', '/model', use_cuda=use_cuda())

This is the corresponding commit:

(I was not yet able to test this, because I am currently on a train with a bad internet connection, but as soon as I have a good connection I will test it and report back.)

Best Regards,

Maik

Dear all,

I have now had the time to test the commit above, and it works.
So please use the new version of the transformer baseline for task 1 as a starting point.

The baseline now uses a GPU if one is available and falls back to the CPU otherwise (and in both cases it uses additional CPU cores if available).

This makes it quite interesting to study the scaling behaviour (at least to some degree) by running the software on different resource specifications.

For instance, when I run the baseline in TIRA using 1 CPU, 10 GB of RAM, and 0 GPUs, its execution takes roughly 80 minutes.

The baseline utilizes more CPUs if available; with 4 CPUs, 10 GB of RAM, and 0 GPUs, the execution takes roughly 40 minutes.

As this baseline uses a transformer, using a GPU speeds things up quite substantially: with 1 CPU core, 10 GB of RAM, and 1 GPU, the execution takes roughly 1 minute.

I find such scalability observations quite interesting (even if they are not too advanced).

Best Regards,

Maik