Evaluation code for BERTScore and METEOR

@maik_froebe

Dear Maik,
We are currently preparing our system submission paper and trying to analyze our submitted systems.
We would like to re-evaluate some of our initial systems with metrics other than BLEU, such as BERTScore and METEOR. Would it be possible to provide us with the evaluation code (or Docker images) you used in the evaluation?

Thanks in advance!

Best,
Hiroto

Dear Hiroto,

Yes, this is no problem. The Docker image that I have used is webis/pan-clickbait-spoiling-evaluator:0.0.11.

If you want to look into the code or tests, you can find the Dockerfile here: pan-code/semeval23 at master · pan-webis-de/pan-code · GitHub, and the actual code shipped in the Docker image here: pan-code/clickbait-spoiling-eval.py at master · pan-webis-de/pan-code · GitHub
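
To fetch the image, pulling it from Docker Hub should be all that is needed; a minimal sketch using the tag above (the second command is just an optional local check):

docker pull webis/pan-clickbait-spoiling-evaluator:0.0.11
docker image ls webis/pan-clickbait-spoiling-evaluator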

Best Regards,

Maik


@maik_froebe
Dear Maik,

We ran the evaluation code to reproduce the METEOR and BERTScore results, but we ran into the following error. Could you help us fix this problem?

Thanks in advance!
Best,

Hiroto

  [o] The file /inputs/inferenced.json is in JSONL format.
  [o] The file /data/validation.jsonl is in JSONL format.
  [o] The file /data/validation.jsonl is in JSONL format.
  [o] Spoiler generations have correct format. Found 800
  [o] Spoiler generations have correct format. Found 800
/usr/local/lib/python3.6/site-packages/nltk/translate/bleu_score.py:552: UserWarning: 
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  warnings.warn(_msg)
/usr/local/lib/python3.6/site-packages/nltk/translate/bleu_score.py:552: UserWarning: 
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  warnings.warn(_msg)
/usr/local/lib/python3.6/site-packages/nltk/translate/bleu_score.py:552: UserWarning: 
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  warnings.warn(_msg)
ikumi-ito@ga02 ~/semeval-2023 % bash seq2seq/evalate.sh /work00/semeval2023/tosubmit/searchsteps-clickbait-flan-t5-large-seed43/checkpoint-1800
  [o] The file /inputs/inferenced.json is in JSONL format.
  [o] The file /data/validation.jsonl is in JSONL format.
  [o] The file /data/validation.jsonl is in JSONL format.
  [o] Spoiler generations have correct format. Found 800
Run evaluation for all-spoilers
  [o] Spoiler generations have correct format. Found 800
/opt/conda/lib/python3.7/site-packages/nltk/translate/bleu_score.py:552: UserWarning: 
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  warnings.warn(_msg)
/opt/conda/lib/python3.7/site-packages/nltk/translate/bleu_score.py:552: UserWarning: 
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  warnings.warn(_msg)
/opt/conda/lib/python3.7/site-packages/nltk/translate/bleu_score.py:552: UserWarning: 
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  warnings.warn(_msg)
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Exception in thread "main" java.lang.RuntimeException: Error: file not found (file:/data/paraphrase-en.gz)
        at edu.cmu.meteor.aligner.ParaphraseTransducer.<init>(Unknown Source)
        at edu.cmu.meteor.aligner.Aligner.setupModules(Unknown Source)
        at edu.cmu.meteor.aligner.Aligner.<init>(Unknown Source)
        at edu.cmu.meteor.scorer.MeteorScorer.loadConfiguration(Unknown Source)
        at edu.cmu.meteor.scorer.MeteorScorer.<init>(Unknown Source)
        at Meteor.main(Unknown Source)
Traceback (most recent call last):
  File "./clickbait-spoiling-eval.py", line 339, in <module>
    eval_task_2(input_run, ground_truth_classes, ground_truth_spoilers, args.output_prototext)
  File "./clickbait-spoiling-eval.py", line 320, in eval_task_2
    for k,v in create_protobuf_for_task_2(input_run, filtered_ground_truth_spoilers).items():
  File "./clickbait-spoiling-eval.py", line 305, in create_protobuf_for_task_2
    'meteor-score': meteor_score(y_true, y_pred),
  File "./clickbait-spoiling-eval.py", line 270, in meteor_score
    meteor_output = subprocess.check_output(cmd).decode('utf-8')
  File "/opt/conda/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/opt/conda/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['java', '-Xmx2G', '-jar', '/meteor-1.5.jar', '/tmp/tmp54d_mw93/truths.txt', '/tmp/tmp54d_mw93/preds.txt', '-l', 'en', '-norm', '-t', 'adq']' returned non-zero exit status 1.

Dear Hiroto,

Thank you for reaching out!
Have you used the Docker image that we have made available on Docker Hub?
This image pre-packages everything, e.g., also the /data/paraphrase-en.gz that seems to be missing in the stack trace. (If you want to use it without Docker, you can apply the setup steps from the Dockerfile to install all dependencies to the expected locations, but using the existing Docker image would be much simpler.)

The command then would be:

docker run \
            -v <the-directory-with-your-predictions>:/input \
            -v <the-directory-with-the-ground-truth>:/truth --rm -ti \
            webis/pan-clickbait-spoiling-evaluator:0.0.11 \
            --task <task> --ground_truth_spoilers /truth/<spoiler-ground-truth-file> --ground_truth_classes /truth/<spoiler-ground-truth-file>.jsonl --input_run /input/run.jsonl --output_prototext /input/evaluation.prototext
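
For example, with illustrative values filled in (the local paths and file names below are hypothetical, and the value for --task is an assumption based on the eval_task_2 function in your traceback; please check clickbait-spoiling-eval.py for the exact accepted values), a call for the spoiler-generation task could look like this:

docker run \
            -v /path/to/my-predictions:/input \
            -v /path/to/ground-truth:/truth --rm -ti \
            webis/pan-clickbait-spoiling-evaluator:0.0.11 \
            --task 2 --ground_truth_spoilers /truth/validation.jsonl --ground_truth_classes /truth/validation.jsonl --input_run /input/run.jsonl --output_prototext /input/evaluation.prototext

The resulting scores are then written to /input/evaluation.prototext.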

Thank you for doing additional evaluations!

Best regards,
Maik

I followed the commands you gave me and was able to do the evaluation successfully!

Your quick response was very helpful!
Thank you very much.

Hiroto

Super cool!

Thanks for reporting back!

Best regards,

Maik