How will the answers be evaluated?

Hi, we have a question regarding evaluation.

Currently we only see the BLEU score on the leaderboard. Does that mean the final evaluation will also use only the BLEU score?

Thanks in advance!

Dear Hiroto,

Thank you for your question!

We will include more metrics in the final evaluation (e.g., the METEOR score and BERTScore).
The current evaluation on the validation dataset only includes BLEU because it is very fast to calculate (compared to BERTScore) and because BLEU correlates very well with METEOR and BERTScore (e.g., see Table 7 in the dataset paper).

Best regards,


Thanks for the quick response! Now we understand how the evaluation works :slight_smile:
