How will the answers be evaluated?

Currently we only see the BLEU score on the leaderboard. Does that mean the final evaluation will also use only the BLEU score?

We will include more metrics in the final evaluation (e.g., the METEOR score and BERTScore).
The current evaluation on the validation dataset only includes BLEU because it is very fast to calculate (compared to BERTScore) and because BLEU correlates very well with METEOR and BERTScore (e.g., see Table 7 in the dataset paper).
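
If you want a rough feel for how BLEU behaves on your own predictions, here is a minimal sketch of a sentence-level BLEU in plain Python. It is only an illustration and not the scoring script used for the leaderboard: the function names `ngrams` and `simple_bleu` are made up for this example, and it assumes whitespace tokenization, a single reference, and no smoothing, so your leaderboard numbers will most likely differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU in [0, 1].

    Hypothetical helper for illustration only (whitespace tokens, no smoothing).
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty n-gram order gives 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(simple_bleu("the cat sat on the mat", "the cat sat on a mat"))
```

Everything here is n-gram counting, which is why BLEU is so cheap to compute; BERTScore, in contrast, needs a forward pass through a neural model for every answer.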

Thanks for the quick response! Now we understand how the evaluation works :slight_smile:

