Evaluation after upload to TIRA not working

As a test run, we decided to use the multi-field approach from the milestone 3 example locally, which worked and generated run.txt as expected. However, when we uploaded the image to TIRA and let it run (step 3 of milestone 3), the results show zeros for all effectiveness scores, despite the submission/run completing successfully:

{
  "P@10": "0.0",
  "RR": "0.0",
  "nDCG@10": "0.0"
}

Is this an error on our end or is there an issue with the effectiveness evaluation on the side of TIRA?
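
For reference, this is roughly how we evaluate run.txt locally (a sketch using the ir_measures package; the file names are assumptions):

# Sketch: recompute P@10, RR, and nDCG@10 from a TREC-style run and qrels file.
import ir_measures
from ir_measures import P, RR, nDCG

qrels = list(ir_measures.read_trec_qrels("qrels.txt"))
run = list(ir_measures.read_trec_run("run.txt"))

print(ir_measures.calc_aggregate([P @ 10, RR, nDCG @ 10], qrels, run))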

Dear @ybrenning,

Thanks for reaching out!
The run itself looks correct, but the effectiveness scores for all three measures are still 0.
I looked into the run and into the relevance judgments: all documents I checked are unjudged. The default assumption is that unjudged documents have a relevance of 0, so it looks like all documents retrieved by your approach are unjudged and are therefore treated as non-relevant.
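
One quick way to verify this on your side is to check how many of the retrieved documents are judged at all, e.g. with the Judged measure from ir_measures (a sketch; file names are assumptions):

import ir_measures
from ir_measures import Judged

qrels = list(ir_measures.read_trec_qrels("qrels.txt"))
run = list(ir_measures.read_trec_run("run.txt"))

# Fraction of judged documents among the top 10 results per query;
# a value close to 0 means the run retrieves almost only unjudged documents.
print(ir_measures.calc_aggregate([Judged @ 10], qrels, run))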

Could you please also try a BM25 baseline?
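
A minimal BM25 baseline in PyTerrier could look roughly like this (a sketch; the index location and the topics file/format are assumptions):

import pyterrier as pt

if not pt.started():
    pt.init()

# Assumes an existing Terrier index and a TREC-XML topics file (adjust paths/format).
topics = pt.io.read_topics("topics.xml", format="trecxml")
bm25 = pt.BatchRetrieve("./index", wmodel="BM25")
pt.io.write_results(bm25(topics), "run.txt", format="trec")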

Best regards,
Maik

So it might be that the multi-field approach retrieves very different documents than the approaches that we pooled. But it is also a bit odd that the reciprocal rank is 0, which means it never found a relevant document in the top 1000 results; that is rather unrealistic given that the corpus only contains ~50,000 documents. Have you rendered the results as SERPs? Do they look reasonable? Maybe there is an “off-by-one” error in the sense that it retrieves against the wrong queries?
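
One way to rule that out would be to compare the query IDs in your run with the query IDs in the qrels (a sketch in plain Python; file names are assumptions):

# If the two sets barely overlap (or are shifted by one), the run was
# produced against the wrong topics.
run_qids = {line.split()[0] for line in open("run.txt")}
qrel_qids = {line.split()[0] for line in open("qrels.txt")}

print("qids only in run:  ", sorted(run_qids - qrel_qids))
print("qids only in qrels:", sorted(qrel_qids - run_qids))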

Best regards,

Maik

Dear @ybrenning,

I looked into the results of the run, and they do not seem to be relevant for the query, e.g., for the query on “fake news detection”, the top 3 results in the run mentioned above are the following:

So it looks like it retrieved results for a different query.
Does it work on your system?
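
To spot-check this on your side, you could print the top results per query directly from the run file, e.g. with pandas (a sketch; the file name is an assumption, the columns follow the standard TREC run format):

import pandas as pd

# TREC run format: qid Q0 docno rank score tag
run = pd.read_csv("run.txt", sep=r"\s+",
                  names=["qid", "q0", "docno", "rank", "score", "tag"])

# Print the top 3 docnos per query and compare them with the query text.
for qid, group in run.groupby("qid"):
    print(qid, group.sort_values("rank").head(3)["docno"].tolist())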

Best regards,

Maik

We managed to figure out that this is because we opted for the tutor dataset instead of our own. Since the tutor dataset folder contains different topics than the ones we chose, it probably produced results for the wrong queries.

Does this mean that we should just use the topics provided in the tutor’s dataset? We were trying to use only the documents while keeping our own topics, since we weren’t sure whether we are allowed to use the tutor topics.

Hi,

You can use the tutor dataset; specifically, you can submit against the following datasets in TIRA:

  • iranthology-ir-lab-sose2023-information-retrievers-all-topics
  • iranthology-20230618-training

Everything should work with this, and you can tune/develop your approach on your dataset from milestone 1.
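
For tuning on your own data, a pt.Experiment over your topics and qrels is usually the easiest way to compare approaches (a sketch; the paths are assumptions, and DPH is only a stand-in for your own approaches):

import pyterrier as pt

if not pt.started():
    pt.init()

topics = pt.io.read_topics("topics.xml", format="trecxml")
qrels = pt.io.read_qrels("qrels.txt")

# Replace these with your own retrieval pipelines.
bm25 = pt.BatchRetrieve("./index", wmodel="BM25")
dph = pt.BatchRetrieve("./index", wmodel="DPH")

print(pt.Experiment(
    [bm25, dph],
    topics,
    qrels,
    eval_metrics=["P_10", "recip_rank", "ndcg_cut_10"],
    names=["BM25", "DPH"],
))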

Best regards,

Maik

We also tried using the tutor’s dataset and ran into the same problem. Could you provide the qrels that go with the tutor dataset so we can see whether or not our approach has improved?

Dear all,

We have not created qrels for the tutor’s dataset.
The tutor’s dataset was only intended to showcase the structure, with one judgment per query, which is not enough for development.

Please use your own topics and qrels for development.
You can (as mentioned by @ybrenning above) use the documents of the tutor’s dataset; in that case, you still have to use your own topics and qrels.
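
If it helps, replacing the documents can be as simple as re-indexing the tutor’s document file while keeping your own topics and qrels (a sketch with PyTerrier; the file name and field names are assumptions about the document format):

import json
import os
import pyterrier as pt

if not pt.started():
    pt.init()

# Index the tutor's documents (assumed to be JSONL with docno/text fields),
# then evaluate against your own topics and qrels as before.
def docs_iter(path="documents.jsonl"):
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            yield {"docno": doc["docno"], "text": doc["text"]}

index_ref = pt.IterDictIndexer(os.path.abspath("./tutor-index")).index(docs_iter())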

If you have problems with replacing the documents, please send me a link to your git repo in a private message, I can have a look there.

Best regards,

Maik

Okay, so this newly registered dataset iranthology-ir-lab-sose2023-information-retrievers-all-topics contains the tutor’s document structure along with our topics?

Or are we meant to register a new dataset if that is what we want to do? The tutor dataset obviously has the tutor topics, and our previous dataset has a different document structure than the one we want to use, so as far as I understand, the retrieval approaches we have written wouldn’t really work on either one.

Sorry that this was not clear.

  • iranthology-ir-lab-sose2023-information-retrievers-all-topics contains your documents with all topics
  • iranthology-20230618-training contains the tutor’s documents with all topics

So you don’t need to register a new dataset; everything should already be there. If you have developed your retrieval approach against the document structure of the tutor’s dataset, you should use iranthology-20230618-training; if you have developed it against the document structure of your dataset from milestone 1, you should use iranthology-ir-lab-sose2023-information-retrievers-all-topics.

So both datasets use the same topics but different document representations.

Best regards,

Maik

Thanks for the clarification. The retrieval approaches we have been writing were all based on the tutor’s document structure, since we thought that combining that document structure with our topics would be a lot easier than it’s turning out to be…

Anyway, we noticed that whenever we try to run the container on a certain dataset within TIRA, the execution takes really long and usually ends up not finishing the evaluation; the run is then displayed in a list at the bottom saying “This run has not been reviewed yet. A task organizer will check each run and review its validity”.

Does this mean every run we try to test in TIRA has to be manually reviewed before we actually see the scores we achieved?

We have developed a few retrieval approaches, but because of all our issues with uploading and running within TIRA, we haven’t been able to get anything onto the scoreboards so far…

Yes, we manually review the submissions, but this usually does not take long; in most cases not more than one day. Even when a run has not been reviewed yet, you can still see the outputs and the scores by clicking on Evaluation or Inspect, respectively.

I started your three new submissions on the iranthology-20230618-training dataset. With this, everything should be perfect, and they should then also appear on the leaderboard.

(I will report back when they are finished, but it should not take too long.)

Thanks and best regards,

Maik

Okay, thanks for the explanation again.

We still aren’t really sure why the dataset names in the table are different from the ones we select when we create a run:

Does iranthology-tutors-20230502-training correspond to the dataset iranthology-tutors? If so, what dataset does the second one correspond to? Would that just be iranthology?

Thanks again for the help so far!

Yannick

Yes, you are right, this is something that we have to improve in the UI. The table there adds a suffix to the dataset name containing the version of the dataset (i.e., the timestamp and the -training).

The executions were now successful, so everything looks perfect :slight_smile:

Best regards,

Maik

Okay, thanks again!

The runs using the tutor’s dataset still output only zeros for us. My guess is that they are still running on the tutor’s topics and then comparing those results with the qrels from our topics? We’re still testing it with some of our different approaches, but the results don’t seem to be changing much, even though the retrieval approaches seem to be working locally and producing different scores for us. :man_shrugging:

Yes, the iranthology-tutors dataset is the one I meant that has no relevance judgments, so the scores are zero.

I think it is plausible that the scores change more on the small training/development dataset and less on the larger test set. You can also look into the outputs or resulting SERPs to verify that everything is as intended (or maybe you uploaded the wrong software? But from the evaluation, everything looks good).

(Also, I see that the scores vary quite a bit: one of your submissions has an nDCG@10 of 0.48, another 0.51, and another 0.55, so that is a substantial difference.)

Best regards,

Maik