As part of our information retrieval exercise, we are currently building an ir_datasets integration for the IR-Anthology so that we can run retrieval experiments with TIRA.
We are following this tutorial: https://github.com/tira-io/ir-experiment-platform/tree/main/ir-datasets/tutorial.
We are not sure how we should make the data (our corpus, topics, etc.) available in the integration.
Should we upload the files to our own Git repository so that TIRA can download them from there during registration, or should we integrate them into the official ir_datasets Git repository?
Thanks for asking; I think others will have the same question.
In the long run, I think it is a good goal to have as many datasets as possible integrated into the official ir_datasets Git repository.
However, we should not push “work in progress” into the official repository, only mature code and datasets.
Hence, I think it would be best to keep everything in your own Git repository for the moment.
As soon as everything is mature, e.g., at the end of the semester (once we have learned what works and what does not, and once everything is ready, including relevance judgments), we can integrate everything into the official ir_datasets repository together. We would of course be very happy to assist you with that, and I think it would be very valuable for others as well.
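In case it helps while the files live in your own repository, here is a minimal sketch of how the integration could fetch them from there. It uses only the Python standard library and is not the tutorial's code or part of the ir_datasets or TIRA API. The `BASE_URL`, the `fetch` helper, the `data-cache` directory, and the file name in the usage note are all hypothetical placeholders that you would replace with your own.

```python
import hashlib
import urllib.request
from pathlib import Path

# Hypothetical location of the raw files in your own Git repository.
BASE_URL = "https://example.org/your-group/ir-anthology-data/raw/main/"


def fetch(file_name, expected_md5=None, base_url=BASE_URL, cache_dir="data-cache"):
    """Download file_name from base_url into cache_dir unless it is already cached.

    If expected_md5 is given, the cached file is checked against it so that a
    changed or corrupted file is noticed before it is used.
    """
    target = Path(cache_dir) / file_name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(base_url + file_name) as response:
            target.write_bytes(response.read())
    if expected_md5 is not None:
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        if actual != expected_md5:
            raise ValueError(f"Checksum mismatch for {file_name}: got {actual}")
    return target
```

A call such as `fetch("topics.xml")` would then return the local path of the cached file, which your dataset code can read. Pinning a checksum per file is optional, but it makes it easier to move the same files into the official repository later without silently changing the data.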
Does this answer your question?
Please do not hesitate to ask further questions if anything is unclear.