IR-Lab SoSe 2023: How to import your Dataset with ir_datasets

maik_froebe · April 30, 2023, 6:12am

This guide shows you how to organize an information retrieval shared task or experiment with ir_datasets in TIRA. If you want to organize a shared task using TIRA and have any questions or problems, please contact the TIRA admins or comment in this thread. We are happy to help.

Preparation

Please update the FROM clause of your Docker image from FROM webis/tira-ir-datasets-starter:0.0.47 to FROM webis/tira-ir-datasets-starter:0.0.54. Version 0.0.47 had a minor issue: search engine result pages could not be rendered if no relevance judgments (qrels) were available. This is fixed in version 0.0.54.

For the rest of the tutorial, we assume you have the following:

A local Docker image <YOUR-DOCKER-IMAGE> containing your ir_datasets installation (please ensure it starts with FROM webis/tira-ir-datasets-starter:0.0.54 as described above)
A successful local test of <YOUR-DOCKER-IMAGE> with tira-run
A TIRA account that is a member of two groups: tira_org_ir-lab-sose23 and tira_vm_ir-lab-sose-2023-<YOUR-TEAM>

To check that you are in the correct groups, go to your profile summary (click on your account → Profile → Summary).

On your profile, click “Expand” to show your groups. Both groups should be listed there:

Screenshot_20230430_080557

If you are not in the groups, please drop me a message.

Import your Dataset

Step 1: Navigate to https://www.tira.io/task/ir-lab-jena-leipzig-sose-2023

Step 2: Click on “Import Existing Dataset”

Step 3: Upload your docker image

If you have already uploaded your image to Docker Hub, you can skip this step.
If you do not have a Docker Hub account, you can follow the instructions to upload your image to your dedicated Docker registry in TIRA:

For instance, to upload the pangram docker image from the tutorial, the commands might look like this (you have to replace tira-user-ir-lab-sose-2023-tutors with your group):

docker tag pangram-ir-dataset registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-tutors/pangram-ir-dataset
docker login ...
docker push registry.webis.de/code-research/tira/tira-user-ir-lab-sose-2023-tutors/pangram-ir-dataset

which should look like this:

Step 4: Fill out the import form and upload

First, specify the name of your dataset and the Docker image in the form, then click on “Import IRDS Dataset”. The form might look like this:

After clicking on “Import IRDS Dataset”, you are redirected to a page that shows the progress of the import.
Please refresh this page regularly to see the progress (auto-refresh is on the to-do list :)). You are done as soon as the output is as expected, i.e., the return code of the import is 0 (success) and stdout/stderr look as you would expect.

A valid import might look like this:

If you spot an error, you can delete your dataset and recreate it from scratch.

Double-check that everything is correct

After you have imported your dataset as specified above, you can check that everything is as expected.

Step 1: Navigate to https://www.tira.io/task/ir-lab-jena-leipzig-sose-2023

Step 2: Select your dataset and go to its settings

Step 3: Download the data and verify it on your system

After clicking on the settings for your dataset, scroll to “Export Dataset” and click the link “Download input for Systems”:

This will download the dataset to your system so that you can verify it.
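
One way to verify the download is a small script rather than a manual look. The following is only a sketch under assumptions: it presumes the unpacked download contains the documents as a JSONL file (one JSON object per line) with the "doc_id" and "text" fields discussed later in this thread. The function name check_documents and any file name you pass to it are hypothetical, not part of TIRA or ir_datasets.

```python
import json
from pathlib import Path


def check_documents(path):
    """Return a list of problems found in a JSONL documents file.

    Assumption: every line is a JSON object with non-empty
    "doc_id" and "text" fields, and doc_ids are unique.
    """
    problems = []
    seen_ids = set()
    with Path(path).open(encoding="utf-8") as lines:
        for number, line in enumerate(lines, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                doc = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {number}: not valid JSON")
                continue
            if not isinstance(doc, dict):
                problems.append(f"line {number}: not a JSON object")
                continue
            for field in ("doc_id", "text"):
                if not doc.get(field):
                    problems.append(f"line {number}: missing or empty '{field}'")
            doc_id = doc.get("doc_id")
            if doc_id:
                key = str(doc_id)
                if key in seen_ids:
                    problems.append(f"line {number}: duplicate doc_id '{key}'")
                seen_ids.add(key)
    return problems
```

Calling check_documents on the path of your unpacked documents file returns an empty list if nothing suspicious was found, and otherwise one message per problem.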

If you spot an error, you can delete your dataset and recreate it from scratch.

willi-bit · April 30, 2023, 6:37pm

Dear @maik_froebe ,
sorry for disturbing you again.
The steps of this tutorial were very clear and helpful, but now we’re questioning what should actually be submitted.
Our task from the course said that we should submit a Docker image containing a Jupyter Notebook, but the tira-run command and this submission seem to require our changed pangram image to work.

Also, tagging it like you showed results in an error for us (denied: access forbidden). I hope uploading it as a personal Docker image works as well.

Thank you very much for the support and your time!

maik_froebe · May 1, 2023, 9:52am

Dear @willi-bit,

No worries, you do not “disturb”. We are happy that we have good participation in the course, and the more questions, the better.

Please submit a Docker image that contains the ir_datasets integration in a Python file (i.e., .py) and the Jupyter notebook. The Python file contains the actual logic (i.e., what the corpus looks like, where the data is located, in which format, etc.), and the Jupyter notebook is intended to hold “documentation” and examples. E.g., in the notebook, you can explain why you created the corpus the way you did, and you can show one or two example documents or example topics.

For example, you can add something like this in your notebook (which would print an example document to the output):

import ir_datasets
dataset = ir_datasets.load("iranthology-<YOUR-GROUP>")

example_document = dataset.docs_store().get('<SOME-DOCUMENT-ID>')
example_document

Or the following code would print all topics:

import ir_datasets
dataset = ir_datasets.load("iranthology-<YOUR-GROUP>")

for query in dataset.queries_iter():
    print(query)

Does this answer your questions?

For the access forbidden error: this sounds like your personalized docker login statement failed. Was your docker login successful, or did it fail? Uploading the image to a different registry like Docker Hub is fine as well, but it should also work with your dedicated registry in TIRA. Can you please send me the commands that you executed in a private direct message, so that I can look into the problem?

Thanks in advance and best regards,

Maik

fey · May 1, 2023, 10:59am

Hello Maik,

thank you for the information. We are a different group with a similar problem and have a follow up question on this:
You say that we are to submit a Python file and a Jupyter notebook. This is news to us, as according to the outline and to the tutorials we should only submit the notebook containing the various code, documentation, and reflection cells. Should the code in the Python file be different from the Jupyter notebook? Should this code not be commented? We must admit that we are a bit confused by this and would appreciate a clarification.

Thank you so much!

Best,

Fey

maik_froebe · May 1, 2023, 11:13am

Hi Fey,

The code in the Python file can be identical to the code in the Jupyter notebook.
In the end, the Python file is what gets used (as ir_datasets is a library that does not scan for Jupyter notebooks), but the Jupyter notebook can use the code from the Python file out of the box.

Maybe we can look at it this way: you put the registration of the dataset into the Python file (as in this tutorial), and in the notebook you just output the file, i.e., !cat <YOUR-FILE>, so that you have a “single source” for the definition. All other parts, i.e., all reflections, etc., then remain in the Jupyter notebook.

Does this make sense to you? (for me, this would be a good and reasonable way to submit)

Please do not hesitate to ask in case of further questions/problems.

Best regards,

Maik

maik_froebe · May 1, 2023, 11:16am

Sorry that there are some rough edges. We are doing this for the first time, but I think we have already learned a lot about what we can (and will) improve for the next iterations. I hope it is still “smooth” enough.

Best regards,

Maik

mkober · May 1, 2023, 2:13pm

Hello Maik,

as a fellow member of Fey’s group, thank you for the explanation. We have a few follow-up questions:

If I’m understanding correctly, the python file should contain both our data pre-processing steps (converting the ir-anthology-07-11-2021-ss23.jsonl into a new jsonl file with only the tags “doc_id” and “text”) and the registration of our thusly created dataset, correct?
We are struggling with the actual registration part. We realise that the pangrams.py file from the tutorial is meant to serve as a template, but we don’t understand how to adapt it to our purposes.
Adapting the class to our dataset is no problem (although we don’t fully understand why this class is needed, but that’s another matter). However, when it comes to the actual registration part, we are completely lost:

Is there any documentation for the ir_datasets.registry.register(...) step that we are missing? We don’t understand the parameters we need to pass, such as expected_md5 and doc_cls.
Is it necessary that we upload our jsonl and xml files to Github to do this step, or is there a way to register them from our local machines?
Do we need to create a Qrels file (the assignment does not mention it, but it is part of the tutorial), and if so, how do we do that?

Thank you and best regards
Mirjam

maik_froebe · May 1, 2023, 2:31pm

Dear Mirjam,

To answer your questions:

You only need to include the processed data, not the raw data. You can have more tags than “doc_id” and “text”, but those two should be there.
There is no real documentation for ir_datasets.registry.register(...), but I can help there. If you have problems, please send me a link to your git repository so that I can adjust it until it works (this is likely faster than writing it all down). You do not have to upload the jsonl or XML files to GitHub; you can load them from your local machine. You also don’t need qrels yet, you can just leave them out.
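
To make the pre-processing concrete, here is a minimal sketch of reducing raw records to the two required tags. It is only an illustration under assumptions: the raw field names "id", "title", and "abstract" are hypothetical placeholders (the real ir-anthology fields may differ), and the helper names to_processed_record and convert are made up for this example.

```python
import json


def to_processed_record(raw, id_field="id", text_fields=("title", "abstract")):
    """Map one raw record to a dict with the two required keys.

    id_field and text_fields are hypothetical names; adapt them
    to the keys that actually occur in your raw jsonl file.
    """
    # Concatenate all non-empty text fields into a single "text" value.
    parts = [str(raw[f]).strip() for f in text_fields if raw.get(f)]
    return {"doc_id": str(raw[id_field]), "text": " ".join(parts)}


def convert(raw_path, processed_path, **kwargs):
    """Write one processed JSON object per line; return the number written."""
    count = 0
    with open(raw_path, encoding="utf-8") as src, \
            open(processed_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines in the raw file
            record = to_processed_record(json.loads(line), **kwargs)
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Only the file written by convert would then need to go into the Docker image. Additional tags can be kept by extending the returned dict.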

In a different chat, I linked this example on how to add the files from your local system https://github.com/tira-io/ir-experiment-platform/tree/ir-datasets-use-local-files-no-qrels/ir-datasets/tutorial.
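
On the two parameters asked about above, a rough sketch may help, with the caveat that it rests on assumptions not confirmed in this thread: that doc_cls names the class each document line is parsed into, and that expected_md5 is the hexadecimal MD5 checksum of the data file's bytes. The names ExampleDocument, parse_document, and md5_of_file are hypothetical and use only the Python standard library, not ir_datasets itself.

```python
import hashlib
import json
from typing import NamedTuple


class ExampleDocument(NamedTuple):
    """Hypothetical document class: one field per key used from each jsonl line."""
    doc_id: str
    text: str


def parse_document(line):
    """Turn one jsonl line into an ExampleDocument, ignoring extra keys."""
    record = json.loads(line)
    return ExampleDocument(doc_id=str(record["doc_id"]), text=record["text"])


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Under these assumptions, the document class describes the shape of a single document, and the checksum lets a loader notice when a data file has changed or is corrupted.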

Maybe the easiest way forward is if you send me the link to your git repository in a private chat, and I help you to adjust it?

Best regards,

Maik

willi-bit · May 1, 2023, 8:58pm

Dear @maik_froebe ,

sorry, I wanted to write you a DM, but I can’t yet because of my account age.
Even now, we are not sure how to finally submit the image.
We have been struggling all day to find a way to add our notebook to our Python/tira-run image. A process like this is not mentioned anywhere, so we are clueless right now.
It might already be too late, but we want to know which crucial step we missed or didn’t see.
Thank you very much!

Best regards
Willi

maik_froebe · May 1, 2023, 10:03pm

Dear Willi,

Sorry that the process was not clear enough and that you struggled with this the whole day.
I looked into your submission, and everything looks perfect.
So the Jupyter notebook is there (just adding the notebook somewhere into the docker image is enough), it documents the steps and has the reflection, and also the ir_datasets integration works perfectly.

So everything is fine, you missed nothing.

We will incorporate what we learned from the submissions for the first milestone to make the process smoother and clearer for milestones 2 and 3, but I can say that you all did great so far!

Best regards,

Maik