Registering dataset into ir_datasets

ybrenning · April 28, 2023, 1:37pm

The final step on the assignment sheet for the first milestone is to register the dataset into ir_datasets (ir-experiment-platform/pangrams.py at main · tira-io/ir-experiment-platform · GitHub).

We are not sure how we are meant to do this. It seems to be implied that we need to replace the name “pangrams” with the name of our group in a certain format (iranthology-), but beyond that we are not sure how to complete the registration. We also don’t need to download anything from Git, because we already have the files locally. What can we use instead of ir_datasets.util.Download()?

maik_froebe · April 29, 2023, 2:15pm

Dear @ybrenning,

Thanks for the request.

You are right: in the file that you mentioned, you should replace this line

ir_datasets.registry.register('pangrams', Dataset(

with something along these lines:

ir_datasets.registry.register('iranthology-<YOUR-GROUP-NAME>', Dataset(

If you do not need to download the data, you can embed it into the Docker image.
To do so, adjust the corresponding line in the Dockerfile from:

COPY pangrams.py /usr/lib/python3.8/site-packages/ir_datasets/datasets_in_progress/

to something along these lines (this copies the qrels, topics, and documents into the container):

COPY pangrams.py pangram-qrels.txt pangram-topics.xml pangram-documents.jsonl /usr/lib/python3.8/site-packages/ir_datasets/datasets_in_progress/

In the ir_datasets configuration, you then no longer need ir_datasets.util.Download and can use ir_datasets.util.PackageDataFile instead. That is, if you copy the data into the Docker image as described above, the complete registration should look like this:

ir_datasets.registry.register('iranthology-<YOUR-GROUP-NAME>', Dataset(
    JsonlDocs(ir_datasets.util.PackageDataFile(path='datasets_in_progress/pangram-documents.jsonl'), doc_cls=PangramDocument, lang='en'),
    TrecXmlQueries(ir_datasets.util.PackageDataFile(path='datasets_in_progress/pangram-topics.xml'), lang='en'),
    TrecQrels(ir_datasets.util.PackageDataFile(path='datasets_in_progress/pangram-qrels.txt'), {0: 'Not Relevant', 1: 'Relevant'})
))
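
A quick way to check that the copied files ended up where the registration expects them is sketched below. It assumes that ir_datasets.util.PackageDataFile resolves its path relative to the installed ir_datasets package directory, which is what the COPY target above suggests. The helper missing_data_files and the list EXPECTED are hypothetical names for illustration and are not part of ir_datasets.

```python
from pathlib import Path
from typing import Iterable, List


def missing_data_files(package_dir: Path, relative_paths: Iterable[str]) -> List[str]:
    """Return the relative paths that do not exist as files below package_dir."""
    return [p for p in relative_paths if not (package_dir / p).is_file()]


# The same relative paths that are passed to PackageDataFile in the registration.
EXPECTED = [
    "datasets_in_progress/pangram-documents.jsonl",
    "datasets_in_progress/pangram-topics.xml",
    "datasets_in_progress/pangram-qrels.txt",
]

if __name__ == "__main__":
    # Assumption: the package directory is the parent of the COPY target
    # used in the Dockerfile; adjust it if your image differs.
    package_dir = Path("/usr/lib/python3.8/site-packages/ir_datasets")
    for path in missing_data_files(package_dir, EXPECTED):
        print(f"missing: {package_dir / path}")
```

Run inside the container, this should print nothing when all three files are in place. If you have no qrels yet, remove that entry from the list.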

Does this work for you?

Best regards,

Maik

willi-bit · April 29, 2023, 6:34pm

Hey @maik_froebe,
I’m not the original author of this question, but we encountered some problems when following the steps you described. With the ‘downloading method’ the tira-run command runs perfectly fine for us, but after switching to your method every run failed.
There was no error message; the process just failed outright without any additional information:
Task: Full-Rank → create files:
documents.jsonl
queries.jsonl
qrels.txt
at /tira-data/output/

Load Documents: 0it [00:00, ?it/s]
Load Documents: 610it [00:00, 6095.01it/s]
[...]
Load Documents: 53673it [00:03, 13915.08it/s]

maik_froebe · April 30, 2023, 4:15am

Dear @willi-bit,

Thank you for testing the described method.
To simplify the comparison, I created a new branch that uses ir_datasets.util.PackageDataFile instead of the Download method. It works with the tira-run command (at least on my machine xD).

These are my changes (the ones described above, but as a git diff), and this is the resulting file: https://github.com/tira-io/ir-experiment-platform/blob/ir-datasets-use-local-files/ir-datasets/tutorial/pangrams.py

Running the tira-run command yields the outputs that I expect and also finishes with an exit code of 0.

Maybe this comparison helps you?
Maybe the problem is that you do not have qrels yet (which would make sense)?
I will test this now without qrels and report back.
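
To rule out problems in the data files themselves, a small check along these lines might help. It is only a sketch that uses the Python standard library: the function names are hypothetical, it assumes the usual four-column TREC qrels layout (query ID, iteration, document ID, relevance), and it treats a missing qrels file as acceptable. It does not check the fields that your document class expects.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List


def check_jsonl(path: Path) -> List[str]:
    """Report lines of a JSONL file that are not valid JSON objects."""
    problems = []
    with path.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append(f"{path.name}:{number}: invalid JSON ({error.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"{path.name}:{number}: expected a JSON object")
    return problems


def check_xml(path: Path) -> List[str]:
    """Report whether the topics file is well-formed XML."""
    try:
        ET.parse(str(path))
    except ET.ParseError as error:
        return [f"{path.name}: not well-formed XML ({error})"]
    return []


def check_qrels(path: Path) -> List[str]:
    """Report qrels lines that do not have four columns with an integer label."""
    if not path.is_file():
        return []  # assumption: qrels are optional at this stage
    problems = []
    with path.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            if len(fields) != 4:
                problems.append(f"{path.name}:{number}: expected 4 columns, got {len(fields)}")
                continue
            try:
                int(fields[3])
            except ValueError:
                problems.append(f"{path.name}:{number}: relevance is not an integer")
    return problems
```

Each function returns a list of human-readable problems, so an empty list for all three files means that at least the basic file formats are fine.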

Best regards,

Maik

maik_froebe · April 30, 2023, 5:53am

Dear @willi-bit,

I have now also tried it without qrels, and that worked as well (in this branch).
Maybe the comparison to the branch helps you?

If not, can you please share a link to your repository so that I can have a look there?

Furthermore, the later steps of the tutorial had a problem with missing qrels: it was not possible to render the search engine result pages (SERPs).
I fixed this problem so that one can render SERPs even without qrels.
Can you please update the FROM clause of your Dockerfile from FROM webis/tira-ir-datasets-starter:0.0.47 to FROM webis/tira-ir-datasets-starter:0.0.54? Starting from the 0.0.54 image resolves the problem with rendering SERPs without qrels, and I updated the tutorial accordingly.

Thanks in advance and best regards,

Maik

maik_froebe · April 30, 2023, 7:28am

Dear all,

As the previous tutorial on how to import datasets to TIRA for milestone 1 was not specifically tailored to the IR lab, I created a new one for the IR lab that you can find here: https://www.tira.io/t/ir-lab-sose-2023-how-to-import-your-dataset-with-ir-datasets. Please use this new tutorial (the old one has the same content, but is a bit more high-level) and do not hesitate to ask or comment in the forum in case something is unclear.

Best regards,

Maik