Issues submitting to TIRA for retrieval-augmented debating

Hello,

I’m curious if the following error is a temporary issue on TIRA’s end.

tira-cli code-submission \
  --path . \
  --task retrieval-augmented-debating-2025 \
  --dataset rad25-2025-01-16-toy-20250116-training \
  --command '/genirsim/run.sh --configuration-file=$inputDataset/*.json --parameter-file=$inputDataset/*.tsv --output-file=$outputDir/simulations.jsonl' \
  --allow-network

I get a large traceback, but the most important piece is the following:

✓ The docker image produced valid outputs on the dataset rad25-2025-01-16-toy-20250116-training.
✓ The meta data is uploaded to TIRA.
Push Docker image to TIRA…
Push Docker image
WARNING:root:Error occured while fetching /api/task/retrieval-augmented-debating-2025/user/xxxxxxxx . Code: 500. I will sleep 9 seconds and continue.
Traceback (most recent call last):
File "/home/anthony/.local/share/uv/tools/tira/lib/python3.12/site-packages/tira/rest_api_client.py", line 1234, in json_response
raise ValueError(f"Got statuscode {resp.status_code} for {endpoint}. Got {resp}")
ValueError: Got statuscode 500 for /api/task/retrieval-augmented-debating-2025/user/xxxxxxxx . Got <Response [500]>
WARNING:root:Error occured while fetching /api/task/retrieval-augmented-debating-2025/user/xxxxxxxx . Code: 500. I will sleep 4 seconds and continue.
Traceback (most recent call last):
File "/home/anthony/.local/share/uv/tools/tira/lib/python3.12/site-packages/tira/rest_api_client.py", line 1234, in json_response
raise ValueError(f"Got statuscode {resp.status_code} for {endpoint}. Got {resp}")

When I try to make a Docker submission in the TIRA UI, I also see a status that says “please try again” with a 500 internal error:

Problem While Loading the Docker Images.

This might be a short-term hiccup, please try again. We got the following error: Error: Error fetching endpoint: TIRA with 500

Is there anything I can do from my end, or will I just need to wait until things are back online?

Thanks,
Anthony

Dear Anthony,

Thanks for participating in the shared task and for reaching out, and sorry for the inconvenience!

I had a look, and this was a hiccup connected to your account: the token that the Docker registry created for your account was invalid, which caused the 500 response.

I refreshed the token and ensured that the token for your account is now valid by submitting the baseline. This worked as expected, so I think everything should be fine now.

Best regards,

Maik

Thank you! I appreciate it. I’ve been able to get further along in the process and got a code submission through that copies the official baseline.

Hopefully our team will be able to get something in by the competition deadline.

Awesome, thanks!

And please note that it is no problem if it takes a bit longer: we can extend the deadline on an individual basis, and I can also help in case your team runs into problems. Usually we are able to get all submissions through :slight_smile:

Best regards,

Maik


Hi @maik_froebe, I wanted to confirm the nature of the submissions. I am trying to avoid leaking my API keys into the Docker container, so I have the container reverse-proxy to a service running on a personal machine that is exposed to the internet, and I only leave that service up for the duration of the submission run on TIRA. Am I understanding correctly that the run on the toy dataset is the only thing that needs to be done here for task A?
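To make that concrete, the container itself only ever sees a public URL and never the key. Roughly (a minimal sketch with a placeholder hostname and endpoint, not our actual setup):

# Inside the submitted Docker image: no API key is baked into the image or
# passed via environment variables. The hostname below is a placeholder for
# the personal machine that is only reachable while the TIRA run is active;
# the proxy running there injects the real API key before forwarding the
# request to the upstream model provider.
import requests

PROXY_URL = "https://personal-machine.example.org/generate"  # placeholder

def call_model(payload: dict) -> dict:
    response = requests.post(PROXY_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()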

Hi, could you please (in a private message) send me the URL/IP of this service? So that I can unblock it?

Thanks in advance and best regards,

Maik

I’m not sure how to initiate a DM here. Do you mind starting a chain, or should I send an email instead?


Message is out :slight_smile:

I appreciate the super responsive help!


Hi, a short question: what model would you access behind the API? If it is an open Hugging Face model, we could maybe also run it locally in the submission.

Best regards,

Maik

Hello, I had a few issues when trying to submit our models to task 2 (evaluation). I tried to do these submissions from the same instances I submitted for task 1 (debate), since our Docker image supports both the debate and evaluation endpoints.

[0 more lines]
Traceback (most recent call last):
  File "/evaluate.py", line 77, in <module>
    raise OSError("Mismatch in number of simulations; run:" + str(len(runs)) + " ; labels: " + str(len(labels)))
OSError: Mismatch in number of simulations; run:0 ; labels: 2

I’m curious if you have more information about this. The specific run for this error is 2025-05-24-09-53-33-evaluated-run-2025-05-24-09-51-00. I have another run for the training dataset with a similar error, id 2025-05-24-09-50-55-evaluated-run-2025-05-24-09-49-10.

By the way, @acmiyaguchi seems to have hit his reply limit for the day.


I’ll look into this :slight_smile:


Ah, alright, I think I now understand the problem.

The produced simulations.jsonl is empty.
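That also explains the exact numbers in the error: the evaluator reads the run file line by line and compares the count against the labels, roughly like this (a simplified sketch reconstructed from the traceback you posted, not the actual evaluate.py):

# Simplified sketch of the failing check; names are illustrative.
with open("simulations.jsonl") as run_file:
    runs = [line for line in run_file if line.strip() != ""]  # empty file -> []

labels = ["label-1", "label-2"]  # placeholders: two labels, as in your error message

if len(runs) != len(labels):
    # An empty simulations.jsonl therefore reports "run:0 ; labels: 2".
    raise OSError("Mismatch in number of simulations; run:" + str(len(runs))
                  + " ; labels: " + str(len(labels)))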

This is likely caused by the inputs for task 1 and task 2 being different.
For that reason, I think it is difficult for the same submission (for me, a submission is an image + command) to run on both tasks.
The same Docker image can be used for task 1 and task 2, but the command that is executed in the Docker image must be different, right?

For instance, in the baseline, this is distinguished via the /genirsim/run.sh --evaluate-run-file=... vs. /genirsim/run.sh --configuration-file=... flags.

Would such a flag (i.e., separating it into two submissions) solve the problem?

Best regards,

Maik

Thanks for the quick reply!

We already have a separate script set up for the evaluation run, but I thought it would have been possible to do the evaluation through the UI by reusing the debate submission. We can make the submissions using the --evaluate-run-file flag and see how that goes.

@acmiyaguchi


Yes, we need to do it that way. The UI currently only allows running a submitted software on pre-defined datasets.

But our goal is definitely to create new datasets from the generations of task 1 and feed them to the models for task 2; we need to create those datasets manually, though (i.e., they are not yet ready; I think this will take a few more days, also to combine all submitted approaches).


Thanks for all of your help!

We were able to resolve that previous issue, but we are now facing some pydantic validation issues and are trying to figure out what’s happening on our end. Here is the error we are getting on our submissions:

[0 more lines]
Traceback (most recent call last):
  File "/evaluate.py", line 51, in <module>
    runs = [DebateEvaluations.model_validate_json(line).userTurnsEvaluations for line in f if line != ""]
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/usr/local/lib/python3.13/site-packages/pydantic/main.py", line 744, in model_validate_json
    return cls.__pydantic_validator__.validate_json(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        json_data, strict=strict, context=context, by_alias=by_alias, by_name=by_name
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
pydantic_core._pydantic_core.ValidationError: 1 validation error for DebateEvaluations
userTurnsEvaluations
  Field required [type=missing, input_value={'simulation': {'userTurn...ng at morality.'}]}}}]}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/missing

Thanks again for all of your help, it is much appreciated!

Conor

I’m not sure about the flow at the moment, but evaluate.py assumes that "userTurnsEvaluations" is at the top level, whereas in your case it seems that "simulation" is at the top level (which then contains "userTurnsEvaluations").
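To illustrate the difference, here is a minimal pydantic v2 sketch that uses only the field names visible in your error message (the real model in evaluate.py likely defines more structure):

from typing import Any
from pydantic import BaseModel, ValidationError

class DebateEvaluations(BaseModel):
    # Abbreviated: only the field relevant to the error is modelled here.
    userTurnsEvaluations: list[Any]

# A line shaped the way evaluate.py expects parses fine:
DebateEvaluations.model_validate_json('{"userTurnsEvaluations": []}')

# A line with everything wrapped under "simulation" raises exactly the
# "Field required" error from your traceback:
try:
    DebateEvaluations.model_validate_json('{"simulation": {"userTurnsEvaluations": []}}')
except ValidationError as error:
    print(error)  # userTurnsEvaluations -> Field required [type=missing, ...]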

But, as Maik said, we will do this for the evaluation later. Once everyone has submitted successfully (some time next week), we will combine everything for blind human evaluation (so that we do not know who submitted what), and after that we will run everything through the sub-task 2 approaches to see how they align.

But thanks a lot for the enthusiasm :wink:


Appreciate all the help. Our system seems to work fine within GenIRSim and follows some of the latest documentation we’ve seen, but we’re happy to adjust interfaces if need be.
