MD5 Mismatch for PAN 2025 Dataset on Zenodo

OjaswaVarshney · May 23, 2025, 5:17pm

Hi TIRA team,

I’m trying to submit to the PAN 2025 Task 1 (generative-ai-authorship-verification-panclef-2025), but the CLI is failing with:

MD5 is unexpected: I expected "fd12cbb06a882276278655acc949b91d" but got "237de7f2289e34646935c31788d450ad"

This happens on both regular and --dry-run submissions. Could you please update the expected hash or advise?
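For anyone who wants to check a downloaded archive against the hash printed in such a message, the comparison can be sketched in plain Python. This is only an illustration that uses the standard `hashlib` module. The helper names `md5_of_file` and `matches_expected` are made up for this sketch and are not part of the TIRA CLI, and the way the CLI computes its own checksum is not shown in this thread.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare a file's MD5 with an expected hex digest (hypothetical helper)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Calling `matches_expected("downloaded.zip", "fd12cbb06a882276278655acc949b91d")` with the path of a local download (the file name here is a placeholder) would return `False` whenever the file differs from what was expected, for example when an access-restricted download returned something other than the dataset.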

maik_froebe · May 23, 2025, 5:27pm

Dear @OjaswaVarshney ,

thanks for reaching out!

I think what you describe happens when the --dataset pan25-generative-ai-detection-smoke-test-20250428-training argument is missing when submitting to --task generative-ai-authorship-verification-panclef-2025.

The command here is correct: pan-code/clef25/generative-authorship-verification/pan25_genai_baselines at master · pan-webis-de/pan-code · GitHub

(If the --dataset pan25-generative-ai-detection-smoke-test-20250428-training argument is missing, the CLI tries to run the evaluation on the validation data from Zenodo, but access to that data is restricted.)

Does this resolve your problem?

Best regards,

Maik

maik_froebe · May 24, 2025, 6:08am

For all who encounter the same problem, here is a Colab notebook that shows how to load the data via the dataset ID above.

https://colab.research.google.com/drive/12_pGh02ToXvLaFPIuO8np7UfnyJNu0Be?usp=sharing

Best regards,

Maik

OjaswaVarshney · May 25, 2025, 8:19am

Hi,
I'm still getting errors.
I've attached the error for your reference.

maik_froebe · May 25, 2025, 8:40am

Hi,

the error message Resource stopwords not found indicates that the stopwords are likely not installed in the Docker image.
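A minimal sketch of one way to make sure the resource is present, assuming the message comes from NLTK (the thread does not show the full traceback, so this is an assumption). The helper name `ensure_nltk_resource` is hypothetical. It only relies on `nltk.data.find`, which raises `LookupError` for a missing resource, and on `nltk.download`.

```python
def ensure_nltk_resource(name="stopwords", category="corpora"):
    """Download an NLTK resource if it is not already installed.

    Assumption: the missing resource is an NLTK corpus such as "stopwords".
    """
    import nltk  # third-party package; must be installed in the image

    try:
        # Raises LookupError when the resource is not on NLTK's data path.
        nltk.data.find(f"{category}/{name}")
        return True
    except LookupError:
        # Needs network access, so this should run while the image is built.
        return nltk.download(name)
```

Calling `ensure_nltk_resource()` during the image build rather than at run time would matter if the submitted software is executed without internet access, because the download could not happen then.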

Could you please invite me to your GitHub repository (my account name is mam10eks)? Then I can help finalize the submission.

Thanks in advance!

Best regards,

Maik

OjaswaVarshney · May 25, 2025, 9:23am

Hi,
I've sent you an invitation to my repo.

maik_froebe · May 25, 2025, 9:27am

Awesome, thanks! I will look into this and report back soon.