Malformed Entry in the Training Dataset

Hi,

I think I have found a malformed entry in the training dataset (training.jsonl).

The entry with the uuid “ad9271b7-9983-42f5-9bd9-fdfcb171ddaa” is classified as “passage”, but its spoiler is constructed from two paragraphs. Also, “spoilerPositions” contains only a single position, in one paragraph, and its character index is higher than the character count of that paragraph.
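
For anyone who wants to look for similar entries, here is a minimal sketch of such a check. It assumes that each span in “spoilerPositions” is normally a pair of positions (start and end) and that each position has the form [paragraph index, character offset] into “targetParagraphs”. Both are assumptions read off the instance in this thread, not a documented specification, and the function names are hypothetical.

```python
import json


def find_suspicious_spans(entry):
    """Return human-readable descriptions of odd-looking spoilerPositions."""
    problems = []
    paragraphs = entry.get("targetParagraphs", [])
    for span_idx, span in enumerate(entry.get("spoilerPositions", [])):
        # Assumption: a complete span holds exactly two positions (start, end).
        if len(span) != 2:
            problems.append(
                f"span {span_idx}: expected 2 positions, found {len(span)}"
            )
        for par_idx, char_idx in span:
            # Assumption: a position is [paragraph_index, character_offset].
            if not 0 <= par_idx < len(paragraphs):
                continue  # position does not point into targetParagraphs
            par_len = len(paragraphs[par_idx])
            if char_idx > par_len:
                problems.append(
                    f"span {span_idx}: offset {char_idx} exceeds length "
                    f"{par_len} of paragraph {par_idx}"
                )
    return problems


def scan_file(path):
    """Yield (uuid, problems) for every flagged entry in a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            entry = json.loads(line)
            problems = find_suspicious_spans(entry)
            if problems:
                yield entry.get("uuid"), problems
```

Calling `scan_file("training.jsonl")` and printing the results would list every entry that trips either check; whether a flagged entry is actually wrong still needs a manual look.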

Dear Tom,

Thank you for pointing this out.
For reference, I include the corresponding instance from the training data here:

{“uuid”: “ad9271b7-9983-42f5-9bd9-fdfcb171ddaa”, “postId”: “4u446t”, “postText”: [“What Happens When You Get a Perfect Score on Pac-Man Will Blow Your Mind”], “postPlatform”: “reddit”, “targetParagraphs”: [“GIF”, “Billy Mitchell takes video games very seriously. He’s been called one of the greatest arcade gamers ever, and there’s even a documentary about his insane high score in Donkey Kong called The King of Kong: A Fistful of Quarters.”, “Even with his own full-length documentary film, Mitchell is still best-known for one thing: He was the first person to get a perfect score (3,333,360 points) in Pac-Man. Mitchell set the record in 1999, almost two decades after the game was originally released.”, “"You start off on the first board, and on board 21, that’s where it reaches the maximum difficulty," Mitchell said in a recent interview about the achievement. "You have to navigate all the way to board 255 doing the same repetitive thing. You can’t miss a dot, a prize, a blue man. You can’t die once."”, “In the video, Mitchell explains how, at level 256, there’s only enough memory in the game for the left-half of the board. The right-half of the game is filled with computer garble that looks something like the Matrix code.”, “"You get to the end, and there’s nothing to do but die."”], “targetTitle”: “What Happens When You Get a Perfect Score on Pac-Man Will Blow Your Mind”, “targetDescription”: “Billy Mitchell takes video games very seriously. He’s been called one of the greatest arcade gamers ever, and there’s even a documentary about his insane high score in Donkey Kong called The King of Kong: A Fistful of Quarters.”, “targetKeywords”: “Pac-Man, Arcade, Billy Mitchell, Sploid”, “provenance”: {“source”: “anonymized”, “humanSpoiler”: “The game runs out of memory and you die”, “spoilerPublisher”: “savedyouaclick”}, “spoiler”: [“at level 256, there’s only enough memory in the game for the left-half of the board. 
The right-half of the game is filled with computer garble that looks something like the Matrix code. "You get to the end, and there’s nothing to do but die."”], “spoilerPositions”: [[[4, 836]]], “tags”: [“passage”]}

I think the label “passage” is correct: we labeled every stretch of consecutive text longer than 5 words as a passage, and that holds for this example (it spans two paragraphs, but the text is still consecutive).
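
The rule described above could be sketched roughly as follows. This is only an illustration: the function name `is_passage` is hypothetical, words are approximated by splitting on whitespace, and the actual annotation may have counted words differently.

```python
def is_passage(spoiler_text, min_words=5):
    """Return True if the consecutive spoiler text is longer than min_words words."""
    # Assumption: whitespace splitting is a good enough word count here.
    return len(spoiler_text.split()) > min_words
```

Under this sketch, a spoiler that continues across a paragraph boundary still counts as one consecutive text, so only its total word count matters.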

The “spoilerPositions” entry does not include an end position because the spoiler ends at the end of the document.

You are right that this is misleading, and we will release a new version of the dataset that improves this. I guess it is not too urgent for you, though, since you can continue working?
That way we can ship multiple minor improvements together when we release the next version on Zenodo.

Best Regards,

Maik

Dear Maik,

Thank you for your clarification.

It is not urgent and we can wait for the next release.

Best Regards,

Tom