Aspect Definition: Effort To Double-Check Responses

Rationale: Responses can contain hallucinated content. For responses that must be correctly grounded in external references, the effort/time users need to verify/double-check the generated content can vary widely. E.g., verifying Barack Obama's birth date takes me 10 seconds, but verifying whether an LLM hallucinated an event that never happened might take hours.
Unit of annotation: ‘turn’
Multi-label: yes
Required attributes: [list the attributes that are required of a conversation or turn so that it can be annotated according to the guidelines below]
Guidelines: TODO: [general instructions of what annotators should pay attention to when annotating a conversation].

  • [name of first label]: TODO: [instruction of when to choose the first label for a conversation or turn; may include a concise example]
  • [name of second label]: TODO: [instruction of when to choose the second label for a conversation or turn; may include a concise example]
  • [add more labels as needed]

Interesting! Maybe “Verifiability”? In that sense, would you think this aspect is improved when the response contains links to (trustworthy) sources?


Yes, makes perfect sense to me. I think links to resources would indeed help here. But it would also be interesting if one could construct examples where links would not help.

Maybe one could construct a counterexample, again via the existence/non-existence of something: for instance, a model that tries to convince users that some event did not happen because it is not listed on some linked web page. It would be really cool if one could construct a strong motivating example in this direction.

This might be related: Baldur Bjarnason: "Evaluating Verifiability in Generative Search En…" - Toot Café