Aspect Definition: Effort To Double-Check Responses

Rationale: Responses can contain hallucinated content. For responses that must be correctly grounded in external references, the effort/time users need to verify/double-check the generated content can vary widely. E.g., verifying Barack Obama's birth date takes me 10 seconds, but verifying whether an LLM hallucinated an event that never happened might take hours.
Unit of annotation: ‘turn’
Multi-label: yes
Required attributes: [list the attributes that are required of a conversation or turn so that it can be annotated according to the guidelines below]
Guidelines: TODO: [general instructions of what annotators should pay attention to when annotating a conversation].

  • [name of first label]: TODO: [instruction of when to choose the first label for a conversation or turn; may include a concise example]
  • [name of second label]: TODO: [instruction of when to choose the second label for a conversation or turn; may include a concise example]
  • [add more labels as needed]

Interesting! Maybe “Verifiability”? In that sense, would you think this aspect is improved when the response contains links to (trustworthy) sources?


Yes, makes perfect sense to me. I think links to resources would indeed help here. But it would also be interesting if one could construct examples where links would not help.

Maybe one could construct a counterexample, again via the existence/non-existence of something: for instance, a model that tries to convince users that some event did not happen because it is not listed on some linked web page. It would be really cool if one could construct a strong motivating example in this direction.

This might be related: Baldur Bjarnason: "Evaluating Verifiability in Generative Search En…" - Toot Café