Can response generation (RG) models read between the lines? Our #EMNLP2021 paper probes RG models to see whether they can identify common sense (CS) reasons: we annotate CS explanations in dialogues and evaluate RG models' CS reasoning capabilities.
We formalize CS as a *latent variable* that helps explain the observed variable "response" in the RG process, and instantiate CS using textual explanations of the response.
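As a rough sketch of what this means (the notation below is illustrative, not necessarily the paper's exact formulation), the latent CS explanation sits between the dialogue history and the response:

```latex
% Illustrative notation (not necessarily the paper's):
% h = dialogue history, r = observed response, z = latent CS explanation.
P(r \mid h) = \sum_{z} P(r \mid h, z)\, P(z \mid h)
```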
To collect annotations of CS explanations that justify dialogue responses, we first generate candidates with a large T5 model trained on GLUCOSE (@nasrinmmm et al.), a story explanation dataset. We then run a carefully designed two-stage human verification process.
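A minimal sketch of the candidate-generation step, assuming a HuggingFace-style T5 checkpoint fine-tuned on GLUCOSE; the checkpoint path and prompt format below are placeholders, not the paper's exact setup:

```python
# Generate candidate CS explanations with a T5 model (hypothetical checkpoint/prompt).
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "path/to/t5-large-finetuned-on-glucose"  # placeholder checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def generate_explanations(dialogue_history: str, response: str, n: int = 5):
    """Generate n candidate CS explanations linking the dialogue history to the response."""
    prompt = f"explain: {dialogue_history} response: {response}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        num_beams=n,
        num_return_sequences=n,
        max_length=64,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```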
To understand whether RG models comprehend implicit CS, we *corrupt* explanations to break their logical coherence or grammar, and compare model behavior given a valid explanation versus a corrupted one.
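A rough sketch of this corruption probe, assuming some scoring function for the probed RG model (`score_response` is a stand-in for whatever likelihood the model assigns, not a specific API):

```python
# Corrupt an explanation and check whether the RG model prefers the response
# given the valid explanation over the corrupted one.
import random

def corrupt_grammar(explanation: str, seed: int = 0) -> str:
    """Break grammaticality by shuffling the explanation's words."""
    words = explanation.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def corrupt_logic(unrelated_pool: list[str], seed: int = 0) -> str:
    """Break logical coherence by substituting an unrelated explanation."""
    return random.Random(seed).choice(unrelated_pool)

def prefers_valid(score_response, history: str, response: str,
                  valid_expl: str, corrupted_expl: str) -> bool:
    """True if the RG model scores the response higher with the valid explanation."""
    return (score_response(history, valid_expl, response)
            > score_response(history, corrupted_expl, response))
```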
We find that SOTA RG models fail to understand the CS that justifies proper responses, judging by their performance in our probing settings; some models cannot even distinguish gibberish sentences! Fine-tuning on in-domain dialogues and verified explanations does not help.