Here is a summary of the relative performance of five notable models, including Alpaca and ChatGPT. We use GPT-4 to generate a set of challenging questions and then ask it to assess the chatbots’ responses.
*DISCLAIMER: This is a fun and non-scientific experiment with GPT-4.
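To make the judging step concrete, here is a minimal sketch of how one pairwise comparison might be wired up. It is an illustration under stated assumptions, not the prompt or code actually used: the template wording, the 1 to 10 scale, the "two scores on the first line" reply format, and the names `build_judge_prompt`, `parse_scores`, `compare`, and `stub_judge` are all hypothetical. The judge is passed in as a plain callable so that no particular API client is assumed.

```python
import re
from typing import Callable, Tuple

# Hypothetical judge prompt. The wording, the 1-10 scale, and the reply
# format are assumptions made for this sketch.
JUDGE_TEMPLATE = (
    "Question: {question}\n\n"
    "Assistant 1: {answer_1}\n\n"
    "Assistant 2: {answer_2}\n\n"
    "Rate each assistant from 1 to 10 for helpfulness, relevance, accuracy, "
    "and level of detail. Put the two scores on the first line, separated by "
    "a space, then explain your reasoning."
)


def build_judge_prompt(question: str, answer_1: str, answer_2: str) -> str:
    """Fill the template with one question and the two answers to compare."""
    return JUDGE_TEMPLATE.format(
        question=question, answer_1=answer_1, answer_2=answer_2
    )


def parse_scores(review: str) -> Tuple[float, float]:
    """Read the two numeric scores from the first line of the judge's reply."""
    lines = review.strip().splitlines()
    if not lines:
        raise ValueError("empty review")
    numbers = re.findall(r"\d+(?:\.\d+)?", lines[0])
    if len(numbers) < 2:
        raise ValueError(f"expected two scores, got: {lines[0]!r}")
    return float(numbers[0]), float(numbers[1])


def compare(
    question: str,
    answer_1: str,
    answer_2: str,
    judge: Callable[[str], str],
) -> Tuple[float, float]:
    """Ask the judge to score both answers and return the parsed scores."""
    return parse_scores(judge(build_judge_prompt(question, answer_1, answer_2)))


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same canned review."""
    return "8 9\nAssistant 2 gave a more detailed answer."


if __name__ == "__main__":
    scores = compare("What is 2 + 2?", "4", "2 + 2 = 4.", stub_judge)
    print(scores)
```

In practice `stub_judge` would be replaced by a function that sends the prompt to the judging model and returns its text reply. Keeping the scores on a fixed first line makes parsing simple, and a reply that does not follow the format raises an error instead of being silently mis-scored.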
Through careful prompt engineering, GPT-4 can evaluate response quality accurately in most cases, as shown in the example below.