We show that the representations generated by CodeGen models (350M to 16B) pretrained on source code, as well as by text-based CLMs smaller than GPT2-Large, suffer from anisotropy and poor discrimination.
We show that ContraCLM enhances both isotropy and discrimination, regardless of whether the original CLMs suffer from degenerated representations.
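For intuition, here is a minimal sketch of one common proxy for anisotropy: the average pairwise cosine similarity between representations of unrelated inputs. Values close to 1 indicate a degenerated, poorly discriminative embedding space. This is an illustrative assumption, not necessarily the exact metric used in the paper, and the pooling step and names below are hypothetical.

```python
import torch
import torch.nn.functional as F

def avg_pairwise_cosine(reps: torch.Tensor) -> float:
    """reps: (n, d) tensor of sequence-level representations."""
    reps = F.normalize(reps, dim=-1)   # unit-normalize each vector
    sims = reps @ reps.T               # (n, n) cosine-similarity matrix
    n = reps.size(0)
    off_diag = sims[~torch.eye(n, dtype=torch.bool)]  # drop self-similarities
    return off_diag.mean().item()

# Hypothetical usage with representations pooled from a CLM's last hidden states:
# reps = last_hidden_states.mean(dim=1)   # (batch, d)
# print(avg_pairwise_cosine(reps))        # near 1.0 => anisotropic, poorly discriminative
```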
ContraCLM attains a 44% relative improvement on Semantic Textual Similarity tasks and a 34% relative improvement on Code-to-Code Search tasks. Furthermore, ContraCLM boosts source code generation capability, with a 9% relative improvement in execution accuracy on the HumanEval benchmark.