A team from IIIT Delhi published a paper on the performance of several LLMs at generating unit tests.
ChatGPT is built to understand and generate natural language rather than being explicitly tailored to programming languages. It can achieve high statement coverage in the unit tests it generates, but a large share of the assertions within those tests may be incorrect. The concern is that ChatGPT may prioritize coverage over the accuracy of the generated assertions, which is a real limitation when using it to generate unit tests; producing correct assertions would require an approach grounded in the actual semantics of the code.
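To make the failure mode concrete, here's a hypothetical sketch (mine, not the paper's): a test that exercises every branch of a small function, so statement coverage looks great, while one assertion encodes the wrong expected value. The function and test names are invented for illustration.

```python
# Hypothetical function under test (invented for illustration, not from the paper).
def shipping_cost(weight_kg: int, express: bool) -> int:
    """Flat 5 per kg; express doubles the cost; orders over 10 kg get 20 off."""
    cost = weight_kg * 5
    if express:
        cost *= 2
    if weight_kg > 10:
        cost -= 20
    return cost


# The kind of test an LLM tends to produce: every branch is exercised,
# so statement coverage is 100%, but the last expected value is wrong.
def test_shipping_cost_llm_style():
    assert shipping_cost(4, express=False) == 20   # correct
    assert shipping_cost(4, express=True) == 40    # correct
    assert shipping_cost(12, express=True) == 120  # wrong: ignores the over-10 kg
                                                   # rebate; the code returns 100,
                                                   # so this assertion fails


# An assertion derived from the code's actual semantics:
# 12 * 5 = 60, doubled to 120 for express, minus the 20 rebate = 100.
def test_shipping_cost_semantics():
    assert shipping_cost(12, express=True) == 100
```

Statement coverage only rewards the fact that every line ran; it says nothing about whether the expected values in the assertions match what the code actually computes, which is exactly the gap the paper highlights.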
When prompted to generate tests, ChatGPT will happily produce tests that push code coverage up to the specified level, but the assertions are often incorrect. I’d be curious to see how the results would change if the experiments were rerun with a more recent model. Which brings me to a frustration with papers like this: the paper was published in 2024, but it studies ChatGPT running GPT-3.5, which was released in late 2022. Academic papers are often published long after the models they study have been superseded. That's still useful in an academic context, but it means companies wanting to adopt the latest models are left to their own devices to figure out how to do so and how to interpret the results.