Xinyi Song*, Kexin Xie, Lina Lee, Ruizhe Chen, Jared M. Clark, Hao He,
Haoran He, Jie Min, Xinlei Zhang, Simin Zheng, Zhiyang Zhang, Xinwei Deng, and Yili Hong
[Paper]
The programming capabilities of large language models (LLMs) have revolutionized
automatic code generation and opened new avenues for automatic statistical analysis. However, the validity
and quality of the generated code need to be systematically evaluated before it can be widely
adopted. Despite the growing prominence of LLMs, comprehensive evaluations of LLM-generated statistical
code remain scarce in the literature. In this paper, we assess the performance of LLMs, including two
versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis.
Our study utilizes a set of statistical analysis tasks encompassing diverse statistical topics and
datasets. Each task includes a problem description, dataset information, and human-verified SAS code. We
conduct a comprehensive assessment of the quality of SAS code generated by LLMs through human expert
evaluation based on correctness, effectiveness, readability, executability, and the accuracy of output
results. The analysis of rating scores reveals that while LLMs demonstrate usefulness in generating
syntactically correct code, they struggle with tasks requiring deep domain understanding and may produce
redundant or incorrect results. This study offers valuable insights into the capabilities and
limitations of LLMs in statistical programming, providing guidance for future advancements in AI-assisted
coding systems for statistical analysis.
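
As a rough illustration of the task format (this sketch is not drawn from the paper's benchmark), a task might pair a problem description such as "fit a simple linear regression of y on x" with human-verified SAS code along these lines; the dataset name `mydata` and the variables `y` and `x` are hypothetical:

```sas
/* Hypothetical task sketch: fit a simple linear regression of y on x.
   The dataset 'mydata' and variables 'y' and 'x' are assumed here,
   not taken from the paper's task set. */
proc reg data=mydata;
  model y = x;   /* response y regressed on predictor x */
run;
```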