Xinyi Song

Blacksburg, VA EMAIL

I am currently a fifth-year PhD student in Statistics at Virginia Tech, advised by Dr. Yili Hong, focusing on the interface between AI, machine learning, and statistics. I completed my Master's degree in Statistics at the University of Illinois at Urbana-Champaign and my Bachelor's degree in Statistics at China University of Geosciences.

I've had the pleasure of interning at Microsoft, American Express, John Deere, and TikTok.

Research Interests:
  • Deep learning for solar cell defect detection and image analysis
  • Imbalanced data and rare event prediction in ML/DL
  • Metrics and evaluation of large language models (LLMs)

Education

Virginia Tech

PhD in Statistics

Research Interests: Advanced Bayesian Statistics | Rare Event Prediction | Deep Learning | Evaluation of Large Language Models in Statistics

Thesis: Statistical Methods for Performance Evaluation of Machine Learning and Artificial Intelligence Models (Advisor: Dr. Yili Hong)
Teaching Experience: Lecturer, STAT 2274 Basic Python for Statistics (Jan 2025 - Present)
Aug 2020 - Jun 2025 (Expected)

GPA: 3.9/4.0


University of Illinois at Urbana-Champaign

Master of Science in Statistics
Relevant Coursework: Statistical Learning | Data Science Foundation | Categorical Data Analysis | Statistical Inference | Statistical Computing | Natural Language Processing
Aug 2018 - May 2020

GPA: 3.91/4.0


China University of Geosciences

Bachelor of Science in Statistics
Sep 2013 - Jun 2017

GPA: 93/100



Experience

Data Scientist Intern

Microsoft
  • Cohort Analysis: Tracked Microsoft Account (MSA) daily engagement trends, retention, and churn (see the retention sketch below).
  • ML & DL Models: Predicted churn, detected behavioral patterns, and enabled proactive interventions.
  • AI-Driven Insights: Analyzed 100+ parameters for dynamic cohort segmentation.

May 2024 - Aug 2024
Redmond, WA 98052
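
A minimal sketch of the daily cohort retention computation behind the first bullet, written in pandas; the msa_id and activity_date column names are illustrative assumptions, not Microsoft's actual schema.

import pandas as pd

def cohort_retention(events: pd.DataFrame) -> pd.DataFrame:
    """Return a cohort-by-day retention matrix from raw engagement events."""
    events = events.copy()
    events["activity_date"] = pd.to_datetime(events["activity_date"])
    # Cohort = date of each account's first observed activity.
    events["cohort"] = events.groupby("msa_id")["activity_date"].transform("min")
    events["day"] = (events["activity_date"] - events["cohort"]).dt.days
    # Count distinct active accounts per cohort and days since first activity.
    active = (events.groupby(["cohort", "day"])["msa_id"]
                    .nunique()
                    .unstack(fill_value=0))
    # Divide each row by its cohort size (day 0) to get retention rates.
    return active.div(active[0], axis=0)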


Decision Science Lab - AI Lab Innovation & Risk

American Express

Project – Uncertainty Quantification of Machine Learning Model Output based on Dirichlet Prior Networks (DPN)

  • Pre-processed Customer Default Swaps (CDSS) data (870.9M observations, 255 variables, imbalance ratio 0.011) in PySpark via Databricks and trained a DPN in PyTorch to evaluate the uncertainty of predicted default probabilities (Gini score: 92.2%).
  • Quantified distributional uncertainty with the DPN to identify severely misclassified defaults with high credit bill balances but low CDSS scores (predicted default probabilities), and detected a model degradation problem (see the sketch below).

Jun 2023 - Aug 2023
New York, NY 11025
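
A minimal sketch of how distributional uncertainty is typically decomposed from DPN outputs; the exponential link from logits to Dirichlet concentrations and the tensor shapes are illustrative assumptions, not the production model.

import torch

def dpn_uncertainty(logits: torch.Tensor) -> torch.Tensor:
    """Distributional uncertainty for DPN logits of shape (batch, classes).

    The Dirichlet concentration is taken as alpha = exp(logits); distributional
    (knowledge) uncertainty is the mutual information between the label and the
    sampled categorical distribution.
    """
    alpha = logits.exp()                              # Dirichlet concentrations
    alpha0 = alpha.sum(dim=-1, keepdim=True)          # precision
    p = alpha / alpha0                                # expected class probabilities
    # Total uncertainty: entropy of the expected categorical distribution.
    total = -(p * p.log()).sum(dim=-1)
    # Expected data (aleatoric) uncertainty under the Dirichlet.
    expected_data = -(p * (torch.digamma(alpha + 1)
                           - torch.digamma(alpha0 + 1))).sum(dim=-1)
    # Distributional uncertainty = total - expected data uncertainty.
    return total - expected_data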


Data Scientist Intern

John Deere
  • Integrated large spatial datasets with Spark SQL and supported key analytics projects.
  • Designed neural networks in PyTorch to detect crop rows from images using semantic graphics, reaching an accuracy of 69.5% (see the sketch below).

Feb 2020 - May 2020
Champaign, IL 61820
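
A minimal sketch of a small encoder-decoder network for pixel-wise crop-row prediction; the architecture, channel sizes, and two-class output are illustrative assumptions rather than the model used at John Deere.

import torch
import torch.nn as nn

class CropRowSegNet(nn.Module):
    """Tiny encoder-decoder mapping an RGB image to per-pixel crop-row logits."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                      # downsample by 2
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2),  # upsample back
            nn.ReLU(inplace=True),
            nn.Conv2d(16, num_classes, kernel_size=1),            # per-pixel logits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# Example: four 256x256 RGB frames -> logits of shape (4, 2, 256, 256).
logits = CropRowSegNet()(torch.randn(4, 3, 256, 256))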


Data Scientist Intern

TikTok
  • Mined ads datasets (8+ TB) with Hive and created Tableau dashboards for anomaly detection.
  • Implemented a call-to-action (CTA) button evaluated through A/B testing, yielding +10% conversion and +5% customer retention (see the test sketch below).

Jun 2019 - Aug 2019
Beijing, China
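
A minimal sketch of the two-proportion z-test that typically backs such an A/B read-out; the visitor and conversion counts below are hypothetical.

from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """One-sided p-value for H1: variant B converts better than control A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))   # standard error under H0
    z = (p_b - p_a) / se
    return norm.sf(z)                                        # upper-tail p-value

# Hypothetical counts: 50,000 users per arm, 10% relative lift in conversion.
print(two_proportion_ztest(conv_a=5000, n_a=50_000, conv_b=5500, n_b=50_000))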


Research Assistant

University of Illinois at Urbana-Champaign

Social Media Data Collection and Analysis, a funded research project of Stockholm University, Sweden, conducted at UIUC. Advisors: Prof. Brian Deal and Prof. Si Chen

  • Crawled and pre-processed geographic, text, and rating data for Stockholm and Chicago from the Yelp and Google APIs.
  • Applied embeddings and text mining algorithms (e.g., SVM, XGBoost, Random Forest, CNN, LSTM, BERT) for text classification to extract popularity and functionality from 11,933 reviews of 2,800 POIs in Chicago (see the sketch below).
  • BERT performed best for popularity and activity-type (functionality) prediction, with accuracies of 84% and 80.11%; a Random Forest based on Word2Vec achieved 80.96% accuracy for restaurant-type prediction (functionality).

Mar 2020 - Dec 2020
Urbana, IL 61821
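
A minimal sketch of one classical baseline from the second bullet (TF-IDF features with a Random Forest); the review texts and labels are tiny placeholders, not the Yelp data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Placeholder POI reviews and functionality labels standing in for the real data.
train_texts = ["Great ramen and friendly staff", "Quiet park with long running trails",
               "Best espresso downtown", "Playground and picnic tables by the lake"]
train_labels = ["restaurant", "outdoor", "restaurant", "outdoor"]

# Bag-of-ngrams features feeding a Random Forest, one of the classical baselines.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    RandomForestClassifier(n_estimators=300, random_state=0))
clf.fit(train_texts, train_labels)
print(clf.predict(["Friendly staff and great espresso"]))  # predicted functionality label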


Statistical Consultant

University of Illinois at Urbana-Champaign

Refined models for an iBeacon-based indoor positioning system in the Undergraduate Library (UGL) at UIUC. Advisor: Prof. Jim Hahn

  • Applied a weighted trilateration algorithm to remove sample noise, reducing model running time by 30% on average (see the sketch below).
  • Led the team in training K-Nearest Neighbors (KNN), Random Forest, SVM, Naïve Bayes, Gaussian Mixture Model, and Neural Network classifiers, improving localization accuracy from 56.6% to 89.9% by tuning parameters in Python.

Jan 2019 - May 2019
Urbana, IL 61821
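
A minimal sketch of weighted trilateration from beacon distance estimates; the beacon layout and the inverse-distance weighting are illustrative assumptions, not the UGL deployment.

import numpy as np

def weighted_trilateration(beacons: np.ndarray, distances: np.ndarray) -> np.ndarray:
    """Estimate a 2D position from beacon locations (n, 2) and ranged distances (n,).

    Linearizes the circle equations against the last beacon and solves a
    weighted least-squares system, weighting nearer (less noisy) beacons more.
    """
    # Subtracting the last circle equation removes the quadratic terms.
    A = 2 * (beacons[:-1] - beacons[-1])
    b = (distances[-1] ** 2 - distances[:-1] ** 2
         + np.sum(beacons[:-1] ** 2, axis=1) - np.sum(beacons[-1] ** 2))
    W = np.diag(1.0 / distances[:-1])          # inverse-distance weights
    # Weighted normal equations: (A^T W A) x = A^T W b.
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

beacons = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
true_pos = np.array([3.0, 4.0])
distances = np.linalg.norm(beacons - true_pos, axis=1)
print(weighted_trilateration(beacons, distances))   # approximately [3.0, 4.0]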



Publications

A comprehensive case study on the performance of machine learning methods on the classification of solar panel electroluminescence images


Xinyi Song*, Kennedy Odongo, Francis G. Pascual, Yili Hong [Paper]

Photovoltaics (PV) are widely used to harvest solar energy, an important form of renewable energy. Photovoltaic arrays consist of multiple solar panels constructed from solar cells. Solar cells in the field are vulnerable to various defects, and electroluminescence (EL) imaging provides effective and nondestructive diagnostics to detect those defects. We use multiple traditional machine learning and modern deep learning models to classify EL solar cell images into different functional/defective categories. Because of the asymmetry in the number of functional versus defective cells, an imbalanced label problem arises in the EL image data. The current literature lacks insights on which methods and metrics to use for model training and prediction. In this article, we comprehensively compare different machine learning and deep learning methods under different performance metrics on the classification of solar cell EL images from monocrystalline and polycrystalline modules. We provide a comprehensive discussion on different metrics. Our results provide insights and guidelines for practitioners in selecting prediction methods and performance metrics.

2024
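
A minimal scikit-learn sketch of the kind of imbalance-aware metrics the paper compares; the labels and scores below are synthetic placeholders, not the EL image data.

from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

# Synthetic stand-in for an imbalanced functional (0) vs. defective (1) split.
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.4]

# Plain accuracy is misleading under class imbalance; these metrics are less so.
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1 (defective):   ", f1_score(y_true, y_pred, pos_label=1))
print("MCC:              ", matthews_corrcoef(y_true, y_pred))
print("ROC AUC:          ", roc_auc_score(y_true, y_score))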



Performance Evaluation of Large Language Models in Statistical Programming


Xinyi Song*, Kexin Xie, Lina Lee, Ruizhe Chen, Jared M. Clark, Hao He, Haoran He, Jie Min, Xinlei Zhang, Simin Zheng, Zhiyang Zhang, Xinwei Deng, and Yili Hong [Paper]

The programming capabilities of large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automatic statistical analysis. However, the validity and quality of these generated codes need to be systematically evaluated before they can be widely adopted. Despite their growing prominence, a comprehensive evaluation of statistical code generated by LLMs remains scarce in the literature. In this paper, we assess the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis. Our study utilizes a set of statistical analysis tasks encompassing diverse statistical topics and datasets. Each task includes a problem description, dataset information, and human-verified SAS code. We conduct a comprehensive assessment of the quality of SAS code generated by LLMs through human expert evaluation based on correctness, effectiveness, readability, executability, and the accuracy of output results. The analysis of rating scores reveals that while LLMs demonstrate usefulness in generating syntactically correct code, they struggle with tasks requiring deep domain understanding and may produce redundant or incorrect results. This study offers valuable insights into the capabilities and limitations of LLMs in statistical programming, providing guidance for future advancements in AI-assisted coding systems for statistical analysis.

2025
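
A minimal pandas sketch of aggregating per-task expert ratings by model and criterion, the basic summary underlying such a comparison; the model names and scores are placeholders, not the study's data.

import pandas as pd

# Placeholder ratings on a 1-5 scale; real values come from human expert review.
ratings = pd.DataFrame({
    "model":     ["gpt-a", "gpt-a", "llama-b", "llama-b"],
    "criterion": ["correctness", "readability", "correctness", "readability"],
    "score":     [4, 5, 3, 4],
})

# Mean rating per model and criterion.
summary = ratings.pivot_table(index="model", columns="criterion",
                              values="score", aggfunc="mean")
print(summary)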



Applied Statistics in the Era of Artificial Intelligence: A Review and Vision


Jie Min*, Xinyi Song, Simin Zheng, Caleb B. King, Xinwei Deng, Yili Hong.

The advent of artificial intelligence (AI) technologies has significantly changed many domains, including applied statistics. This review and vision paper explores the evolving role of applied statistics in the AI era, drawing from our experiences in engineering statistics. We begin by outlining the fundamental concepts and historical developments in applied statistics and tracing the rise of AI technologies. Subsequently, we review traditional areas of applied statistics, using examples from engineering statistics to illustrate key points. We then explore emerging areas in applied statistics, driven by recent technological advancements, highlighting examples from our recent projects. The paper discusses the symbiotic relationship between AI and applied statistics, focusing on how statistical principles can be employed to study the properties of AI models and enhance AI systems. We also examine how AI can advance applied statistics in terms of modeling and analysis. In conclusion, we reflect on the future role of statisticians. Our paper aims to shed light on the transformative impact of AI on applied statistics and inspire further exploration in this dynamic field.

2024



* denotes First Author


Skills

Maths and AI Theory

Deep Learning and ML Frameworks

Languages and Operating Systems

Database Technologies