Xinyi Song*, Kexin Xie, Lina Lee, Ruizhe Chen, Jared M. Clark, Hao He,
Haoran He, Jie Min, Xinlei Zhang, Simin Zheng, Zhiyang Zhang, Xinwei Deng, and Yili Hong
[Paper]
The programming capabilities of large language models (LLMs) have revolutionized
automatic code generation and opened new avenues for automatic statistical analysis. However, the validity
and quality of the generated code need to be systematically evaluated before it can be widely
adopted. Despite the growing prominence of LLMs, comprehensive evaluations of LLM-generated statistical
code remain scarce in the literature. In this paper, we assess the performance of LLMs, including two
versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis.
Our study utilizes a set of statistical analysis tasks encompassing diverse statistical topics and
datasets. Each task includes a problem description, dataset information, and human-verified SAS code. We
conduct a comprehensive assessment of the quality of SAS code generated by LLMs through human expert
evaluation based on correctness, effectiveness, readability, executability, and the accuracy of output
results. The analysis of rating scores reveals that while LLMs demonstrate usefulness in generating
syntactically correct code, they struggle with tasks requiring deep domain understanding and may produce
redundant or incorrect results. This study offers valuable insights into the capabilities and
limitations of LLMs in statistical programming, providing guidance for future advancements in AI-assisted
coding systems for statistical analysis.
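
As a rough illustration of the task format (this sketch is not drawn from the paper's benchmark), a task might pair a problem description such as "fit a simple linear regression of y on x" with human-verified SAS code along these lines; the dataset name `mydata` and the variables `y` and `x` are hypothetical:

```sas
/* Hypothetical task sketch: fit a simple linear regression of y on x.
   The dataset 'mydata' and variables 'y' and 'x' are assumed here,
   not taken from the paper's task set. */
proc reg data=mydata;
  model y = x;   /* response y regressed on predictor x */
run;
```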