publications | Sungmin Kang

2025

Predictive Prompt Analysis

Jae Yong Lee, Sungmin Kang, and Shin Yoo

In 2025 International Conference on the Foundations of Software Engineering (FSE) Ideas, Visions and Reflections Track, 2025

Abs Bib

Large Language Models (LLMs) are machine learning models that have seen widespread adoption due to their capability of handling previously difficult tasks. LLMs, due to their training, are sensitive to how exactly a question is presented, also known as prompting. However, prompting well is challenging, as it has been difficult to uncover principles behind prompting – generally, trial-and-error is the most common way of improving prompts, despite its significant computational cost. In this context, we argue it would be useful to perform ‘predictive prompt analysis’, in which an automated technique would perform a quick analysis of a prompt and predict how the LLM would react to it, relative to a goal provided by the user. As a demonstration of the concept, we present Syntactic Prevalence Analyzer (SPA), a predictive prompt analysis approach based on sparse autoencoders (SAEs). SPA accurately predicted how often an LLM would generate target syntactic structures during code synthesis, with up to 0.994 Pearson correlation between the predicted and actual prevalence of the target structure. At the same time, SPA requires only 0.4% of the time it takes to run the LLM on a benchmark. As LLMs are increasingly used during and integrated into modern software development, our proposed predictive prompt analysis concept has the potential to significantly ease the use of LLMs for both practitioners and researchers.
@inproceedings{lee2025ppa, title = {Predictive Prompt Analysis}, author = {Lee, Jae Yong and Kang, Sungmin and Yoo, Shin}, year = {2025}, booktitle = {2025 International Conference on the Foundations of Software Engineering (FSE) Ideas, Visions and Reflections Track}, }
Beyond pip install: Evaluating LLM Agents for the Automated Installation of Python Projects

Louis Milliken, Sungmin Kang, and Shin Yoo

In 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2025

Abs Bib

Many works have recently proposed the use of Large Language Model (LLM) based agents for performing ’repository level’ tasks, loosely defined as a set of tasks whose scopes are greater than a single file. This has led to speculation that the orchestration of these repository-level tasks could lead to software engineering agents capable of performing almost independently of human intervention. However, of the suite of tasks that would need to be performed by this autonomous software engineering agent, we argue that one important task is missing, which is to fulfil project level dependency by installing other repositories. To investigate the feasibility of this repository level installation task, we introduce a benchmark of of repository installation tasks curated from 40 open source Python projects, which includes a ground truth installation process for each target repository. Further, we propose INSTALLAMATIC, an agent whose goal is to perform and verify the installation of a given repository by searching for relevant instructions from documentations in the repository. We evaluate our agent using this benchmark, and report that 55% of the studied repositories can be automatically installed by our agent. Through the results of our empirical evaluation, we identify the common causes for our agent’s inability to install a repository, discuss the challenges faced in the design and implementation of such an agent and consider the implications that such an agent could have for developers.
@inproceedings{milliken2025installamatic, title = {Beyond pip install: Evaluating LLM Agents for the Automated Installation of Python Projects}, author = {Milliken, Louis and Kang, Sungmin and Yoo, Shin}, year = {2025}, booktitle = {2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)}, }

2024

Identifying Inaccurate Descriptions in LLM-generated Code Comments via Test Execution

Sungmin Kang, Louis Milliken, and Shin Yoo

2024

Abs Bib PDF

Software comments are critical for human understanding of software, and as such many comment generation techniques have been proposed. However, we find that a systematic evaluation of the factual accuracy of generated comments is rare; only subjective accuracy labels have been given. Evaluating comments generated by three Large Language Models (LLMs), we find that even for the best-performing LLM, roughly a fifth of its comments contained demonstrably inaccurate statements. While it seems code-comment consistency detection techniques should be able to detect inaccurate comments, we perform experiments demonstrating they have no statistically significant relationship with comment accuracy, underscoring the substantial difficulty of this problem. To tackle this, we propose the concept of document testing, in which a document is verified by using an LLM to generate tests based on the document, running those tests, and observing whether they pass or fail. Furthermore, we implement our concept to verify Java comments. Experiments demonstrate that our approach has a robust statistical relationship with comment accuracy, making headway into a problem where prior techniques failed. Qualitative evaluation also reveals the promise of our approach in gaining developer trust, while highlighting the limitations of our current implementation.
@misc{kang2024doctesting, title = {Identifying Inaccurate Descriptions in LLM-generated Code Comments via Test Execution}, author = {Kang, Sungmin and Milliken, Louis and Yoo, Shin}, year = {2024}, eprint = {2406.14836}, archiveprefix = {arXiv}, primaryclass = {cs.SE}, }
A Quantitative and Qualitative Evaluation of LLM-based Explainable Fault Localization

Sungmin Kang, Gabin An, and Shin Yoo

Proceedings of the ACM on Software Engineering, 2024

Abs Bib PDF Code Slides

Fault Localization (FL), in which a developer seeks to identify which part of the code is malfunctioning and needs to be fixed, is a recurring challenge in debugging. To reduce developer burden, many automated FL techniques have been proposed. However, prior work has noted that existing techniques fail to provide rationales for the suggested locations, hindering developer adoption of these techniques. With this in mind, we propose AutoFL, a Large Language Model (LLM)-based FL technique that generates an explanation of the bug along with a suggested fault location. AutoFL prompts an LLM to use function calls to navigate a repository, so that it can effectively localize faults over a large software repository and overcome the limit of the LLM context length. Extensive experiments on 798 real-world bugs in Java and Python reveal AutoFL improves method-level acc@1 by up to 233.3% over baselines. Furthermore, developers were interviewed on their impression of AutoFL-generated explanations, showing that developers generally liked the natural language explanations of AutoFL, and that they preferred reading a few, high-quality explanations instead of many.
@article{kang2024autofl, title = {A Quantitative and Qualitative Evaluation of LLM-based Explainable Fault Localization}, journal = {Proceedings of the ACM on Software Engineering}, number = {FSE}, pages = {1424-1446}, volume = {1}, issue = {FSE}, author = {Kang, Sungmin and An, Gabin and Yoo, Shin}, year = {2024}, }
Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction

Sungmin Kang, Juyeon Yoon, Nargiz Askarbekkyzy, and 1 more author

IEEE Transactions on Software Engineering, 2024

Abs Bib PDF Code

Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often in natural language and thus can be difficult to transform to test cases consistently. As a result, existing techniques mostly focused on crash bugs, which are easier to automatically detect and verify. In this work, we overcome this limitation by using large language models (LLMs), which have been demonstrated to be adept at natural language processing and code generation. By prompting LLMs to generate bug-reproducing tests, and via a post-processing pipeline to automatically identify promising generated tests, our proposed technique LIBRO could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark. Furthermore, our extensive evaluation on 15 LLMs, including 11 open-source LLMs, suggests that open-source LLMs also demonstrate substantial potential, with the StarCoder LLM achieving 70% of the reproduction performance of the closed-source OpenAI LLM code-davinci-002 on the large Defects4J benchmark, and 90% of performance on a held-out bug dataset likely not part of any LLM’s training data. In addition, our experiments on LLMs of different sizes show that bug reproduction using LIBRO improves as LLM size increases, providing information as to which LLMs can be used with the LIBRO pipeline.
@article{kang2024evaluating, title = {Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction}, author = {Kang, Sungmin and Yoon, Juyeon and Askarbekkyzy, Nargiz and Yoo, Shin}, year = {2024}, journal = {{IEEE} {T}ransactions on {S}oftware {E}ngineering}, }
The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications

Jae Yong Lee, Sungmin Kang, Juyeon Yoon, and 1 more author

2024

Abs Bib PDF Code

Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities, which has led to their rapid adoption in software engineering applications. However, details about LLM training data are often not made public, which has caused concern as to whether existing bug benchmarks are included. In lieu of the training data for the popular GPT models, we examine the training data of the open-source LLM StarCoder, and find it likely that data from the widely used Defects4J benchmark was included, raising the possibility of its inclusion in GPT training data as well. This makes it difficult to tell how well LLM-based results on Defects4J would generalize, as for any results it would be unclear whether a technique’s performance is due to LLM generalization or memorization. To remedy this issue and facilitate continued research on LLM-based SE, we present the GitHub Recent Bugs (GHRB) dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.
@misc{lee2023github, title = {The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications}, author = {Lee, Jae Yong and Kang, Sungmin and Yoon, Juyeon and Yoo, Shin}, booktitle = {2024 IEEE Conference on Software Testing, Verification and Validation Demo Track (ICST)}, series = {ICST 2024}, year = {2024}, }

2023

Deceiving Humans and Machines Alike: Search-based Test Input Generation for DNNs using Variational Autoencoders

Sungmin Kang, Robert Feldt, and Shin Yoo

ACM Transactions on Software Engineering and Methodology, 2023

Abs Bib PDF

Due to the rapid adoption of Deep Neural Networks (DNNs) into larger software systems, testing of DNN based systems has received much attention recently. While many different test adequacy criteria have been suggested, we lack effective test input generation techniques. Inputs such as images of real world objects and scenes are not only expensive to collect but also difficult to randomly sample. Consequently, current testing techniques for DNNs tend to apply small local perturbations to existing inputs to generate new inputs. We propose SINVAD, a way to sample from, and navigate over, a space of realistic inputs that resembles the true distribution in the training data. Our input space is constructed using Variational AutoEncoders (VAEs), and navigated through their latent vector space. Our analysis shows that the VAE-based input space is well-aligned with human perception of what constitutes realistic inputs. Further, we show that this space can be effectively searched to achieve various testing scenarios, such as boundary testing of two different DNNs or analyzing class labels that are difficult for the given DNN to distinguish. Guidelines on how to design VAE architectures are presented as well. Our results have the potential to open the field to meaningful exploration through the space of highly structured images.
@article{kang2023deceiving, author = {Kang, Sungmin and Feldt, Robert and Yoo, Shin}, journal = {ACM Transactions on Software Engineering and Methodology}, year = {2023}, publisher = {ACM New York, NY}, }
Towards Autonomous Testing Agents via Conversational Large Language Models

Robert Feldt, Sungmin Kang, Juyeon Yoon, and 1 more author

In Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering, 2023

Abs Bib PDF

Software testing is an important part of the development cycle, yet it requires specialized expertise and substantial developer effort to adequately test software. The recent discoveries of the capabilities of large language models (LLMs) suggest that they can be used as automated testing assistants, and thus provide helpful information and even drive the testing process. To highlight the potential of this technology, we present a taxonomy of LLM-based testing agents based on their level of autonomy, and describe how a greater level of autonomy can benefit developers in practice. An example use of LLMs as a testing assistant is provided to demonstrate how a conversational framework for testing can help developers. This also highlights how the often criticized hallucination of LLMs can be beneficial while testing. We identify other tangible benefits that LLM-driven testing agents can bestow, and also discuss some potential limitations.
@inproceedings{feldt2023autonomous, title = {Towards Autonomous Testing Agents via Conversational Large Language Models}, author = {Feldt, Robert and Kang, Sungmin and Yoon, Juyeon and Yoo, Shin}, booktitle = {Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering}, series = {ASE 2023}, year = {2023}, }
Explainable Automated Debugging via Large Language Model-driven Scientific Debugging

Sungmin Kang, Bei Chen, Shin Yoo, and 1 more author

2023

Abs Bib PDF

Automated debugging techniques have the potential to reduce developer effort in debugging, and have matured enough to be adopted by industry. However, one critical issue with existing techniques is that, while developers want rationales for the provided automatic debugging results, existing techniques are ill-suited to provide them, as their deduction process differs significantly from that of human developers. Inspired by the way developers interact with code when debugging, we propose Automated Scientific Debugging (AutoSD), a technique that given buggy code and a bug-revealing test, prompts large language models to automatically generate hypotheses, uses debuggers to actively interact with buggy code, and thus automatically reach conclusions prior to patch generation. By aligning the reasoning of automated debugging more closely with that of human developers, we aim to produce intelligible explanations of how a specific patch has been generated, with the hope that the explanation will lead to more efficient and accurate developer decisions. Our empirical analysis on three program repair benchmarks shows that AutoSD performs competitively with other program repair baselines, and that it can indicate when it is confident in its results. Furthermore, we perform a human study with 20 participants, including six professional developers, to evaluate the utility of explanations from AutoSD. Participants with access to explanations could judge patch correctness in roughly the same time as those without, but their accuracy improved for five out of six real-world bugs studied: 70% of participants answered that they wanted explanations when using repair tools, while 55% answered that they were satisfied with the Scientific Debugging presentation.
@misc{kang2023explainable, title = {Explainable Automated Debugging via Large Language Model-driven Scientific Debugging}, author = {Kang, Sungmin and Chen, Bei and Yoo, Shin and Lou, Jian-Guang}, year = {2023}, eprint = {2304.02195}, archiveprefix = {arXiv}, primaryclass = {cs.SE}, }
A Bayesian Framework for Automated Debugging

Sungmin Kang, Wonkeun Choi, and Shin Yoo

In Proceedings of the 32nd International Symposium on Software Testing and Analysis, 2023

Abs Bib PDF Slides

Debugging takes up a significant portion of developer time. As a result, automated debugging techniques including Fault Localization (FL) and Automated Program Repair (APR) have garnered significant attention due to their potential to aid developers in debugging tasks. With the recent advance in techniques that treat the two tasks as closely coupled, such as Unified Debugging, a framework to formally express these two tasks together would heighten our understanding of automated debugging and provide a way to formally analyze techniques and approaches. To this end, we propose a Bayesian framework of understanding automated debugging. We find that the Bayesian framework, along with a concrete statement of the objective of automated debugging, can recover maximal fault localization formulae from prior work, as well as analyze existing APR techniques and their underlying assumptions. As a means of empirically demonstrating our framework, we further propose BAPP, a Bayesian Patch Prioritization technique that incorporates intermediate program values to analyze likely patch locations and repair actions, with its core equations being derived by our Bayesian framework. We find that incorporating program values allows BAPP to identify correct patches more precisely: the rankings produced by BAPP reduced the number of required patch evaluations by 68% and consequently reduced the repair time by 34 minutes on average. Further, our Bayesian framework suggests a number of changes to the way fault localization information is used in program repair, which we validate is useful for BAPP. These results highlight the potential of value-cognizant automated debugging techniques, and further verifies our theoretical framework.
@inproceedings{Kang2023bb, author = {Kang, Sungmin and Choi, Wonkeun and Yoo, Shin}, booktitle = {Proceedings of the 32nd International Symposium on Software Testing and Analysis}, series = {ISSTA 2023}, title = {A Bayesian Framework for Automated Debugging}, year = {2023}, }
Towards Objective-Tailored Genetic Improvement Through Large Language Models

Sungmin Kang, and Shin Yoo

In Proceedings of the 12th International Workshop on Genetic Improvement, 2023

Abs Bib PDF

While Genetic Improvement (GI) is a useful paradigm to improve functional and nonfunctional aspects of software, existing techniques tended to use the same set of mutation operators for differing objectives, due to the difficulty of writing custom mutation operators. In this work, we suggest that Large Language Models (LLMs) can be used to generate objective-tailored mutants, expanding the possibilities of software optimizations that GI can perform. We further argue that LLMs and the GI process can benefit from the strengths of one another, and present a simple example demonstrating that LLMs can both improve the effectiveness of the GI optimization process, while also benefiting from the evaluation steps of GI. As a result, we believe that the combination of LLMs and GI has the capability to significantly aid developers in optimizing their software.
@inproceedings{Kang2023lg, author = {Kang, Sungmin and Yoo, Shin}, booktitle = {Proceedings of the 12th International Workshop on Genetic Improvement}, series = {GI 2023}, title = {Towards Objective-Tailored Genetic Improvement Through Large Language Models}, year = {2023}, }
Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction

Sungmin Kang, Juyeon Yoon, and Shin Yoo

In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering, 2023

Abs Bib PDF Blog Code Slides

Many automated test generation techniques have been developed to aid developers with writing tests. To facilitate full automation, most existing techniques aim to either increase coverage, or generate exploratory inputs. However, existing test generation techniques largely fall short of achieving more semantic objectives, such as generating tests to reproduce a given bug report. Reproducing bugs is nonetheless important, as our empirical study shows that the number of tests added in open source repositories due to issues was about 28% of the corresponding project test suite size. Meanwhile, due to the difficulties of transforming the expected program semantics in bug reports into test oracles, existing failure reproduction techniques tend to deal exclusively with program crashes, a small subset of all bug reports. To automate test generation from general bug reports, we propose LIBRO, a framework that uses Large Language Models (LLMs), which have been shown to be capable of performing code-related tasks. Since LLMs themselves cannot execute the target buggy code, we focus on post-processing steps that help us discern when LLMs are effective, and rank the produced tests according to their validity. Our evaluation of LIBRO shows that, on the widely studied Defects4J benchmark, LIBRO can generate failure reproducing test cases for 33% of all studied cases (251 out of 750), while suggesting a bug reproducing test in first place for 149 bugs. To mitigate data contamination, we also evaluate LIBRO against 31 bug reports submitted after the collection of the LLM training data terminated: LIBRO produces bug reproducing tests for 32% of the studied bug reports. Overall, our results show LIBRO has the potential to significantly enhance developer efficiency by automatically generating tests from bug reports.
@inproceedings{Kang2023aa, author = {Kang, Sungmin and Yoon, Juyeon and Yoo, Shin}, booktitle = {Proceedings of the 45th IEEE/ACM International Conference on Software Engineering}, series = {ICSE 2023}, title = {Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction}, year = {2023} }
Arachne: Search Based Repair of Deep Neural Networks

Jeongju Sohn, Sungmin Kang, and Shin Yoo

ACM Transactions on Software Engineering Methodology, 2023

Abs Bib PDF

The rapid and widespread adoption of Deep Neural Networks (DNNs) has called for ways to test their behaviour, and many testing approaches have successfully revealed misbehaviour of DNNs. However, it is relatively unclear what one can do to correct such behaviour after revelation, as retraining involves costly data collection and does not guarantee to fix the underlying issue. This article introduces Arachne, a novel program repair technique for DNNs, which directly repairs DNNs using their input-output pairs as a specification. Arachne localises neural weights on which it can generate effective patches and uses differential evolution to optimise the localised weights and correct the misbehaviour. An empirical study using different benchmarks shows that Arachne can fix specific misclassifications of a DNN without reducing general accuracy significantly. On average, patches generated by Arachne generalise to 61.3% of unseen misbehaviour, whereas those by a state-of-the-art DNN repair technique generalise only to 10.2% and sometimes to none while taking tens of times more than Arachne. We also show that Arachne can address fairness issues by debiasing a gender classification model. Finally, we successfully apply Arachne to a text sentiment model to show that it generalises beyond convolutional neural networks.
@article{Sohn2022cr, author = {Sohn, Jeongju and Kang, Sungmin and Yoo, Shin}, date-modified = {2023-05-29 12:26:00 +0900}, journal = {{ACM} {T}ransactions on {S}oftware {E}ngineering {M}ethodology}, number = {4}, pages = {1--26}, title = {Arachne: Search Based Repair of Deep Neural Networks}, volume = {32}, year = {2023}, }

2022

Language Models Can Prioritize Patches for Practical Program Patching

Sungmin Kang, and Shin Yoo

In Proceedings of the 3rd International Workshop on Automated Program Repair, 2022

Bib

@inproceedings{Kang2022kl,
  author = {Kang, Sungmin and Yoo, Shin},
  booktitle = {Proceedings of the 3rd International Workshop on Automated Program Repair},
  pages = {8--15},
  series = {APR 2022},
  title = {Language Models Can Prioritize Patches for Practical Program Patching},
  year = {2022}
}

2020

SINVAD: Search-based Image Space Navigation for DNN Image Classifier Test Input Generation

Sungmin Kang, Robert Feldt, and Shin Yoo

In Proceedings of the International Workshop on Search Based Software Testing, 2020

Bib

@inproceedings{Kang2020aa,
  author = {Kang, Sungmin and Feldt, Robert and Yoo, Shin},
  booktitle = {Proceedings of the International Workshop on Search Based Software Testing},
  pages = {521-528},
  series = {SBST 2020},
  title = {SINVAD: Search-based Image Space Navigation for DNN Image Classifier Test Input Generation},
  year = {2020}
}