Publications

You can also find my articles on my Google Scholar profile.

Conference Papers


Datalog with First-Class Facts

Published in Proceedings of the VLDB Endowment, Volume 18, 2025

State-of-the-art Datalog engines include expressive features such as ADTs (structured heap values), stratified aggregation and negation, various primitive operations, and the opportunity for further extension using FFIs. Current parallelization approaches for state-of-the-art Datalog engines target shared-memory locking data structures using conventional multi-threading, or use the map-reduce model for distributed computing. Furthermore, current state-of-the-art approaches cannot scale to formal systems that pervasively manipulate structured data, because they lack indexing for structured data stored in the heap. In this paper, we describe a new approach to data-parallel structured deduction that involves a key semantic extension of Datalog to permit first-class facts and higher-order relations via defunctionalization, an implementation approach that enables parallelism uniformly both across sets of disjoint facts and over individual facts with nested structure. We detail a core language, DLs, whose key invariant (subfact closure) ensures that each subfact is materialized as a first-class fact. We extend DLs to Slog, a fully-featured language whose forms facilitate leveraging subfact closure to rapidly implement expressive, high-performance formal systems. We demonstrate Slog by systematically building a family of control-flow analyses from abstract machines, along with several implementations of classical type systems (such as STLC and LF). We performed experiments on EC2, Azure, and ALCF's Theta at up to 1000 threads, showing orders-of-magnitude scalability improvements versus competing state-of-the-art systems.
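The subfact-closure invariant can be illustrated with a minimal sketch: every nested subfact of a fact gets interned as a top-level fact with its own identity. The representation and names below are hypothetical, for illustration only, and are not Slog's actual API.

```python
# Sketch of the "subfact closure" invariant: every nested subfact of a
# fact is itself materialized as a top-level fact with its own identity.

def materialize(fact, db):
    """Intern `fact` (a tag followed by arguments, as nested tuples)
    into `db`, recursively materializing every subfact; return its id."""
    args = tuple(materialize(a, db) if isinstance(a, tuple) else a
                 for a in fact[1:])
    key = (fact[0],) + args
    if key not in db:
        db[key] = len(db)          # fresh id for this (sub)fact
    return db[key]

db = {}
# free(ref(x), lam(x, ref(y))) -- a fact over nested AST-like values
materialize(("free", ("ref", "x"), ("lam", "x", ("ref", "y"))), db)
# db now also holds ("ref","x"), ("lam",...), and ("ref","y") as facts,
# so indexing (and thus parallel joins) can reach inside structured data
```

Because each subfact is a first-class fact with an id, rules can match on nested structure the same way they match on flat tuples.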

Recommended citation: T. Gilray, A. Sahebolamri, Y. Sun, S. Kunapaneni, S. Kumar, and K. Micinski, “Datalog with First-Class Facts,” in Proceedings of the VLDB Endowment, Volume 18, London, United Kingdom, 2025.
Download Paper

Column-Oriented Datalog on the GPU

Published in The 39th Annual AAAI Conference on Artificial Intelligence, 2025

Datalog is a logic programming language widely used in knowledge representation and reasoning (KRR), program analysis, and social media mining due to its expressiveness and high performance. Traditionally, Datalog engines use either row-oriented or column-oriented storage. Engines like VLog and Nemo favor column-oriented storage for efficiency on limited-resource machines, while row-oriented engines like Soufflé use advanced data structures with locking to perform better on multi-core CPUs. The advent of modern datacenter GPUs, such as the NVIDIA H100 with its ability to run over 16k threads simultaneously and high memory bandwidth, has reopened the debate on which storage layout is more effective. This paper presents the first column-oriented Datalog engines tailored to the strengths of modern GPUs. We present VFLog, a CUDA-based Datalog runtime library with a column-oriented GPU data structure that supports all necessary relational algebra operations. Our results demonstrate over 200x performance gains over state-of-the-art CPU-based column-oriented Datalog engines and a 2.5x speedup over GPU Datalog engines in various workloads, including KRR.
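The row-versus-column contrast can be sketched in a few lines: in a column-oriented relation, each attribute is a separate dense array, so a join only touches the columns it actually needs. This is a simplified illustration in Python, not VFLog's CUDA implementation.

```python
# Column-oriented join sketch: relations are dicts mapping column name
# to a list of values; a join reads only the two key columns.

def column_join(r_cols, s_cols, r_key, s_key):
    """Join R with S on one column each; return matching row-id pairs."""
    index = {}
    for i, v in enumerate(r_cols[r_key]):       # build side: R's key column
        index.setdefault(v, []).append(i)
    return [(i, j)                               # probe side: S's key column
            for j, v in enumerate(s_cols[s_key])
            for i in index.get(v, [])]

edge = {"src": [1, 2, 3], "dst": [2, 3, 4]}
# self-join edge.dst = edge.src, i.e. paths of length two
pairs = column_join(edge, edge, "dst", "src")
# pairs == [(0, 1), (1, 2)] : rows for 1->2->3 and 2->3->4
```

On a GPU, the per-column arrays are dense and contiguous, which is what makes this layout attractive for the massively parallel probes the paper targets.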

Recommended citation: T. Gilray, A. Sahebolamri, Y. Sun, S. Kunapaneni, S. Kumar, and K. Micinski, “Column-Oriented Datalog on the GPU,” in The 39th Annual AAAI Conference on Artificial Intelligence, 2025.
Download Paper

Optimizing Datalog for the GPU

Published in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025

Modern Datalog engines (e.g., LogicBlox, Soufflé, ddlog) enable their users to write declarative queries which compute recursive deductions over extensional facts, leaving high-performance operationalization (query planning, semi-naïve evaluation, and parallelization) to the engine. Such engines form the backbone of modern high-throughput applications in static analysis, network monitoring, and social-media mining. In this paper, we present a methodology for implementing a modern in-memory Datalog engine on data center GPUs, allowing us to achieve significant (up to 45x) gains compared to Soufflé (a modern CPU-based engine) on context-sensitive points-to analysis of httpd. We present GPUlog, a Datalog engine backend that implements iterated relational algebra kernels over a novel range-indexed data structure we call the hash-indexed sorted array (HISA). HISA combines the algorithmic benefits of incremental range-indexed relations with the raw computation throughput of operations over dense data structures. Our experiments show that GPUlog is significantly faster than CPU-based Datalog engines while achieving a favorable memory footprint compared to contemporary GPU-based joins.
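The HISA idea can be approximated in a few lines: tuples live in one dense array sorted by join key, and a hash index maps each key to the contiguous range holding it, so a lookup is a hash probe plus a slice. This Python sketch is a simplification of the paper's GPU data structure, for illustration only.

```python
# Sketch of a hash-indexed sorted array (HISA)-style layout.
from bisect import bisect_left, bisect_right

class RangeIndexed:
    def __init__(self, tuples):
        self.data = sorted(tuples)                  # dense sorted array
        keys = [t[0] for t in self.data]
        self.index = {k: (bisect_left(keys, k),     # hash index: key ->
                          bisect_right(keys, k))    # [start, end) range
                      for k in set(keys)}

    def lookup(self, key):
        """All tuples whose first component equals `key`."""
        lo, hi = self.index.get(key, (0, 0))
        return self.data[lo:hi]

edges = RangeIndexed([(1, 2), (1, 3), (2, 4), (3, 4)])
# edges.lookup(1) -> [(1, 2), (1, 3)]
```

The dense sorted array gives the throughput of bulk operations over contiguous memory, while the range index gives the incremental key lookups that semi-naïve join needs.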

Recommended citation: Y. Sun, A. R. Shovon, T. Gilray, K. Micinski, and S. Kumar, “Optimizing Datalog for the GPU,” in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Rotterdam, The Netherlands, 2025.
Download Paper

Assemblage: Automatic Binary Dataset Construction for Machine Learning

Published in NeurIPS 2024 Datasets and Benchmarks Track, 2024

Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpora of malicious binaries, obtaining high-quality corpora of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine-learning-based pipelines for binary analysis utilize either costly commercial corpora (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present ASSEMBLAGE: an extensible distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpora suitable for training state-of-the-art models in binary analysis. We have run ASSEMBLAGE on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. ASSEMBLAGE is designed to be both reproducible and extensible, enabling users to publish “recipes” for their datasets, and facilitating the extraction of a wide array of features. We evaluated ASSEMBLAGE by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpora of high-quality Windows PE binaries in training modern learning-based binary analyses. ASSEMBLAGE code is open-sourced under the MIT license, and the dataset can be downloaded from https://assemblage-dataset.net/

Recommended citation: C. Liu, R. Saul, Y. Sun, et al., “Assemblage: Automatic Binary Dataset Construction for Machine Learning,” in NeurIPS 2024 Datasets and Benchmarks Track, 2024.
Download Paper

Communication-Avoiding Recursive Aggregation

Published in IEEE International Conference on Cluster Computing (CLUSTER), 2023

Recursive aggregation has been of considerable interest due to its unifying a wide range of deductive-analytic workloads, including social-media mining and graph analytics. For example, Single-Source Shortest Paths (SSSP), Connected Components (CC), and PageRank may all be expressed via recursive aggregates. Implementing recursive aggregation has posed a serious algorithmic challenge, with state-of-the-art work identifying sufficient conditions (e.g., pre-mappability) under which implementations may push aggregation within recursion, avoiding the serious materialization overhead inherent to traditional reachability-based methods (e.g., Datalog). State-of-the-art implementations of engines supporting recursive aggregates focus on large unified machines, due to the challenges posed by mixing semi-naïve evaluation with distribution. In this work, we present an approach to implementing recursive aggregates on high-performance clusters which avoids the communication overhead that prevents current-generation distributed systems from scaling recursive aggregates to extremely high process counts. Our approach leverages the observation that aggregators form functional dependencies, allowing us to implement recursive aggregates via highly parallel local aggregation to ensure maximal throughput. Additionally, we present a dynamic join planning mechanism, which customizes join order per iteration based on dynamic relation sizes. We implemented our approach in PARALAGG, a library which allows the declarative implementation of queries that utilize recursive aggregates and executes them using our MPI-based runtime. We evaluate PARALAGG on a large unified node and leadership-class supercomputers, demonstrating scalability up to 16,384 processes.
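The idea of pushing aggregation inside recursion can be sketched on SSSP: because distance is functionally dependent on the node (one value per node, kept by min), only improved distances need to propagate each round, rather than materializing all paths. This is a serial Python stand-in for illustration, not the paper's distributed MPI evaluation.

```python
# SSSP as a recursive min-aggregate with semi-naive delta propagation.

def sssp(edges, source):
    """edges: iterable of (u, v, weight). Returns dict node -> distance."""
    dist = {source: 0}
    delta = {source: 0}                    # frontier: new/improved distances
    while delta:
        new_delta = {}
        for u, d in delta.items():
            for (a, v, w) in edges:
                if a == u and d + w < dist.get(v, float("inf")):
                    dist[v] = d + w        # min-aggregate maintained as a
                    new_delta[v] = d + w   # functional dependency: one
        delta = new_delta                  # distance per node, not all paths
    return dist

edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 1)]
# sssp(edges, 0) -> {0: 0, 1: 3, 2: 1, 3: 4}
```

The functional dependency is what makes local aggregation safe: each process can keep only the best value per key and exchange just the deltas.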

Recommended citation: Y. Sun, S. Kumar, T. Gilray, and K. Micinski, “Communication-Avoiding Recursive Aggregation,” in IEEE International Conference on Cluster Computing (CLUSTER), Santa Fe, USA, 2023.
Download Paper

So You Want to Analyze Scheme Programs With Datalog?

Published in International Conference on Functional Programming (ICFP) Scheme Workshop, 2021

Static analysis approximates the behavior of a program by examining only its syntax. For example, control-flow analysis (CFA) determines which syntactic lambdas (for functional languages) or methods (for object-oriented languages) may be invoked at each call site within a program. Rich theoretical results exist studying control-flow analysis for Scheme-like languages, but implementations are often complex and specialized. By contrast, object-oriented languages (Java in particular) enjoy high-precision control-flow analyses that scale to thousands (or more) of lines of code. State-of-the-art implementations (such as DOOP on Soufflé) structure the analysis using Horn-SAT (Datalog) to enable compilation of the analysis to efficient implementations such as high-performance relational algebra kernels. In this paper, we present an implementation of control-flow analysis for a significant subset of Scheme (including set!, call/cc, and primitive operations) using the Soufflé Datalog engine. We present an evaluation on a worst-case term demonstrating the polynomial complexity of our m-CFA and remark upon scalability results using Soufflé.
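The Horn-clause style such analyses compile into can be shown on the simplest possible example, graph reachability, evaluated to a fixpoint; m-CFA follows the same shape with richer relations (call sites, environments, closures). A hedged Python sketch, not the paper's Soufflé code:

```python
# Datalog rules, evaluated naively to a fixpoint:
#   path(x, y) :- edge(x, y).
#   path(x, z) :- path(x, y), edge(y, z).

def fixpoint(edge):
    path = set(edge)
    changed = True
    while changed:                       # iterate until no new facts derive
        changed = False
        for (x, y) in list(path):
            for (a, z) in edge:
                if a == y and (x, z) not in path:
                    path.add((x, z))
                    changed = True
    return path

edge = {(1, 2), (2, 3)}
# fixpoint(edge) -> {(1, 2), (2, 3), (1, 3)}
```

Engines like Soufflé compile such rules to semi-naïve relational algebra kernels instead of this naive loop, which is where the performance comes from.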

Recommended citation: Y. Sun, S. Kumar, T. Gilray, and K. Micinski, “So You Want to Analyze Scheme Programs with Datalog?” in International Conference on Functional Programming (ICFP) Scheme Workshop, Online, 2021.
Download Paper

Declarative Demand-Driven Reverse Engineering

Published in NDSS Workshop on Binary Analysis Research (BAR), 2021

Binary reverse engineering is a challenging task because it often necessitates reasoning using both domain-specific knowledge (e.g., understanding entrypoint idioms common to an ABI) and logical inference (e.g., reconstructing interprocedural control flow). To help perform these tasks, reverse engineers often use toolkits (such as IDA Pro or Ghidra) that allow them to interactively explicate properties of binaries. We argue that deductive databases serve as a natural abstraction for interfacing between visualization-based binary analysis tools and high-performance logical inference engines that compute facts about binaries. In this paper, we present a vision for the future in which reverse engineers use a visualization-based tool to understand binaries while simultaneously querying a logical-inference engine to perform arbitrarily complex deductive inference tasks. We call our vision declarative demand-driven reverse engineering (D^3RE for short), and sketch a formal semantics whose goal is to mediate interaction between a logical-inference engine (such as Soufflé) and a reverse engineering tool. We describe a prototype tool, d3re, which we are using to explore the D^3RE vision. While still a prototype, we have used d3re to reimplement several common querying tasks on binaries. Our evaluation demonstrates that d3re enables both better performance and more succinct implementation of these common RE tasks.

Recommended citation: Y. Sun, J. Ching, and K. Micinski, “Declarative Demand-Driven Reverse Engineering,” in NDSS Workshop on Binary Analysis Research (BAR), Online, 2021.
Download Paper