This blog has already hosted several interesting articles about program synthesis, — i.e., how to automatically generate programs from high-level expressions of user intent? In this blog post, we will look at exciting applications of program synthesis in data science.
Challenges in Data Science
Due to the “big data” revolution, one of the most popular job titles these days is “data scientist”, and there has been an explosion in the number of software libraries that aim to facilitate data science. While the end goal of data science is to extract insights from large amounts of data (e.g., using machine learning), many sources claim that a typical data scientist spends 50-80% of their time performing so-called “janitor work” of data science, which includes tasks such as data reshaping, cleaning, and imputing (i.e., filling missing entries).
Data Science as an Ideal Application Domain for Program Synthesis
Fortunately, many data science tasks can be automated using modern program synthesis technology. In particular, while we are still quite far from synthesizing arbitrary programs, data science tasks have three key characteristics that make them an ideal application domain.
- First, many people use input-output examples to describe typical data wrangling tasks, making programming-by-example (PBE) an excellent match.
- Second, even though input-output examples turn out to be too ambiguous for most application domains, this is not as big of an issue in the context of data wrangling — in this context, program synthesizers have some hope of generating the intended program with relatively little input from the user.
- Finally, even though the kinds of programs required for data preparation are non-trivial for most users (e.g., require familiarity with the right libraries or involve higher-order combinators), they are actually not all that large.
Given that current synthesis technology is good at generating very concise programs that are often difficult for humans to write, many of the scripts arising in data science are good candidates for being synthesizable.
Program Synthesis for Data Preparation: Techniques and Applications
Data preparation refers to the process of processing raw data and preparing it for downstream analysis and visualization tasks. This includes extracting a relevant subset of the data, converting to a unified data format, removing missing entries, and putting it into a shape that is suitable for subsequent analysis.
In recent years, researchers have shown that almost all aspects of the data preparation pipeline can be successfully automated using programming by example.
For example, work by Sumit Gulwani, Rishabh Singh, Alex Polozov and others have shown how to automate many data extraction, cleaning, and reshaping tasks that arise in the context of Excel spreadsheets. Early success stories of program synthesis in this context have been enabled through the use of version space algebras (VSA), which compactly represent the set of all programs that are consistent with a given set of input-output examples. The basic idea here is to construct a data structure (e.g., DAG) for each input-output example and then combine them (via intersection) to obtain programs that satisfy all of the examples. To make this process more scalable, a popular approach is to utilize inverse semantics of the underlying DSL to decompose the overall synthesis problem into simpler problems. In fact, this form of compositional VSA learning forms the basis of Microsoft’s Prose framework, which has attained enormous success in automating spreadsheet tasks ranging from input formatting to relational data extraction to table transformations.
Other variants of version space learning have also been successful for automating data preparation tasks like data imputation and reshaping.
For example, in recent work with Rishabh Singh and my former PhD student Xinyu Wang, we developed new version learning algorithms based on finite tree automata (FTA) (i.e., automata that recognize trees). Since programs can be represented in terms of their abstract syntax trees (ASTs), it is natural to think of program synthesis as the construction of an FTA that recognizes exactly those programs that satisfy a given specification. (see Fig 1 below) In addition to being more compact than VSAs for some application domains (e.g., in the context of data imputation), FTAs also have the advantage of not requiring the DSL’s inverse semantics (which can be hard or impossible to come up with). Furthermore, FTAs can be combined with program analysis techniques like predicate abstraction to make the version space even more compact, resulting in faster synthesis. Such advances in version space learning have made it possible to automate more challenging data science tasks like data imputation and complex reshaping using a programming-by-example approach.
However, the success stories of program synthesis in data science are not limited to spreadsheets. For instance, my former PhD student Yu Feng, I, and other collaborators have developed program synthesizers that target the R programming language, which is extremely popular among data scientists and statisticians. While R provides a rich set of libraries (tidyr, dplyr, tidyxl, purrr, …) for helping with data preparation tasks, such APIs are nonetheless not so easy to use, and on-line forums abound with questions on how to use these APIs to solve specific data wrangling tasks. In our work, we showed that it is entirely possible to automate even fairly complex data wrangling tasks using a programming-by-example approach. However, for technical reasons (e.g., widespread use of higher-order combinators), this domain turns out not to be a good fit for the version space learning techniques mentioned above. Thus, a more suitable approach is to employ type-directed search combined with SMT-based logical reasoning to prune the search space. In subsequent work, we also showed how to learn from failed synthesis attempts and better guide the search using statistical models and reinforcement learning. The cool thing about this line of work is that we were able to automate in minutes challenging data preparation tasks that even proficient R programmers take an hour to solve. We are currently working on making our tool chain available as an RStudio plugin.
Beyond spreadsheet automation and code generation for R, program synthesis has also been successfully used for generating Python code and SQL queries. For example, synthesis researchers from UW have successfully used programming-by-example for generating SQL queries, and researchers from UC Berkeley have recently built a tool called AutoPandas for automatically generating data manipulation programs for Python by incorporating neural architectures into the synthesis process.
Program Synthesis for Data Science Beyond Data Preparation
So far, I have only talked about the usefulness of synthesis for data preparation, which, as mentioned earlier, is considered the “janitor work” of data science. However, recent work has shown that synthesis can also have interesting applications in this domain.
For example, recent work by Jose Cambranero and Martin Rinard from MIT considers the problem of automatically generating entire machine learning pipelines for solving supervised learning tasks.
As another example, our recent work shows how to automate data visualization tasks from examples. Given the original dataset and a tiny (manually-constructed) visualization for a small subset of this data, our synthesis tool called Viser can automatically synthesize a script that performs both the necessary data wrangling and the desired visualization. For instance, this figure gives an example of a visualization task that can be automatically completed using Viser.
Overall, program synthesis has already been quite successful at automating many types of data science tasks, and I believe that there are even more ways in which program synthesis and, more broadly PL research in general, could boost the productivity of data scientists and make their lives easier.
For instance, one interesting avenue for PL research is helping data scientists debug their machine learning models. Problems in trained models can be due to a number of different factors (e.g., the nature of training data, the model architecture, the choice of hyperparameters, a bug in the actual code), and identifying the root cause of low accuracy on the test data can be very challenging. I believe that some of the ideas of the PL community could be effective in helping data scientists identify and fix problems in machine learning models and the training process.
Bio: Işil Dillig is an Associate Professor of Computer Science at the University of Texas at Austin where she leads the UToPiA research group. Her current research interests are program synthesis and verification and their applications in databases / data science, security, and artificial intelligence.
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.