In 2018, I was two years into a tenure-track junior faculty position at Carnegie Mellon University when I saw what was happening in the world of software. 2018 was when Cambridge Analytica happened; soon afterward, GDPR became law. Most academics who start companies in industry do so because they are commercializing a specific project; I left to start Akita Software because I saw that the timing was right to build the kinds of tools I wanted to build—those that could help improve the quality and security of software. For the last two years, I’ve been working to build developer tools at the API level for modern web apps.
Because the technical problem was not set in stone at the time I started Akita, I spent the first year of working on the company doing product and user research work. My academic career trained me to solve problems in a principled way. Through the process of getting Akita up and running, I learned how to select problems in a principled way. This was important because for the first time, the success of my efforts was being defined by how well I solved problems that people cared about enough to pay real-world money for. Here are my learnings.
Problem selection is a science, too
When I was in research, I spent a lot of time talking to programmers “in the wild,” to understand the problems they had. So I thought I was doing my homework in terms of problem selection. In the last two years, I’ve learned about user research and realized that my problem selection in academia could have been much more methodical and principled!
When you are trying to understand the precise problem, there are principled ways of understanding user pain, even when your users are programmers and you are working on something as complex as programming languages and tools. You can’t prove things about user pain using math or measure it the way you measure compiler performance, but you can be scientific all the same, e.g., by applying rigorous methods. This is something that the field of HCI has been doing for a long time. People have been advocating for bringing these methods into programming tools work, but they still haven’t reached most of programming languages research. (For instance, see this paper “Programmers are Users Too: Human-Centered Methods for Improving Programming Tools” and Sarah Chasins’s syllabus for “Building User-Centered Programming Tools.”)
One of the main things I learned about user research was to be rigorous about how I asked user questions. The most straightforward way is to ask all users the same questions in the same order, under the same conditions. It is also important not to ask leading questions or to directly ask users what they want. There’s that famous saying that if Henry Ford had asked people what they wanted, they would have said faster horses. If your hypothesis is that tying one’s shoes is a pain point when trying to leave the house, it is not a good idea to ask directly: “Do you have trouble tying your shoes?” A better way to do this is to ask the user to walk you through what happens when they leave the house and ask them what the pain points are. (For more on this, you may be interested in “Principles of Contextual Inquiry.”)
Another big lesson I learned during user research was to zoom out. Previously, when I talked to users while doing programming languages research, my conversations usually centered on the pain developers were having with the specific problem I was working on. What I learned when doing user research was to understand the user as a whole person, complete with greater motivations and incentives, and to understand how the problem I solved fit in with their goals, hopes, and dreams. For instance, here are questions I asked before learning more about user research:
- Tell me about your experience with problem X.
- What solutions for problem X would you find most helpful?
- Would you use my solution for your problem X?
And here are some of the questions we asked dozens of users when I started Akita, which had very little to do with the specific tool we were building:
- Tell me about your role in the company.
- Walk me through a day in your job. What tools are you using? How are you spending your time?
- What tools do you use on a daily basis?
- Tell me about what led you to adopt tool X.
- What are issues that would cause you to stay at work past the usual time?
- Tell me about how you interface with the rest of engineering. What frictions do you have there? (This was a question for security engineers.)
- How do you get promoted?
As a programming languages researcher, I certainly did not ask these questions in a consistent and disciplined way during problem selection, if at all. From asking these questions, I learned, for instance, that developer adoption impacted the success of a security tool much more than anything security engineers thought about the tool, and that developer adoption often hinged more on actionability than accuracy. These realizations not only led us to completely pivot what we thought our product should be (we started out focused on security/privacy-related usage of APIs, but pivoted to general-purpose use-cases), but also forever changed my views on building programming languages and tools.
What matters to users is not what I thought
Embracing the approach described above led to important changes in what we were doing. When I started Akita, I thought what we were going to build was the equivalent of an application performance monitoring tool for tracing data, targeted towards security and privacy teams. I had come to this initial conclusion from talking to people at a high level about what they thought the problems were, and what they thought was needed. Through doing user research, we instead decided to build low-friction integration tools targeted at inferring and testing against API specs, and to pivot our solution away from targeting security or privacy specifically. Here are some of the lessons that led to that:
- The most important way a security engineer can spend their time is improving their relationship with software engineering. When we asked security engineers how they got promoted and/or increased their influence in their organization, it came up over and over again that a security team was only as good as the engineers who listened to them. This gave us confidence that building developer tools was the right way to go, since they have much greater influence!
- One of the biggest challenges security engineers face is getting software engineers to fix their bugs. When we asked security engineers to walk us through what they ended up spending a lot of time on, it involved convincing engineers to fix things. For instance, one engineer I spoke to spent a lot of his time coming up with examples to convince engineers that something they wrote is insecure. Another team took it one step further and hired penetration testers just to prove out the importance of certain security/privacy controls to their engineering teams. The results of this user research led us to believe that it wasn’t enough to provide reports to security that something was wrong; a more powerful tool would not only help identify possible problems, but also help developers prioritize and fix the problems. (Our findings match those of this study.) As a developer tools team, this was right up our alley, leading us to start building tools that helped with providing context and reproducibility.
- Integration friction matters. When we asked users about what led them to choose the tools that they currently used, ease of integration into existing tool chains came up over and over again, as well as how much configuration the tool needed. This was especially true for security engineers, as they burned social capital whenever they made asks of other teams. We realized if we didn’t make a tool that was easy to integrate, it would be very hard to get adoption.
- Building a developer tool for security means building for the developer/security gap. Over the course of the first year of working on our product, we realized that our product design and go-to-market involved straddling the communication and incentive gap between security engineering and software engineering teams. After understanding the point of view of each side, and seeing that developers did not actually seem very interested in closing that gap, we ultimately decided we needed to choose a side. We chose to become a developer tool with a security use case, instead of the reverse. Without the user research we had done, the choice would not have been nearly as obvious!
Through the process of doing this user research, I also learned to let go of many of the assumptions I had held in academia:
- Soundness may not be the right goal. In academia, it is often the goal to make static program analyses sound: if a bug is not reported, then that bug should not be possible. A common complaint about static analyses is that they aren’t actually that helpful a lot of the time because of the false positives. One reason this is a problem, for instance, is that the more bugs security teams report to developers, the less likely the bugs are to get fixed. In academia, I had thought of false positives as reports of non-bugs; many industry developers would also call a low-priority bug a “false positive.” A surprising (to me) number of engineers said they’d rather not know about all of the bugs, but instead be informed about a subset of bugs that were likely to be fixed. Peter O’Hearn’s “Incorrectness Logic” paper is one that addresses this side of the problem.
- “Legacy factors” may be more important than guarantees or performance. More generally, the tradeoffs that people want to make are not what you might expect, so they are worth exploring. When I dug deeper into my problem, for instance, I learned that the barrier to entry was often convincing people to adopt tools, which had more to do with integrations and learning curve than with math or performance. If people never got to the point of appreciating what was good about a tool, they would not appreciate the tool! This is definitely a different point of view than the one I had in academia, where the focus was on building clean versions of new features, leaving adoption as an exercise to the reader. While requiring all academic work to worry about adoption would certainly curb a lot of innovation, viewing adoption as “not academia’s problem” may be curbing the opportunity for impact. Given that frictionless integration with legacy tool chains yields surprisingly challenging technical problems and can disproportionately affect impact, it should be given more attention in academia.
Should we be evaluating PL papers differently?
My experience learning about user research led me to revisit something that I had previously thought a lot about: how should we be evaluating programming languages research papers? While a paper’s Introduction is what gets reviewers interested in a paper in the first place, the evaluation section gives a champion reviewer the necessary ammunition to go to battle at the Program Committee meeting. Because of this power, what shows up in the evaluation section determines what kind of work gets done in entire fields.
My strong belief is that programming languages research does not value usability because there currently is no satisfying, agreed-upon way of evaluating it. Semantics evaluations guarantee properties like soundness, completeness, and type safety, but they don’t say anything about whether someone will actually care about the guarantees or the language. Benchmarks determine whether the tool runs fast on sample programs. Benchmarks are great for compilers of known languages, but are often not as appropriate for language or tool design papers that do not focus on performance. The hour-long user studies we often see are inadequate for evaluating complex languages and tools. (Coblenz et al. have a recent TOCHI submission outlining some existing shortcomings in evaluation processes and proposing a new process.)
The parts of language or tool design that have to do with developer experience—which, for a lot of papers, is a lot of things—end up only in the introduction and in various assumptions throughout the paper. What this means is that these parts are not directly evaluated. And when we don’t directly evaluate usability claims, two things happen. First, work that focuses on improving usability in a thoughtful way does not necessarily have a better chance of getting positively peer-reviewed. Second, usability claims do not get critically reviewed, meaning the field is less likely to move towards solutions that improve developer experience. Right now, when a paper’s main strength is usability, it has a worse time getting through peer review because it’s up to the specific reviewers whether they choose to appreciate it or not.
Here’s a question that I would like to pose to the field: What would it look like to apply rigorous evaluation to problem selection for PL papers, instead of case studies for the result? For papers that propose things for developer productivity, why not evaluate them against interviews/surveys of where developers are actually struggling? Why not also make the criteria (for instance, soundness or performance) something that needs to be justified by the user research? “Soundness” is a formal statement that an analysis respects semantics according to some semantic property of interest, for instance no run-time failures or no information leaks. What would it look like to justify this property itself, e.g., through user interviews?
I don’t know the answers, but here is one version of what it could look like. For researchers, understanding that there exists a framework for evaluating usability may change how you decide to pose and evaluate those claims. There is no need to go all-out and do a collaboration with an HCI researcher, though that could be interesting and helpful for the work. Even conducting a few interviews is better than nothing. For reviewers, understanding usability evaluation may help you consider it as an acceptable form of evaluation for PL research, and flag when papers make usability claims without backing them up rigorously. And the only way we can move towards this is if the leaders of the community support this shift towards rigorous user-centric work.
A final word: we should be careful not to let the pursuit of any specific kind of evaluation dominate a field, including user-centric evaluation. On the one hand, I love the idea of adding human-centered evaluation sections to highly technical programming languages papers. I have always thought of programming languages research as combining human factors with math and code, and these evaluation sections would finally reflect this marriage. On the other hand, academia isn’t industry for a reason. We need room for the what-ifs. And the more that we require of evaluations, the harder it will be to justify the intuition that accompanies truly novel discoveries. While I believe it would move the field forward to consider human-centered evaluation on a similar footing to semantics-centered or performance-centered evaluations, we should be wary of making any one of these evaluations required.
Acknowledgments: Thanks to Will Crichton for comments and suggestions on a draft of this post!
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.