Tuesday, December 11, 2012

An easier way to make exams


Recently, some of my colleagues bemoaned the effort of making up exams at the end of the term.  I wanted to share a trick I have used that makes creating exams much easier.  It might work for you too.

I work on exam questions throughout the term rather than waiting until the end.  A TA is assigned to each lecture, and is required to make up a couple of questions during or immediately after the lecture.  This way, we get ideas that pop into the TA's head but that might have been forgotten by the end of the term.  Another excellent source is student questions during class.  Likewise, we harvest common misperceptions and questions from office hours.  These often turn into excellent exam questions as well, since they are topics that could confuse a student but that were covered in class.

More advice about creating an exam appears at http://homes.cs.washington.edu/~mernst/advice/exams.html

Don't scan your recommendation letters


Faculty are writing lots of recommendation letters right now.  One of my colleagues went to extra effort to print his letter, sign it, scan it, and then upload it.  Unfortunately, this was counterproductive.

If you are requested to provide a letter in PDF, provide the original PDF that was created by your word processor or typesetting program.  Don't scan the document, which makes it much harder to read.  You don't want eyestrain (or anything else) to lessen the impact of your letter.  You can either insert a PDF signature, or have just a typewritten signature; no one really cares whether your signature is on the letter.

Another irritation I have seen:  don't use "watermark" letterhead that puts a large, dim image (for example, of the university crest or logo) behind the text.  Some people apparently think this looks cool and sets their letter or institution apart (MIT CSAIL, I am looking at you), but in fact it makes the letter harder to read without impressing anyone.

More advice on writing a letter of recommendation appears at http://homes.cs.washington.edu/~mernst/advice/write-recommendation.html

Sunday, October 28, 2012

Call for tutorial proposals for ICSE 2013


Call for tutorial proposals for ICSE 2013
San Francisco, CA, May 2013

ICSE is soliciting high-quality tutorial ideas that will attract both practitioners and researchers.  Proposals are due on November 2, 2012.

ICSE Tutorials present the state of the art or the state of the practice in a topic related to software engineering. A tutorial can be aimed at researchers, practitioners, managers, teachers, or students; different styles and topics would be appropriate in the different cases. For example, a tutorial might provide an overview of a new research area for researchers who wish to enter it or understand it; skills to use state-of-the-art development techniques; or an explanation of a development or research methodology. Tutorials on both mature and emerging topics are welcomed. The length can be 90 minutes, 3.5 hours, or 7 hours.

For more details, please see http://2013.icse-conferences.org/content/tutorialsCFP

Saturday, October 27, 2012

SPLASH 2012 Doctoral Symposium


This week, I participated as a mentor in the SPLASH Doctoral Symposium.  A doctoral symposium is an opportunity for a small group of students to get feedback about their research.  The students also got to learn from watching other people go through the same process.  (This is a rather different format than the "PhD Working Groups" I wrote about previously.)

The event was so productive that it ran 2.5 hours over its scheduled time.

I had expected that the committee would give the students concrete suggestions about their choice of problem, the evaluation methodology, related work, and the like.  Instead, our feedback focused primarily on presentation issues.

One recurring problem was that the talks — and even the 2-minute "elevator talks" — often didn't state the concrete problem until midway through.  It is essential that the first sentence of an elevator pitch, or the first slide of a talk, states the problem, the key idea of the solution, and any evidence such as results.  If you do not grab the attention of the audience from the first moment, your presentation will be a failure.  You can proceed to general background and motivation later, but don't start with generalities that aren't directly relevant to your work, and don't dwell over-long on them.

Neither the students nor the other committee members were familiar with the Heilmeier Catechism, which is a set of questions for evaluating a proposed project.  I've found it very helpful:

  • What are you trying to do? Articulate your objectives using absolutely no jargon.
  • How is it done today, and what are the limits of current practice?
  • What's new in your approach and why do you think it will be successful?
  • Who cares?
  • If you're successful, what difference will it make?
  • What are the risks and the payoffs?
  • How much will it cost?
  • How long will it take?
  • What are the midterm and final "exams" to check for success?


After a round of feedback, we gave the students time to revise their elevator pitches and their formal talks, which led to significant improvements, though there were still plenty of comments from the committee.  Congratulations to the students for the progress they made!

Thursday, October 4, 2012

A conceptual introduction to version control


Version control is a commonly-used, essential tool, but one that often confuses beginners.  I often find myself repeating the same explanations about version control concepts and the same advice about best practices -- especially for people who are new to distributed version control.  I haven't found a good source for this information, so I have written it up in a webpage that I hope others will find useful:  http://homes.cs.washington.edu/~mernst/advice/version-control.html

Tuesday, September 11, 2012

Cost of fixing bugs at different stages of the software development process


It is generally accepted common knowledge that the cost to fix a bug increases exponentially across the different stages of the software development process.  But what is the basis of this belief?  This author tried to track down the source data:  An apology to readers of Test-Driven iOS Development.

Wednesday, September 5, 2012

Nominated for Outstanding Java Spec Lead

I have been nominated (again!) for the "Outstanding Spec Lead" award that is presented by the Java Community Process (JCP), the organization that oversees the evolution of the Java language and platform. The JCP asked me to reflect on my co-leadership (with Alex Buckley of Oracle) of JSR 308, which extends Java's annotation system. Here is what I wrote.



JSR 308 (Type Annotations) has the ambitious goal of changing the Java language.  It provides a platform for developers to express and verify correctness and security properties that currently remain implicit, or are documented and checked in error-prone and ad-hoc ways.  As with any language change, it must integrate smoothly with existing features and tools.  These goals required meticulous and sustained effort.  Here are some principles that guided our work.

Don't give up, despite any adversity.  I first implemented a pluggable type system for Java in 2002, over 10 years ago.  I pitched the idea of expanding the Java programming language to Sun in 2004.  JSR 308 was formed in 2006.  And the feature will finally appear in Java 8, in 2013.  Related to not giving up, don't get started unless you really believe in the goal.  Only passion will give you the stamina for the long haul.  It's worth it!

Get help from others who will help you do great work.  In a sense it's unfair to offer an award to just one person such as me.  I have been tremendously helped by many people, notably Alex Buckley and Werner Dietl, but also the JSR 308 Expert Group, Mahmood Ali, Jonathan Gibbons, Brian Goetz, Doug Lea, Matt Papi, and many others (too many to list them all here).  I thank them all for their significant contributions.  The real winners will be Java programmers.

Run your project transparently, in the open.  This is the best way to get lots of valuable feedback.  Your passion will help you to withstand the criticism that is inevitable in an open project when not everyone's desires can be met.  Be straightforward in admitting this, and offer explanations and principles that explain the reasoning.  JSR 308's mailing lists, FAQs, etc. are linked from http://types.cs.washington.edu/jsr308/.

Balance innovation and risk.  Innovative ideas that change how developers work often come from academic research.  This was the case for the type systems that motivated JSR 308.  But, the motivation must come from real problems that arise in practice, during developers' daily lives.  JSR 308 omitted many features requested by academics because they have not yet been sufficiently tested in the field.

Work fast to meet product deadlines, but don't succumb to pressure to cut corners.  It's better to seek a cleaner solution than to saddle Java programmers with an ugly design forever.  You will end up with something, like JSR 308, that you can be proud of.

Tuesday, August 28, 2012

Teaching intro CS and programming by way of scientific data analysis

Bill Howe and I taught a new intro programming class this summer, aimed at enabling students to write programs that process real-world data.  This is an essential skill for any profession, not just for those who want to become computer scientists or programmers.  The class website is http://www.cs.washington.edu/education/courses/cse190p/12su/ and it will be taught again starting in January 2013.


Goals

Our goals were twofold:

  • to provide data analysis and programming skills that students can use in their classes, their research, and their jobs, and
  • to expand the pie of computer science education, teaching in a way that will attract different learners and complement existing approaches.

Our class was officially titled "CSE 190p: Introduction to Data Programming with Applications", to avoid confusion with existing intro programming classes.

In one sense, our class was like any other introduction to computer programming:  we taught students the syntax and semantics of a programming language, and more importantly taught computational thinking, such as how to manage data and algorithmic complexity.

The most distinctive feature of our class is that every assignment (after the first, which introduces Python basics) uses real-world data:  DNA files straight out of a sequencer, measurements of ocean characteristics (salinity, chemical concentrations) and plankton biodiversity, social networking connections and messages, election returns, economic reports, etc.  Whereas many classes explain that programming will be useful in the real world or give simplistic problems with a flavor of scientific analysis, we are not aware of other intro programming classes taught from a computer science perspective that use real-world datasets.  (But perhaps such classes exist; we would be happy to learn about them.)

By contrast, many intro classes have assignments focused around abstract problems, often encoded as puzzles and games.  That is a great motivation for certain students (it can be very effective, and it was for me!), but other students are more interested in the uses to which technology can be put than in the technology itself.  The latter category seems to include many individuals who are currently underrepresented in computing, such as women.

A more minor distinction is that our class differs from existing classes at UW in that it used Python, an easy-to-use, powerful, concise programming language that is increasingly popular in the sciences, engineering, and beyond.  Python is still somewhat unusual for an introductory programming class, but the number of intro Python classes is increasing and this choice is no longer controversial.  It fit our goals well.



Assignments

The assignments were a success.  Students were really impressed that by the second week of the term, they had written a program that read an intricate DNA data format and had computed facts about the organism.
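
A tiny sketch gives the flavor of such a program.  This is illustrative only: the course's actual assignment and data format are not shown here, and the simple FASTA-style input (">" header lines followed by sequence lines) is my assumption, not necessarily what the sequencer files looked like.

```python
def read_sequence(lines):
    """Return the concatenated DNA sequence, skipping ">" header lines."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))

def gc_content(seq):
    """Fraction of bases that are G or C, a simple fact about an organism's DNA."""
    gc = sum(1 for base in seq if base in "GC")
    return gc / len(seq)

fasta = [">example organism", "ATGC", "GGCA"]
seq = read_sequence(fasta)
print(seq)               # ATGCGGCA
print(gc_content(seq))   # 0.625
```

Even a program this small exercises file parsing, string processing, and a little arithmetic, which is why students could produce something meaningful so early.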

The assignments gave a lot of support at the beginning, when students just filled in templates that the staff provided.  The last assignment was an open-ended project that required students to propose, implement, and report upon a data analysis project of their choosing, starting from scratch.  The student-proposed projects included determining the authorship of literary texts; predicting earthquakes; correlating ethnographic and economic data; determining the mRNA target of miRNA; and determining sunspot patterns and variances.  The code tended to be relatively small (a few hundred lines of code), but the project required developing scientific hypotheses, locating datasets, devising algorithms, and reporting on the results, so the code was emphasized to the appropriate degree.

In addition to the assignments that were motivated by real-world problems, we assigned small exercises and weekly in-class quizzes that drilled students in Python coding.  This helped test understanding of specific concepts without requiring us to make the assignments too prescriptive.  The quizzes revealed gaps in understanding that were not apparent, or were not apparent quickly enough, on the assignments.


Topics

Topics covered in the class included:

  • Python concepts:
    • Expressions, values, types, variables, programs & algorithms, control flow, file I/O, the Python execution model
  • Data structures:
    • List, set, dictionary (map), tuple, graph (from a third-party library)
    • List slicing (sublist), list comprehension (shorthand for a loop)
    • Mutable and immutable data structures
    • Distinction between identity and (abstract) value
  • Functions:
    • Procedural abstraction, functions as values, recursion, function design methodology
  • Data abstraction (introduction only, no in-depth coverage):
    • Modules, objects
  • Testing and debugging:
    • Test design, coverage, & adequacy
    • Debugging strategies:  divide & conquer, the scientific method
  • Speed of algorithms
  • Statistical hypothesis testing
  • Visualization (graphing/plotting results)
  • Program decomposition
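
Several of these topics can be shown in a few lines of Python.  This is an illustrative sketch, not actual course code, and the data values are made up:

```python
readings = [3.1, 4.7, 2.2, 5.0, 4.4]

first_three = readings[:3]             # list slicing (sublist): [3.1, 4.7, 2.2]
high = [r for r in readings if r > 4]  # list comprehension (shorthand for a loop)

a = [1, 2]
b = a            # b is the *same* object as a (identity)
c = [1, 2]       # c is a different object with an equal (abstract) value
b.append(3)      # lists are mutable, so the change is visible through a
print(a)                # [1, 2, 3]
print(a is b, a == c)   # True False
```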

The class's focus (developing in students the ability to analyze scientific datasets) had a significant impact on the choice and ordering of topics, compared to more traditional introductory programming classes.  For example, students learned how to use data structures (lists, sets, dictionaries, graphs) and algorithms (searching, sorting, etc.), but we did not teach students how to re-implement them.  We focused more on functions that compute and return values rather than functions that have a side effect of printing some value.  Students learned file I/O, but not terminal I/O, interactive programs, or how to write a GUI.
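
The preference for value-returning functions over printing can be sketched as follows (the numbers here are invented for illustration, not course data):

```python
def mean(values):
    """Return the average; the caller decides what to do with the result."""
    return sum(values) / len(values)

def print_mean(values):
    """Side-effect style: the result can only be read by a human, not reused."""
    print(sum(values) / len(values))

salinities = [34.1, 34.5, 34.9]
m = mean(salinities)   # the returned value can feed further computation
print(round(m, 2))     # 34.5
```

A returned value composes with later analysis steps and is easy to test; a printed value is a dead end.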

Time constraints caused us to narrow our focus to the most important topics.  We avoided certain obscure corners of the Python language.  We taught debugging concepts and strategies, but not how to use a debugger.  We did not teach scaling up to big programs and big data (out-of-memory data, parallel programming).  But, the students are capable of solving a computational or data processing problem that fits into memory on a desktop workstation.

We used the book "Think Python: How to Think Like a Computer Scientist".  We thank its author Allen Downey for writing it.  The book is generally good (it was the best-suited of the many books we reviewed before we started teaching), but it is too brief.  It gives too little detail and often fails to point students at the next place to look.  Its order and selection of topics also did not match ours.  We augmented it with other readings, but will probably try a different book or more augmentation in the future.



Student reactions

The class was highly rated by students.  Here is feedback I received after the class.
I really enjoyed the pace of the class even though at times it was slow. Everything that we were taught was valuable to us learning Python and to programming in general. I definitely recommend this class to anyone who doesn't have programming experience. Not only do you learn about how to process data by making your computer do what you want it to do, you also develop a basic knowledge of what programming is and how it works. I believe that everyone should know something about programming especially in our computer-driven world.



Conclusion

Overall, our main hypothesis was supported.  Non-CS students can be taught to program using real-world datasets from the outset.  They are engaged by the realism, and they end up with the ability to do elementary data processing on datasets of their own choosing.

Monday, July 23, 2012

OSCON talk: "Open Source 2.0: The Science of Community Management"


One of the best talks I saw at OSCON 2012 was "Open Source 2.0: The Science of Community Management" by David Eaves, a consultant who specializes in negotiation.  He had a very short talk during the opening plenary session (this part was videotaped).  Then, he had a full 40-minute slot, which I attended because I had enjoyed the keynote.  I didn't take any notes because I had to stand in the back with a lot of other people who had crowded into the room, so this summary will be impressionistic.

Eaves talked cogently about specific challenges faced in building community around open-source projects.  These include their distributed nature, differing motivations, part-time attention from members, and the fact that email is a terrible communication mechanism that is highly susceptible to misinterpretations of meaning.

Eaves contrasted the typical zero-sum, adversarial approach to negotiation to one in which the parties respect one another, attempt to understand one another, and seek a solution that is advantageous to everyone.

Eaves emphasized the importance of listening.  He categorized different purposes of communication.  Then, he characterized one long thread in a bug-tracking database as everyone trying to prove a point without anyone listening.  Understanding the needs and motivations of others will make you much more effective, but too many people don't do it.

Eaves noted that people are not evil or irrational.  People act rationally to maximize their perceived self-interest.  Thus, understanding their self-interest, and their perception of it, can help you to find solutions that satisfy both them and you.  (That also might include educating them about their true self-interest, if you can do it without being bombastic.)  It's important to know your own self-interest and to be rational about it.  Separate out the things that are actually important to you from the things that have to do with your ego, or with one particular way of achieving those ends, or from your first ideas about a solution.

Eaves discussed that open-source projects tend to be unfriendly to newbies, as I mentioned above.  Reactions to suggestions from new members, or to suggestions that are slightly off, can be aggressive and dismissive.  There are several rational explanations for this behavior.  One explanation is that developers believe that thick skin is correlated with competence; thus, their behavior is an effective and efficient way to weed out people who would not make valuable contributions.  (It's a different question whether that correlation actually exists, and thus whether the developers' behavior is productive.)  Other explanations include that developers are protecting their own time or increasing their reputation.  Eaves suggested a "newbie" badge beside newcomers' posts, so that other people would treat them more gently.  I had always thought that the badge was there as a return mechanic -- people will want to keep posting to get "newbie" off their profiles -- but this is another excellent reason for it.  Some people will treat the newbies more gently, and other people might be more inclined to ignore them.

Overall, Eaves provided good information about seeking win-win, and not in a perfectly fluffy and content-free context, but with examples and exercises.

He recommended two books, "Getting to Yes" and "Difficult Conversations", both of which were already on my list of books to read -- maybe now I will actually read them.

He also pointed at some of his blog posts, which I have not yet read:
  http://eaves.ca/2011/06/14/how-github-saved-opensource/
  http://eaves.ca/2011/04/07/developing-community-management-metrics-and-tools-for-mozilla/
  http://eaves.ca/2007/02/05/wikis-and-open-source-collaborative-or-cooperative/

OSCON talk: "Harnessing the Good Intentions of Others for your OSS Project"

I attended OSCON 2012 in Portland last week, and I saw some good talks and some bad ones. (Mine was middling -- a disappointment, but I will use what I learned to improve for the future.)

One good OSCON talk I attended was titled "Harnessing the Good Intentions of Others for your OSS Project", by Llewellyn Falco and Lynn Langit.

The idea is that there are a lot of developers out there who would love to help you, but the barriers to their entry are too high.  A potential contributor must understand your system, figure out where to make a patch, and submit it.  This is already more than the two hours or so that most people are willing to spend making a contribution to your project.  (A related point made by David Eaves at the same conference, and possibly repeated in this talk, is that when the maintainers review the patch, they usually reject it.  The main feedback to the newbie is the "invalid" status of the bug report.)  The end result is that almost no new developers ever join an open-source project.  The speakers claim to have a 98% conversion rate (though I am not sure how meaningful that statistic is) and shared their approach.

Here are three key points I took away from it.

1. Listen for feedback and problems.  The speakers suggested setting up a Twitter search (you can get an RSS stream or a Google alert), and also searching StackOverflow and blogs.  Whenever you get a comment, respond quickly -- definitely within two days, and usually faster, because after two days the person has completely moved on from the issue.  It may seem overwhelming to look in so many places for buzz or anti-buzz about your product, but start small and build up as you build your developer base.

2. Pair Programming.  Whenever you get a communication with another developer, offer to pair program and don't stop asking until the other person agrees.  This can compress the other developer's learning time into the two hours they are likely to be willing to spend, and the other developer is much less likely to become frustrated or confused.  It can also help you to understand the patch.  A speaker recounted that he hated one proposed patch because it wasn't elegant and didn't fit into the system's intended design.  Rather than just rejecting it, he pair programmed with the proposer and after a while realized that his system was architected in a way that prevented any cleaner, better solution.  So he accepted the patch, and that patch has been important for his community of users.

When pair programming, start with the camera on for a few minutes to establish a personal connection.  Then, after that, go voice-only and use screen sharing.  Even Skype's crummy screen sharing works pretty well, and other systems like join.me are even better, especially ones that let you share the keyboard and mouse as well as the screen.

The "98% conversion rate" statistic was very fuzzy to me.  I suspect it is the percentage of pairing sessions that eventually led to at least one commit by someone.  I would find it hard to believe as the number of developers who became active in the project.  I wish the presenters had been more clear and upfront about this, because it felt like they were misleading or overselling.

A downside is that pair programming is an incredibly time-consuming approach:  it's hard to imagine spending multiple hours for each communication that comes to an open-source project.  The speakers don't consider this work:  they enjoy pair programming so it is just fun.  Furthermore, the potential benefit from a fix for an important bug or from attracting a new developer may be very large, so significant time investment is worthwhile.

3. Action items.  In any presentation, your final slide should contain a single, specific action for someone to take.  One speaker recounted that when he started putting a download button on his talks, his downloads went up a lot.

Sunday, February 26, 2012

Cold, rainy Seattle


Three times a week, I walk across campus to class.  After 8 weeks of the quarter, I have worn a jacket exactly once, because it is generally dry and warm enough to do without a jacket.

To be completely fair, I missed a few days due to travel and I don't know how the weather was then.  And, on some days it rained at other times, such as overnight, which keeps the region green but doesn't mar our enjoyment of it.  Finally, at least once it was misting outside, but not enough to justify a jacket.  Some days it was overcast, on others gloriously sunny.

Seattle has an undeserved reputation for non-stop precipitation.  A popular joke says "It only rains once a year in Seattle, but it lasts 9 months."  In reality, Seattle doesn't get very much rain.

Saturday, February 11, 2012

Java type annotations: early draft review


The type annotations specification ("JSR 308") is in Early Draft Review.  This is an opportunity for the public to comment on the proposed specification, so that it can be improved before being incorporated in the Java language.

Oracle has announced that they intend to include, in JDK 8, support for type annotations.  Currently, Java only permits annotations, such as @Deprecated or @Override, to be written on declarations.  The ability to write annotations on type uses enables improved documentation and bug detection.  One example use is the Checker Framework for pluggable typechecking.  Note that the Checker Framework is a third-party tool and not part of the Java language proper.

Monday, January 9, 2012

Command-line option argument processing in Java

If your program processes command-line options, then you have to write duplicative, boilerplate code and documentation:

  • to parse command-line options and set variables in your program,
  • for usage messages (such as printed by a --help option), and
  • for documentation in the program's manual and/or manpage.

It is a pain to write all this code.  Furthermore, it is easy for the different representations of data about command-line options to get out of sync, which leads to bugs and user confusion.

When you are writing in Java, a better approach is the Options class of plume-lib.  If you use the plume.Options class, you do not have to write any code; you only declare and document variables.  For each field that you want to set from a command-line argument, you write Javadoc and an @Option annotation.  Then, the field is automatically set from a command-line option of the same name, and usage messages and printed documentation are generated automatically.

This class has been in daily use for well over five years and is slowly gaining adherents, but it still pains me when I see code that duplicates command-line logic and documentation.  This includes most other solutions I am aware of, including Apache Commons CLI.  It is, however, similar to args4j, which seems to have been independently conceived around the same time as this class.  Use whichever one you find more convenient and useful.

For full usage information, see the plume.Options documentation.