Tuesday, August 28, 2012

Teaching intro CS and programming by way of scientific data analysis

Bill Howe and I taught a new intro programming class this summer, aimed at enabling students to write programs that process real-world data.  This is an essential skill for any profession, not just for those who want to become computer scientists or programmers.  The class website is http://www.cs.washington.edu/education/courses/cse190p/12su/ and it will be taught again starting in January 2013.


Our goals were twofold:

  • to provide data analysis and programming skills that students can use in their classes, their research, and their jobs, and
  • to expand the pie of computer science education, teaching in a way that will attract different learners and complement existing approaches.

Our class was officially titled "CSE 190p: Introduction to Data Programming with Applications", to avoid confusion with existing intro programming classes.

In one sense, our class was like any other introduction to computer programming:  we taught students the syntax and semantics of a programming language, and more importantly taught computational thinking, such as how to manage data and algorithmic complexity.

The most distinctive feature of our class is that every assignment (after the first, which introduces Python basics) uses real-world data:  DNA files straight out of a sequencer, measurements of ocean characteristics (salinity, chemical concentrations) and plankton biodiversity, social networking connections and messages, election returns, economic reports, etc.  Whereas many classes explain that programming will be useful in the real world or give simplistic problems with a flavor of scientific analysis, we are not aware of other intro programming classes taught from a computer science perspective that use real-world datasets.  (Perhaps such classes exist; we would be happy to learn about them.)

By contrast, many intro classes have assignments focused around abstract problems, often encoded as puzzles and games.  That is a great motivation for certain students (it can be very effective, and it was for me!), but other students are more interested in the uses to which technology can be put than in the technology itself.  The latter category seems to include many individuals who are currently underrepresented in computing, such as women.

A more minor distinction from existing classes at UW is that ours used Python, an easy-to-use, powerful, concise programming language that is increasingly popular in the sciences, engineering, and beyond.  Python is still somewhat unusual for an introductory programming class, but the number of intro Python classes is increasing and this choice is no longer controversial.  It fit our goals well.


The assignments were a success.  Students were really impressed that by the second week of the term, they had written a program that read an intricate DNA data format and had computed facts about the organism.

The assignments gave a lot of support at the beginning, when students just filled in templates that the staff provided.  The last assignment was an open-ended project that required students to propose, implement, and report upon a data analysis project of their choosing, starting from scratch.  The student-proposed projects included determining the authorship of literary texts; predicting earthquakes; correlating ethnographic and economic data; determining the mRNA target of miRNA; and determining sunspot patterns and variances.  The code tended to be relatively small (a few hundred lines of code), but the project required developing scientific hypotheses, locating datasets, devising algorithms, and reporting on the results, so the code was emphasized to the appropriate degree.

In addition to the assignments that were motivated by real-world problems, we assigned small exercises and weekly in-class quizzes that drilled students in Python coding.  These helped test understanding of specific concepts without requiring us to make the assignments too prescriptive.  The quizzes revealed gaps in understanding that were not apparent, or were not apparent quickly enough, on the assignments.


Topics covered in the class included:

  • Python concepts:
    • Expressions, values, types, variables, programs & algorithms, control flow, file I/O, the Python execution model
  • Data structures:
    • List, set, dictionary (map), tuple, graph (from a third-party library)
    • List slicing (sublist), list comprehension (shorthand for a loop)
    • Mutable and immutable data structures
    • Distinction between identity and (abstract) value
  • Functions:
    • Procedural abstraction, functions as values, recursion, function design methodology
  • Data abstraction (introduction only, no in-depth coverage):
    • Modules, objects
  • Testing and debugging:
    • Test design, coverage, & adequacy
    • Debugging strategies:  divide & conquer, the scientific method
  • Speed of algorithms
  • Statistical hypothesis testing
  • Visualization (graphing/plotting results)
  • Program decomposition
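
Several of the data-structure topics above (list slicing, list comprehension, mutability, and identity versus value) can be illustrated in a few lines of Python.  The following is a minimal sketch, not taken from the course materials; the variable names and the salinity readings are made up for illustration.

```python
# Hypothetical salinity readings; the values are invented for illustration.
readings = [35.1, 34.8, 35.4, 33.9, 35.0]

# List slicing: a new list holding the first three elements.
first_three = readings[:3]

# List comprehension: shorthand for a loop that builds a list.
high = [r for r in readings if r > 35.0]

# Identity versus value: an alias names the same object,
# while a copy is a distinct object with an equal value.
alias = readings
copy = readings[:]

same_value = (copy == readings)     # compares contents
same_object = (copy is readings)    # compares identity
alias_is_same = (alias is readings)

# Lists are mutable: a change made through the alias is visible
# through the original name, but not through the copy.
alias.append(34.5)
```

Here `copy == readings` compares values and `copy is readings` compares identity, which is why a mutation through `alias` affects `readings` but leaves `copy` unchanged.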

The class's focus (enabling students to analyze scientific datasets) had a significant impact on the choice and ordering of topics, compared to more traditional introductory programming classes.  For example, students learned how to use data structures (lists, sets, dictionaries, graphs) and algorithms (searching, sorting, etc.), but we did not teach students how to re-implement them.  We focused more on functions that compute and return values than on functions that have a side effect of printing some value.  Students learned file I/O, but not terminal I/O, interactive programs, or how to write a GUI.
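
To illustrate the distinction between returning and printing, here is a small sketch.  The function names and the temperature values are hypothetical and are not drawn from the course assignments.

```python
def mean_printing(values):
    """Print the mean of values; the caller gets nothing back (None)."""
    print(sum(values) / len(values))


def mean(values):
    """Return the mean of values so callers can use it in later steps."""
    return sum(values) / len(values)


# Hypothetical measurements, invented for illustration.
temperatures = [10.0, 12.0, 14.0]

# Because mean() returns its result, it can feed a further computation.
deviations = [t - mean(temperatures) for t in temperatures]
```

The returning version composes with other code, as in the computation of `deviations`, whereas the printing version can only display its result.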

Time constraints caused us to narrow our focus to the most important topics.  We avoided certain obscure corners of the Python language.  We taught debugging concepts and strategies, but not how to use a debugger.  We did not teach scaling up to big programs and big data (out-of-memory data, parallel programming).  Even so, the students are capable of solving a computational or data processing problem that fits into memory on a desktop workstation.

We used the book "Think Python: How to Think Like a Computer Scientist".  We thank its author Allen Downey for writing it.  The book is generally good (it was the best-suited of the many books we reviewed before we started teaching), but it is too brief.  It gives too little detail and often fails to point students at the next place to look.  Its order and selection of topics also did not match ours.  We augmented it with other readings, but will probably try a different book or more augmentation in the future.

Student reactions

The class was highly rated by students.  Here is feedback I received after the class.
I really enjoyed the pace of the class even though at times it was slow. Everything that we were taught was valuable to us learning Python and to programming in general. I definitely recommend this class to anyone who doesn't have programming experience. Not only do you learn about how to process data by making your computer do what you want it to do, you also develop a basic knowledge of what programming is and how it works. I believe that everyone should know something about programming especially in our computer-driven world.


Overall, our main hypothesis was supported.  Non-CS students can be taught to program using real-world datasets from the outset.  They are engaged by the realism, and they end up with the ability to do elementary data processing on datasets of their own choosing.


Tapan said...

Sounds like a great class! We've been thinking about and doing some similar things at the high school level and younger. Would be great to compare notes.

btw, I came across this adaptation of Downey's book recently..

Michael Ernst said...

Thanks for the pointer to "Python for Informatics" -- I didn't know about this book, but from the preface it sounds like its goals are similar to mine. The following is from the preface:

My goal in SI502 is to teach people life-long data handling skills using Python. Few of my students were planning to be professional computer programmers. Instead, they planned to be librarians, managers, lawyers, biologists, economists, etc. who happened to want to skillfully use technology in their chosen field.


The first 10 chapters are similar to the Think Python book but there have been some changes. Nearly all number-oriented exercises have been replaced with data-oriented exercises. Topics are presented in the order needed to build increasingly sophisticated data analysis solutions. Some topics like try and except are pulled forward and presented as part of the chapter on conditionals while other concepts like functions are left until they are needed to handle program complexity rather than introduced as an early lesson in abstraction. The word “recursion” does not appear in the book at all.

In chapters 11-15, nearly all of the material is brand new, focusing on real-world uses and simple examples of Python for data analysis including regular expressions for searching and parsing,

Allen Downey said...

Michael, this sounds like a great class, and I couldn't agree with you more about the importance of working with real data.

It sounds like Think Python is not a perfect match for your class, but that's part of the reason I put the book under a Creative Commons license. You should do a mash-up of Think Python and Think Stats!