Diving into Clojure, part 1

I’ve made it a goal of mine to become fluent in at least one functional language within the year. I’m already familiar with several, but I’d had trouble finding one which I really felt like using.

First, I was looking for a language which I could find a practical use for. That tosses out a lot of small and/or academic languages, like Scheme (which I’ve used before).

Second, I was looking for a language which has a large and growing user base. That throws out things like Common Lisp and OCaml (the latter of which I’ve used and enjoyed), and pretty much leaves Haskell, Erlang, and Clojure.

Third, I was looking for a language which, broadly, didn’t make me want to kill myself. I’ve used Haskell and Erlang in limited senses in the past (writing xmonad extensions/configuration and ejabberd extensions, respectively). I haven’t enjoyed those experiences, for different reasons. Haskell permits the invention and use of new operators. It also appears to have multiple formatting standards for code. Most Haskell code I’ve encountered is terse and composed mostly of symbols. Of course, most Haskell devs I’ve talked to consider this an advantage over Algoloid languages like Ruby. I don’t.

Erlang’s syntax is fine, but after programming for years in languages with built-in associative arrays (or even just records) it’s painful to work in a language without them. Its strings are also rough to deal with. And I have yet to figure out any consistency in how errors are reported or logged. There’s a lot of good to say about Erlang, but there’s a lot of bad to say, too.

That left Clojure, which is a Lisp. I’m familiar with Lisps, and I’ve never minded the big piles of parentheses. Writing small and well-formatted functions avoids paren hell, anyway. Clojure has several other advantages – it’s a JVM language, which means I can write Clojure that interacts with my existing JRuby code. It’s a much thinner layer on top of Java than JRuby. You’re encouraged to use the Java standard library. I’m familiar with Java as a language, but I could use more experience with its libraries. I’m also interested in Clojure’s vaunted concurrency primitives (refs, atoms, agents, and vars).

Last night I took a first stab at Clojure programming. I have an existing Ruby class which you can feed a list of URIs, which it will then download in parallel and hand to another class for packaging up into a ZIP file. I sat down with Programming Clojure and, after installing Leiningen, I spent about an hour hacking together mass-download. It doesn’t currently download in parallel, but I’ve separated out the components I think I’ll need to have working in separate threads.

The first thing I had to do was figure out how Leiningen works. Clojure’s official release is just a Java JAR. You can run the JAR directly to get a REPL, but it doesn’t have a command-line interface like many popular languages. Leiningen provides this service, as well as downloading and managing multiple Clojure versions (like rvm or rbenv in Ruby-land), managing dependencies (like RubyGems and Bundler), constructing new application skeletons, running tests, and building your app (like Rake). Having a single tool provide all of this functionality was surprising to me, coming from a Ruby background, but it might be more familiar to Java devs who are used to IDEs. (Leiningen is a command-line tool, though.) The closest analogy I can make is to the rails command used for initializing a Rails project and performing various development tasks. It also appears that Leiningen can be extended through some sort of plugin system – I’ll look into this more at a later time.

lein new mass-download creates a new directory and populates it with a basic application skeleton:

    % find

project.clj is a Leiningen project file, containing metadata about the package.

    (defproject mass-download "1.0.0-SNAPSHOT"
      :description "FIXME: write description"
      :dependencies [[org.clojure/clojure "1.3.0"]])

It’s similar to a gemspec in Ruby-land. Note that the package is called mass-download (Lisps favor dashes over underscores) but the filenames include mass_download. Clojure apparently silently translates from one to the other. I’m not sure what the reason for this is.
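A likely explanation, offered as background rather than something this post confirms: namespaces are loaded from the classpath and compiled to Java classes, and Java package and class names can’t contain hyphens. Clojure’s core library exposes the name translation as munge, so a quick check at the REPL might look like this (dir-name is just an illustrative var):

```clojure
;; munge applies the same name mangling the compiler uses for Java identifiers.
;; dir-name is an illustrative var, not part of the project.
(def dir-name (munge "mass-download"))

(println dir-name) ; prints mass_download
```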

All that src/mass_download/core.clj contains is a namespace declaration:

    (ns mass-download.core)

This is where you put your code. I barely even know how including dependencies works in Clojure so far, so my project just uses this sole file. I’m also not familiar with Clojure’s testing framework yet, so I’ve decided to test the old-fashioned way – edit, build, run, see if it works, repeat. To make it possible to run your package as a command, your namespace has to include a function called ‘-main’. So that’s where I start:

    (defn -main []
      (mass-download (urls-from-file "urls.txt")
                     http-download
                     spit
                     (in-dir "downloaded" only-basename)))

All of these functions except for spit are defined in the file. When I run the mass-downloader, it looks for a file called “urls.txt” and reads in each line as a URL to be fetched. It passes each URL to the http-download function, which then passes it on to spit to be saved. The fourth argument is a function for translating URLs to filenames. I don’t like the way mass-download is invoked at the moment – I’d prefer it to be more obvious that in-dir comes into the process before spit.

spit itself is a Clojure core function. Given any kind of writeable argument (including a filename), it will output a string to the appropriate destination, opening and closing the file in the process. It’s the companion function to slurp, which reads the entirety of any type of readable to a string. These are analogous to Ruby’s File.write and File.read, except that they operate on URLs too, as well as any IO object that implements the correct protocol. (I’m just getting into reading about protocols in Clojure, but they’re a more flexible way of doing what interfaces do in Java. Rubyists would do the same thing with duck typing, although protocols let you add glue code to existing types in a way you’d likely have to use monkey-patching to do in Ruby.)
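As a small illustration of the pairing, here is a sketch that writes a string to a temporary file with spit and reads it back with slurp. The round-trip function and the temp-file prefix are made up for the example; only spit, slurp, and java.io.File come from Clojure and Java.

```clojure
(defn round-trip
  "Hypothetical helper: write text to a temp file with spit, read it back with slurp."
  [text]
  (let [tmp (java.io.File/createTempFile "spit-demo" ".txt")]
    (try
      (spit tmp text)   ; opens the file, writes the string, closes the file
      (slurp tmp)       ; opens the file, reads everything into a string, closes it
      (finally
        (.delete tmp)))))

(println (round-trip "hello from spit"))
```

Passing a filename string or a URL instead of a File object goes through the same functions.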

    (defn mass-download
      "Download many files and pass them to a handler."
      ([url-source url-downloader file-acceptor url-to-path]
        (url-source (fn [url]
          (url-downloader url (fn [url file] 
            (file-acceptor (url-to-path url) file)))))))

Here’s the main function which does the work. I’m pretty sure I’m making this more complicated than it needs to be. It invokes the URL source function, passing a callback to be invoked with each URL. I should probably be using a ‘seq’ instead – ‘seq’ is Clojure’s generic abstraction for sequences, which are often lazily evaluated and potentially infinite; lists, vectors, maps, and lazily generated sequences can all be traversed as seqs. (I use one later.) The callback invokes the URL downloader function with each URL, as well as a callback which receives each URL and the downloaded data for that URL. This final callback invokes the file acceptor function with the filename to be saved and the data. In this case, file-acceptor is spit, which takes a filename and a string to put into that file. I’ll eventually want it to instead be a function which saves the file to a ZIP archive.
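For comparison, here is a rough sketch of what a seq-based version could look like. It is not the code from this project: mass-download-seq, the stub functions, and the example.com URL are all invented for illustration, and the downloader is stubbed out so nothing touches the network.

```clojure
(defn mass-download-seq
  "Hypothetical seq-based variant: urls is any seq of URL strings,
   fetch turns a URL into data, save stores data under a path."
  [urls fetch save url->path]
  ;; doseq walks the seq eagerly, which is what we want for side effects.
  (doseq [url urls]
    (save (url->path url) (fetch url))))

;; Stubs standing in for http-download and spit.
(def saved (atom {}))

(mass-download-seq
  ["http://example.com/a.txt"]
  (fn [url] (str "contents of " url))
  (fn [path data] (swap! saved assoc path data))
  (fn [^String url] (str "downloaded/" (last (.split url "/")))))
```

With a seq as input, the nested callbacks collapse into one loop body, and the order of "fetch, then name, then save" is visible in a single line.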

    (defn urls-from-file
      ([file]
        (fn [action]
          (with-open [rdr (clojure.java.io/reader file)]
            (dorun (map action (line-seq rdr)))))))

This is the url-source used above. More accurately, it returns that URL source when given a filename. The function returned, when invoked, opens the file, reads each line, and invokes a passed function on each line. The with-open construct allows you to declare a scope where an opened IO will automatically be closed upon exiting. This is similar to Ruby’s block form for File.open.
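Here is a minimal sketch of with-open on its own, reading from an in-memory string so it runs without any files on disk. The read-lines name is hypothetical. The lines are copied into a vector before the scope ends, so nothing tries to read from the reader after it has been closed.

```clojure
(require '[clojure.java.io :as io])

(defn read-lines
  "Hypothetical helper: return every line of source as a vector.
   source can be anything clojure.java.io/reader accepts."
  [source]
  (with-open [rdr (io/reader source)]
    ;; vec forces all lines to be read while rdr is still open;
    ;; rdr is closed automatically when this form exits.
    (vec (line-seq rdr))))

(println (read-lines (java.io.StringReader. "http://example.com/a\nhttp://example.com/b\n")))
```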

line-seq, when applied to a reader, returns a ‘seq’ consisting of the lines of the file. The seq that line-seq returns is lazy – at the point it’s called, at most the first line has been read from the file. (This results in a problem if you pass the seq outside the with-open construct, incidentally: the reader is closed before the remaining lines are read. Don’t do that.) This is much like Ruby’s File#lines, which returns an Enumerator if not invoked with a block. map applies its first argument (a function) to every element in the seq and returns a new seq. More accurately, it returns a new seq which, when traversed, applies the function to each element in the underlying seq.
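A small experiment makes the laziness of map visible. The counter and the noisy function are invented for this sketch. The point is only that the function isn’t called until something walks the seq.

```clojure
;; Counts how many times noisy has actually run.
(def calls (atom 0))

(defn noisy
  "Hypothetical side-effecting function: bumps the counter, returns x unchanged."
  [x]
  (swap! calls inc)
  x)

;; map only builds a lazy seq here; noisy has not run yet.
(def pending (map noisy (list 1 2 3)))
(def calls-before @calls)

;; dorun walks the whole seq for its side effects and returns nil.
(dorun pending)
(def calls-after @calls)

(println calls-before calls-after) ; prints 0 3
```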

This is important to remember – seqs like this one are lazy, and most of the sequence functions which apply to them are lazy too. When map returns a value, it still hasn’t actually invoked the callback on any of the lines in the file. If you want to traverse the seq (because, as in this case, the function being applied to the seq has side effects), you have to do so explicitly. The dorun (and doall) forms do this: dorun walks the seq and discards the results, while doall walks it and returns the fully realized seq. It’s not pure FP, and I suspect that the way to make this purer lies in the Clojure sequence library. But in this case, it works (and, after all, we’re trying to create a side-effect – downloading and writing out a file). So dorun here invokes the function for every line in the file. That function, as shown above, invokes this downloader with the URL and a callback:

    (defn http-download
      ([url file-acceptor]
        (file-acceptor url (slurp url))))

All this does is pull down the contents of the URL (using the Clojure core function slurp) and pass it to the file acceptor, which first turns the URL into a filename:

    (defn only-basename
      ([url]
        (.getName (clojure.java.io/file url))))

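    ; assumes clojure.string is loaded, e.g. (ns mass-download.core (:require clojure.string))
    (defn in-dir
      ([dir file-fn]
        (fn [fname]
          (clojure.string/join "/" [dir (file-fn fname)]))))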

    ; and then when invoking mass-download
    (in-dir "downloaded" only-basename)

This creates a function which uses Java’s File::getName() to extract the basename of a file out of the URL, and then prepends “downloaded/” so that the file is stored in a subdirectory. Finally, this filename and the data to be stored are passed to spit, which writes out the data to that filename.
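To see the two pieces working together, here is a self-contained sketch that restates both helpers and applies the combined function to a sample URL. The example.com URL and the url->path name are made up for the demonstration.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

(defn only-basename
  "Treat the URL as a file path and keep only its last segment."
  [url]
  (.getName (io/file url)))

(defn in-dir
  "Return a function that maps a name to dir/(file-fn name)."
  [dir file-fn]
  (fn [fname]
    (string/join "/" [dir (file-fn fname)])))

;; Illustrative name for the composed function.
(def url->path (in-dir "downloaded" only-basename))

(println (url->path "http://example.com/files/report.pdf"))
;; prints downloaded/report.pdf
```

Since the URL is handled as a plain path, anything after the last slash ends up in the filename, including a query string if the URL has one.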

My next step with this is to make the downloading concurrent. My best idea on how to do that right now is to get rid of url-source and turn it into a seq, and then to create an agent for each entry in that seq. As I understand it, actions sent to agents are executed in a thread pool. What I’m not sure about at this point is how to keep track of how many files have been downloaded (and, more importantly, when all of them have been) without doing something gross like iterating over the agents in a loop and polling them to see if they’re done. I’ll be posting again once I’ve done it.
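One possible direction, sketched here as an assumption rather than the approach this series ends up taking: Clojure’s core await function blocks until the actions sent to a set of agents have finished, which would avoid the polling loop. In the sketch below, download-all is a hypothetical name and fetch is a stub standing in for a real downloader such as slurp.

```clojure
(defn download-all
  "Hypothetical sketch: run fetch on every URL via one agent per URL,
   wait for all of them, and return the results in order."
  [urls fetch]
  (let [agents (doall (map agent urls))]   ; each agent's initial state is its URL
    (doseq [a agents]
      ;; send-off suits blocking I/O. Exceptions are caught and stored as
      ;; the result, so a failed action cannot leave await waiting forever.
      (send-off a (fn [url]
                    (try
                      (fetch url)
                      (catch Exception e e)))))
    (apply await agents)                   ; blocks until every action above is done
    (doall (map deref agents))))

;; Stub downloader so the example runs without network access.
(println (download-all ["a" "b"] (fn [url] (str "body of " url))))
```

A standalone program would typically also call (shutdown-agents) before exiting, since the agent thread pools can otherwise keep the JVM alive.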