Saturday, August 11, 2012

lparallel and how I was thinking too complicated

Yesterday, I posted a short article that showed how I used lparallel to speed up a lengthy computation by reading input data files in parallel. A few readers responded, and the response from James M. Lawrence, who is the author of lparallel, is reproduced here because it contains source code that can't be properly formatted in blogspot comments (any suggestions regarding something better?).

[citing James M. Lawrence]
Here is a map-reduce-style version:
(defun slurp-alist (hash-table alist)
    :for (key . data) :in alist
    :do (setf (gethash key hash-table) data))

(defun load-files-in-parallel (pathnames)
  (flet ((parse (pathname)
           (let ((result nil))
              (lambda (key data)
                (push (cons key data) result)))
    (reduce #'slurp-alist
            (lparallel:pmap 'vector #'parse pathnames)
            :initial-value (make-hash-table))))

The benefit here is that worker threads do not have to contend for a queue lock. A possible drawback is that the key-data pairs survive until the final reduce, whereas the pairs have a shorter lifetime when passed through a queue. Of course performance will depend upon the scale and shape of the data.

Benchmarks with early versions of lparallel showed better performance with a vector-based queue, however since then I found a simple cons-based queue to be generally faster, which is currently the default. There is still a compile-time option to get the vector-based queue, and unless that is enabled the number passed to make-queue is ignored. In other words, the 10000 is a lie :)

Motivated by James' reply, I spent some time to find out how the process could be further optimized. The idea to have the workers cons up the data and then collect all data into the hash table at the end of the operation reduced the run time for my application by some 30%, which is not bad.

A comment from Max Mikhanosha on G+ suggested that I could avoid the queue and directly use a thread safe hash table. It is a little embarrassing that I assumed that I could not write hash tables from multiple threads. Both CCL's and SBCL's hash tables provide for thread safe modes, so there is no point in tunneling the data through a queue.

As I am using CCL, I tried different values for the :SHARED keyword argument to MAKE-HASH-TABLE. The default is :LOCK-FREE, which makes hash tables thread safe without requiring the grabbing of a lock for each operation. Other permitted values are T, which locks the table for each access, or NIL for thread-unsafe hash tables.

I found that when writing to the hash table directly from the file readers, :SHARED T would give me the best performance. With that option, my code ran another 10% faster than with consing up the file results and then processing all the lists at the end.

lparallel is a great example how Common Lisp is an extensible language. The operators provided by it seamlessly integrate parallel programming into the language, and with the help of my readers, the code that I originally posted now became very small:

(defun load-files-in-parallel (pathnames)
  (let ((hash-table (make-hash-table :shared t)))
    (lparallel:pmap nil
                    (lambda (pathname)
                      (read-and-parse-file pathname
                                           (lambda (key data)
                                             (setf (gethash key hash-table)

Look, Ma! All I needed was pmap!

Friday, August 10, 2012

Using lparallel to utilize multiple cores [updated]

A while ago, I discovered the lparallel library as a nice set of abstractions to make parallel programming in Common Lisp easier and more pleasant. The use case that I presented in my previous blog post was more about dealing with asynchronicity rather than parallelism, though. As I have seen people ask for lparallel examples, I will present how I used it to actually speed up a long running computation by having multiple cores work on the problem at the same time.

The problem that I have been working on involved reading and parsing a few large data files to extract some data from them. The extracted data needed to be stored in one hash table for later processing. The reading and parsing process is CPU bound, so reading the data files in parallel will speed up the overall process. "Speed up" here means "reduce the run time because I'm waiting for completion", not "make it as fast as possible" or "have it utilize all cores". I'm trying to reduce my own wait times, and I cannot spend time debugging locks. Thus, the high abstraction level of lparallel fits me just right

For the task at hand, I created a set of workers which read the input files and send the extracted data items to a collector process by the way of a queue. The collector process puts them into a hash table for later processing. Once all files are read, the hash table is returned.

(defun load-files-in-parallel (pathnames)
  (let ((queue (lparallel.queue:make-queue 10000)))
      (lparallel:pmap nil
                      (lambda (pathname)
                         (lambda (key data)
                           (lparallel.queue:push-queue (list key data)
                        (lparallel.queue:push-queue :done queue))
    (let ((hash-table (make-hash-table))
          (remaining-worker-count (length pathnames)))
      (loop while (plusp remaining-worker-count)
            for item = (lparallel.queue:pop-queue queue)
            if (eq item :done)
              do (decf remaining-worker-count)
            else do (destructuring-bind (key data) item
                       (setf (gethash key hash-table) data)))

Some points are worth noting:

  • Instead of using a queue to send the data to a separate worker for hash table insertion, the hash table could be locked or a lock-free hash table could be used. This may or may not be faster.
  • The scheduling of the workers is done by lparallel and may or may not be efficient. I've looked at CPU utilization and found the process to use 3-6 cores, so it seems that the scheduling is not optimal. I'm guessing that the queue processing and hash table insertion eat more time than I'd hope for, but I have not made a real assessment.
  • I've picked the size of the queue randomly. It may be worthwhile to make the queue larger so that file readers are not blocked while hash table rehashing occurs.

Implementing this parallel file loader took me just a few minutes and gave my program considerable speedup. Using memory is already easy, and with lparallel, using multiple cores also becomes much easier than it has been for me.

[Update] James M. Lawrence, the author of lparallel, kindly pointed out that the original version of my example code contained an unneeded forced future to read the data. I have updated the code according to his remark.