Spring in Practice

Willie Wheeler's Spring blog

Concurrent Programming for Practicing Software Engineers, Part 2


This post is part of a three-part series: Part 1 | Part 2 | Part 3

This post is the second in a three-part series on concurrent programming. In the first post we learned some basic concepts such as threads and serial vs. parallel execution. In this post we’ll learn how parallel execution can create problems if we’re not careful.

How parallelism gives rise to race conditions

Even among parallelizable problems, some are harder than others. The easy ones are the ones that don’t involve threads sharing state with one another. For example, if you have five independent work queues, and the goal is simply to process all the tasks in all the queues, then there’s nothing especially tough about this. You start five threads, run them against the five queues, and then move on once they’re done.
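The no-shared-state case can be sketched in a few lines. In this hypothetical example (the names `drainAll` and `sampleQueues` are mine, not from any library), each of five threads owns one queue and tallies its work in a private array slot; joining the threads is what lets the main thread safely read those tallies afterward.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class IndependentQueues {

    static int drainAll(List<Queue<Runnable>> queues) throws InterruptedException {
        int[] counts = new int[queues.size()];    // one private slot per thread
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < queues.size(); i++) {
            final int slot = i;
            final Queue<Runnable> queue = queues.get(i);
            Thread t = new Thread(() -> {
                Runnable task;
                while ((task = queue.poll()) != null) {  // this thread owns this queue
                    task.run();
                    counts[slot]++;               // no other thread touches this slot
                }
            });
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) {
            t.join();                             // move on once they're all done
        }
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    static List<Queue<Runnable>> sampleQueues() {
        List<Queue<Runnable>> queues = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
            for (int j = 0; j < 10; j++) {
                queue.add(() -> { });             // a trivial task
            }
            queues.add(queue);
        }
        return queues;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(drainAll(sampleQueues()));  // 5 queues x 10 tasks = 50
    }
}
```

Because `Thread.join()` establishes a happens-before relationship, the main thread is guaranteed to see the final values the worker threads wrote to `counts`.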

The trickier problems are the ones that involve threads sharing state with one another. Let’s see why.

Threads and shared state

Shared state presents a challenge in concurrent programming because threads can interfere with one another if their access to that shared state isn’t properly coordinated. If you’ve ever tried to collaborate with other authors on a single document, you might have some experience with this problem. (Though Google Docs does a nice job of supporting multiple simultaneous authors.) You make some changes to the document, hoping that your coauthor didn’t make his own changes to the part that you worked on. If he did, then there may be a mess to clean up.

The essence of the problem is that at least one guy is updating the document while somebody else is either reading it or writing it. We talked about the case where you have two authors writing to the document, but it can be one guy writing and someone else reading. If one person is in the middle of updating emergency information for hurricane evacuees and another person reads a partial update, the result could be dangerous.

Note that there’s no problem if you have a bunch of people simply reading a document at the same time. The problem only comes up if at least one of them is writing to it.

The problem we’ve just described is called a race condition. The idea is that if threads aren’t properly coordinated, then their concurrent work is vulnerable to problems associated with “unlucky timing”. There can be situations in which one thread is trying to update shared state and one or more other threads are trying to access it (either read or write). In effect the threads “race” against one another and the outcome of their work is dependent on the nondeterministic outcome of their race.

Let’s look at a code example.

An example of code that allows race conditions

Originally I was going to present the standard (and probably a bit tired) web page hit counter example. But then today I saw somebody’s purportedly threadsafe class that is in fact susceptible to race conditions. The code was written by an experienced engineer. So let’s use that example instead, shown in Listing 1 below, as it’s instructive.

Listing 1. A class that’s supposed to be threadsafe, but isn’t
public class InMemoryUserDao implements UserDao {
    private Map<String, User> users;

    public InMemoryUserDao() {
        users = Collections.synchronizedMap(new HashMap<String, User>());
    }

    public boolean userExists(String username) {
        return users.containsKey(username);
    }

    public void createUser(User user) {
        String username = user.getUsername();
        if (userExists(username)) {
            throw new DuplicateUserException();
        }
        users.put(username, user);
    }

    public User findUser(String username) {
        User user = users.get(username);
        if (user == null) {
            throw new NoSuchUserException();
        }
        return user;
    }

    public void updateUser(User user) {
        String username = user.getUsername();
        if (!userExists(username)) {
            throw new NoSuchUserException();
        }
        users.put(username, user);
    }

    public void removeUser(String username) {
        if (!userExists(username)) {
            throw new NoSuchUserException();
        }
        users.remove(username);
    }
}

What do you think? See anything? The engineer claims that the class is threadsafe because we’ve wrapped our hashmap in a synchronized wrapper.

Threadsafety problems with the code

It turns out that that’s not correct. It’s a common misconception that as long as you’re dealing with a threadsafe collection (like a synchronized map), you’ve taken care of any threadsafety issues you might have otherwise had. We’ll use this example to show why this belief is just plain wrong.

Before I expose the race conditions, let me state some plausible assumptions about the intended semantics for the class:

  • Calling createUser() with a user whose username already exists in the map should generate a DuplicateUserException.
  • Calling updateUser() with a user whose username does not already exist should generate a NoSuchUserException.
  • Calling removeUser() with a username that does not already exist in the map should generate a NoSuchUserException.

Now let’s see how the implementation fails to support those semantics, and in particular how the implementation permits race conditions.

createUser() is broken. Here’s a race that shows that createUser() does not have the intended semantics:

  1. Thread T1 enters createUser(), with argument user U1 having username N. N is not already in the map.
  2. Thread T1 successfully makes it past the if test, but not yet to the line that puts N in the map.
  3. Context switch to thread T2, which also enters createUser(), this time with argument user U2 having the same username N that U1 has. (You can imagine, for example, two users registering with the same username at the same time.) N is still not in the map, because thread T1 hasn’t put it there yet.
  4. Thread T2 completes its call to createUser() and exits. User U2 and username N are now in the map.
  5. Thread T1 completes the method and exits. User U1 has been placed into the map under the key N.

Poof! User U2 simply disappears from the map, having been overwritten by user U1. T1 never encounters the DuplicateUserException it’s supposed to encounter.
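You don’t need actual threads to see the lost update; the interleaving above can be replayed deterministically in a single thread. In this sketch (my own stand-in code, with Strings in place of User objects), each step corresponds to a numbered step of the race:

```java
import java.util.HashMap;
import java.util.Map;

public class CreateUserRace {

    static String replay(Map<String, String> users) {
        String n = "n";                                  // the contested username
        // Steps 1-2: T1 checks; N is absent, so T1 passes the if test.
        boolean t1SawDuplicate = users.containsKey(n);   // false
        // Steps 3-4: context switch; T2 runs createUser() to completion.
        users.put(n, "U2");
        // Step 5: T1 resumes. Its check is now stale, but it puts anyway.
        if (!t1SawDuplicate) {
            users.put(n, "U1");                          // silently overwrites U2
        }
        return users.get(n);
    }

    public static void main(String[] args) {
        System.out.println(replay(new HashMap<>()));     // U1 -- U2 has vanished
    }
}
```

The stale check is the whole problem: between T1’s containsKey() and its put(), the map changed underneath it.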

updateUser() is broken. This one is a little more subtle than the createUser() case. Here’s the race:

  1. Thread T1 enters updateUser() with user U having name N. The user already exists in the map, so T1 passes the if test successfully. It hasn’t reached the put() part yet.
  2. Context switch to thread T2, which enters and exits removeUser() with username N (the same one that T1 was using). The user U is removed from the map.
  3. Thread T1 completes, placing U back in the map under key N. User U has essentially been reinstated.

The problem here is that the removeUser() method completed successfully, there was no createUser() call, and yet there’s still a user sitting in the map with no error having been thrown. Given the semantics described above, that should not be possible. The only two possible outcomes ought to be: (1) the update occurs before the remove, which would mean that when the two threads complete, U is no longer in the map; or (2) the remove occurs before the update, which would mean that U should be gone, and the update should have generated an exception. In either case, U should not be in the map when the two threads complete, and yet there it is. So if T2 belongs to an admin who thought he was removing a troublesome user, guess again.

removeUser() is broken. We’ve already seen in the race above that there are situations under which removeUser() would be expected to remove a user but does not actually do that. Here’s another race for you, probably less significant, but still a bug:

  1. Thread T1 enters removeUser() with existing username N, and makes it past the if test.
  2. Context switch to thread T2, which enters and exits removeUser() with the same username N.
  3. Thread T1 completes and exits the method.

In this case we have two threads that called the removeUser() method with the same username, but neither thread encountered a NoSuchUserException. That is not compatible with the semantics of the class.

Now that we’ve seen some problems, let’s try to understand what’s happening here.

What the three broken methods have in common

One thing that all three cases have in common is that they follow a “check-then-act” pattern, where both the “check” and “act” parts reference shared data (in this case, the hashmap). The createUser() method checks whether the username already exists, and if not, adds the user to the map. The updateUser() method checks whether the username already exists, and if so, updates the user. removeUser() checks whether the username exists, and if so, deletes the user.

Note that the findUser() method does not follow the same pattern. It does a single get() against the map, and then does a check that checks the returned user rather than checking something against the map. Though it looks like almost the same thing, it is really a completely different situation because we’re performing only one action against the shared data instead of multiple actions.

It turns out that the difference between createUser(), updateUser() and removeUser(), on the one hand, and the findUser() method, on the other, is important from a threadsafety point of view. Essentially what’s happening is that the findUser() method is atomic (with a certain qualification, which we’ll discuss in the next section), meaning that it executes as a single action. The createUser(), updateUser() and removeUser() methods, on the other hand, are not atomic. The various check and act operations can be interleaved in arbitrary ways across different threads. Intuitively what that means is that checks, which are supposed to provide safety, can become invalid because other actions can be scheduled in between any given check-act pair. The safety checks are invalidated and the class behaves in strange ways (especially as load increases, which is usually the most inopportune time for such issues to arise).

Now let’s look at how we can solve race conditions such as the ones we’ve been considering.

Solving race conditions with synchronization

The key to avoiding race conditions is to coordinate the multiple threads when accessing shared data. There are different ways to do this, but the simplest is the one your mom taught you when you were a kid: take turns.

Taking turns with mutexes

Here’s how you take turns in software. Suppose that you have some shared data that you want to protect, such as a list of all the posts in a given web-based discussion forum, or a web page hit counter. You have various blocks of code throughout your application that access that data, both reading it and writing it. These blocks of code don’t have to be in the same class (though they often are), but they’re all accessing the same data.

(Keep in mind that when we describe the data as shared data, we mean that it’s shared by multiple threads, not that multiple blocks of code access it. If multiple blocks of code access the data but only one thread ever does, then it’s not shared data.)

The trick is to protect the code so that only one thread can enter it at a time. I don’t mean that only one thread can enter one of the code blocks at a time. I mean that if you consider the whole set of code blocks as a single group, only one thread can enter that group at a time. So if one thread enters one of the blocks, then all other threads need to be kept out of all other blocks until that first thread is done.

The way we enforce this behavior is to associate with each block of code you want to protect something called a mutual exclusion lock, or mutex. People often just call it a lock as well. You associate a lock with any given piece of data you want to protect, and all the code blocks that access that data use that same lock. A thread is not allowed to enter the code block unless it acquires the lock. So whenever thread T1 wants to enter, it attempts to grab the lock. If the lock is available, then T1 takes it and other threads have to wait until T1 is done. If instead some other thread T2 has the lock, then T1 has to wait until T2 is done.

When a thread leaves the block of code protected by the mutex, known as the critical section, it releases the lock.

Incidentally, I always thought it was kind of funny we talk about grabbing locks. It seems like we should say we grab a key so we can get past the lock. But that’s not how people talk about it.

We now introduce the principal way of implementing a mutex in Java, the synchronized keyword.

The synchronized keyword

In Java the primary mechanism for enforcing turn-taking across threads is the synchronized keyword. You basically put the code you want to protect inside a synchronized block, specifying a lock object (or more exactly, specifying an object whose monitor will be used as a lock—in Java, each object has a “monitor” that can be used as a mutex). If you want to synchronize access to the whole method, you just use the synchronized keyword on the method itself. In this latter case, the locking object is the one on which the method is being called.

Here’s an example of a synchronized block:

synchronized (myList) {
    int lastIndex = myList.size() - 1;
    myList.remove(lastIndex);
}

Here, we’re using the myList instance as the lock. Anytime a thread wants to enter this block of code, or any other block of code protected by the same lock, it needs to grab the lock on the myList instance first. And it releases the lock upon exiting the synchronized block (i.e., the critical section).

Here’s an example of a synchronized method:

public class Order {

    ...

    public synchronized void removeLastLineItem() {
        int lastIndex = myList.size() - 1;
        myList.remove(lastIndex);
    }
}

In this case, a thread wanting to enter the removeLastLineItem() method would need to grab the lock on the relevant Order instance first, and would release the lock after exiting the method.
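A synchronized instance method acquires the same lock as a synchronized (this) block spanning the method body. Here’s a minimal, self-contained stand-in for the Order class (addLineItem and size are illustrative additions of mine) written in the equivalent block form:

```java
import java.util.ArrayList;
import java.util.List;

public class Order {
    private final List<String> myList = new ArrayList<>();

    public void addLineItem(String item) {
        synchronized (this) {                   // same lock: the Order instance
            myList.add(item);
        }
    }

    // Equivalent to: public synchronized void removeLastLineItem() { ... }
    public void removeLastLineItem() {
        synchronized (this) {
            int lastIndex = myList.size() - 1;
            myList.remove(lastIndex);
        }
    }

    public int size() {
        synchronized (this) {
            return myList.size();
        }
    }
}
```

Since every method grabs the same lock (the Order instance’s monitor), the size() - 1 computation and the remove() call can never interleave with an add from another thread.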

Armed with our understanding of mutexes and the synchronized keyword, let’s revisit our example from the previous page.

Our example, revisited: making the code threadsafe

Listing 2 shows how to apply thread synchronization to make the class threadsafe.

Listing 2. A threadsafe version of our DAO
public final class InMemoryUserDao implements UserDao {
    private Map<String, User> users;

    public InMemoryUserDao() {
        users = Collections.synchronizedMap(new HashMap<String, User>());
    }

    public boolean userExists(String username) {
        return users.containsKey(username);
    }

    public void createUser(User user) {
        String username = user.getUsername();
        synchronized (users) {
            if (userExists(username)) {
                throw new DuplicateUserException();
            }
            users.put(username, user);
        }
    }

    public User findUser(String username) {
        User user = users.get(username);
        if (user == null) {
            throw new NoSuchUserException();
        }
        return user;
    }

    public void updateUser(User user) {
        String username = user.getUsername();
        synchronized (users) {
            if (!userExists(username)) {
                throw new NoSuchUserException();
            }
            users.put(username, user);
        }
    }

    public void removeUser(String username) {
        synchronized (users) {
            if (!userExists(username)) {
                throw new NoSuchUserException();
            }
            users.remove(username);
        }
    }
}

In Listing 2 we’ve addressed the three issues that we raised earlier. First note that the users instance is entirely encapsulated by our InMemoryUserDao class (I’ve made the class final to prevent inheritance from breaking our encapsulation here), so that means it is entirely within our power to protect every possible access to the users instance. (Hm, that’s probably a good interview question: “Discuss the relationship between encapsulation and threadsafety.”)

Next note that we’ve addressed the general problem that we saw by making every “check-then-act” occurrence atomic. Now the code is such that if thread T checks something (such as checking whether a username exists), we can be sure that the answer doesn’t change before the action part, because no other thread is able to access the users instance at all until T is done with it.

Internal vs. external synchronization

There’s one detail to discuss. Take a look at the userExists() and findUser() methods. In both cases we access the users instance, but there’s no synchronized block. The reason this isn’t a problem is that both of the following are true:

  1. we’re calling only one method on the users map, and
  2. the map is internally synchronized since we created the map using the Collections.synchronizedMap(...) wrapper.

If either of those two conditions had failed, we’d need to put the calls in a synchronized block. It’s useful to discuss these a little more carefully.

Let’s take (1) first. Note that even though we don’t have an explicit synchronized block, each individual method call we make against users is threadsafe. That’s because the users map is internally synchronized on the monitor for the users instance itself. That is, the map handles synchronization for individual method calls so we don’t have to. If we’re just going to call a single method on an internally synchronized object, we’re usually OK to do so.

The place where we have to pay attention to threadsafety is where we call multiple methods on the users instance. There, the internal synchronization doesn’t help us at all because that synchronization doesn’t prevent the thread from releasing the lock in between the multiple method calls. And when it releases the lock, another thread can come in and change the data. If your code logic assumes that the multiple method calls behave as an atomic unit, you have a problem. Therefore you must provide your own synchronization on the relevant lock when you use check-then-act and similar patterns.
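Here’s a compact sketch of that rule, using a synchronized list rather than a map (removeLastIfPresent is an illustrative name of mine). The isEmpty() and remove() calls are each threadsafe on their own, but as a check-then-act pair they must run under the wrapper’s own lock:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LastItemRemover {

    static boolean removeLastIfPresent(List<String> list) {
        // Check-then-act: without this block, another thread could empty the
        // list between the isEmpty() check and the remove() call.
        synchronized (list) {
            if (list.isEmpty()) {
                return false;
            }
            list.remove(list.size() - 1);
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> list = Collections.synchronizedList(new ArrayList<>());
        list.add("a");
        System.out.println(removeLastIfPresent(list));  // true
        System.out.println(removeLastIfPresent(list));  // false, list now empty
    }
}
```

Locking on the wrapper object itself is the documented client-side locking idiom for the Collections.synchronized* wrappers, since that’s the same monitor the wrapper uses internally for its individual method calls.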

It is a common error to assume that just because the object is internally synchronized, you don’t have to worry about threadsafety anymore. That is simply not true, as we’ve just explained.

Now take (2). Clearly if the map didn’t have any internal synchronization, then we’d need to provide it ourselves, even for single method calls against the map. Otherwise threads could get in and muck things up for other threads. This is true even in many cases where it looks like individual threads are performing atomic actions. For example, if you have some threads making get() calls against a HashMap, and another set of threads making put() calls against the same HashMap, you may think that they are all performing atomic operations and therefore it’s OK to do this. But that is not correct. The problem is that they only look like atomic operations; behind the scenes, they’re impacting the size of a shared backing array, which in turn impacts how hashcodes are mapped to buckets, which impacts the get() and put() calls themselves. If the class had internal synchronization then get() and put() would both be atomic from the perspective of external code, but without that, they aren’t atomic at all.

That was your crash course on solving race conditions with thread synchronization. It’s not a simple topic, but you should now have the basics. But even though thread synchronization helps solve one problem, unfortunately it creates another. The problem is that in forcing threads to stop and wait their turn, you create the potential for the situation in which threads might wait indefinitely before they can move on. This is called a deadlock, and we’ll turn to it in the next post in the series.

This post is part of a three-part series: Part 1 | Part 2 | Part 3
Post migrated from my Wheeler Software site.
