Finding last names in Excel

I talked to someone today who had a list of names in Excel, and wanted to sort them by last name. Unfortunately, they were not split into first / last names. What is the (non-programmer) to do? It’s actually a good opportunity to talk about doing stuff in Excel, since it tends to be a tool I end up using a lot to work with data. And this is a surprisingly common problem. It is also a nice way to talk a little bit about functions in Excel beyond “SUM”. (If you want to cut to the chase, hereis the Excel document.) So, let’s get started:

First

For reasons that may not appear immediately obvious, I’m going to treat this a bit like a programming exercise. Before starting to write it out in Excel, let’s figure out what we need to do. Take, for example, the following name:

  A B
1 George Bush  

How is it that we figure out what we need to do. Even though we are going for the surname, let’s start with the slightly easier version of finding out the first name.

First, we need to state what we want to do as clearly as possible:

In B1, put the left-hand word from A1

The LEFT Function

That’s good, but unfortunately, Excel doesn’t have a function “left-hand word”–it doesn’t even know what a word is! It does, however, have the function LEFT which takes a number of characters from the left-hand side of a string of characters. So, we have a solution:

  A B
1 George Bush =left(A1,6)

If we type this in and hit ENTER, we end up with “George” in B1. Which is great for people with 6-letter surnames, but not-so-great for people with 5-letter surnames. So we need to rethink our specification, and make it more detailed:

In B1, put a string of characters representing everything left of the space in the string of characters in A1

OK, Excel doesn’t know what a “word” is, but we can tell it explicitly what the space is. Unfortunately, again, there isn’t a function “everything left of the space.” We have “LEFT” but it needs a specific number. So clearly, we need to find out how far from the left the space is. Luckily, there is a function that does that: FIND. Let’s try this:

  A B
1 George Bush =find(” “,A1)

(Note that there is a SPACE between those two quotation marks.) If you hit enter, you will have a 7, since the space first shows up in position seven in the string. Now we’re getting somewhere.

The Two-Step

So we know where the space is, and we have a way of getting strings of arbitrary sizes from the left-hand side of something. What if we put these two together. Let’s add a column.

  A B C
1 George Bush 7 =left(A1,B1-1)

So, what the formula in C1 does is take the string in A1, and chop of the first n letters, where n is taken from the formula in B1. Why are we subtracting 1 from that number? A first name doesn’t include the space, right? So we take one shy of the number up to the space.

Putting It All Together

In this case, you can probably see how to smoosh everything together. As we shall see later on, this isn’t always possible. Sometimes, you need interim solutions. But here we can put it all together in a single formula in a single cell:

  A B
1 George Bush =left(A1,find(” “,A1)-1)

If you enter this formula and press enter, you have the first name.

Right!

What about the last name? If you think there might be a function RIGHT to go with LEFT, you would be right. So, all we need to do is figure out how many characters on the right we need. First, let’s make a useful mistake:

  A B C
1 George Bush 7 =right(A1,B1-1)

If you enter this formula, you will end up with “e Bush.” The FIND function only gives us the location of the space as a number of characters from the start (left-hand side) of a string of characters. How do we find the number from the right-hand side. Naturally, it is the total length of the string, minus the number of characters on the left-hand side. Luckily, we can get the total length of the string with the LEN function:

  A B C D
1 George Bush =len(A1)    

This gets us the length of the string. We still need to know where that space is:

  A B C D
1 George Bush 11 =find(” “,A1)  

Now we have the two pieces we need, and we can get the right-hand part of the string we want:

  A B C D
1 George Bush 11 7 =right(A1,B1-C1)

And we end up with Bush. More importantly, just about any first-last combo you put in A1 will yield the correct last name. (If you try “Sting” or “George W. Bush” you run into troubles, but we’ll deal with one of those in a moment.)

Of course, we can again smoosh this all into a single-cell for economy and neatness:

  A B
1 George Bush =right(A1,len(A1)-find(” “,A1))

Assuming that your list of names has only one space, you are in good shape. Just copy the formula in B1, and copy it down the whole list of names:

  A B
1 George Bush Bush
2 Mickey Mouse Mouse
3 Benedict Arnold Arnold
4 Judas #VALUE!
5 George W. Bush W. Bush

Uh, oh! Some problems toward the bottom there. But generally, we have a solution as long as there is a single space with a first and last name. If that’s your case, you can stop now. Otherwise, things are going to get…

A Little Bit More Complicated

What do we do with a problem like “George W. Bush”? Since FIND looks for the first occurrence of a space, we end up taking everything right of the first space. Well, we could do that again and again until we had no space. That might work. What would that look like? Well, we start with our initial plan:

  A B C D
1 George W. Bush =right(A1,len(A1)-find(” “,A1)    

This gets us “W. Bush”. Now, we can copy the formula in B1 to C1. When we do this, it automatically adjusts so that it is looking at B1 instead of A1, so we end up with:

  A B C D
1 George W. Bush W. Bush Bush  

Alright, so we’ve solved the problem of the middle initial!

Mixing It Up

But what if, like in the earlier example, some of our names have middle initials and some do not. Let’s try copying our formulas in B1 and C1 to a list of names:

  A B C D
1 George W. Bush W. Bush Bush  
2 Jimmy Carter Carter #VALUE!  
1 Che Guevara Guevara #VALUE!  

Well, it is getting the names, but they are all over the place. Sometimes, the correct name shows up in the B column, and sometimes the C. But we’re nearly there.

Getting In Line

OK, so shouldn’t we be able to put these all in the same column somehow? Of course we can! Let’s try again to state what we want to do as clearly as we can in English:

If there is an error in the C column, use the name in B; otherwise, use the one in the C column.

Naturally, computers are really good with statements like that. We have two new functions we’ll use to make this happen: IF, and ISERROR. IF is just that, it checks whether something is true and if it is, it puts something in a cell. If it isn’t, it does something else. ISERROR tells you whether an error appears in a cell. So we now have everything we need to make that statement. (We could do this in two separate cells, but I’m going to make a great leap forward, and just put it all in one. Generally, though, if something isn’t working, for example, you want to try in multiple cells first.)

  A B C D
1 George W. Bush W. Bush Bush =if(iserror(C1),B1,C1)

In D1, up pops “Bush.” Why? Because the cell checks to see whether an error appears in C1. If it did, it would print out the contents of B1. But since it doesn’t, it prints out the contents of C1. What happens if it does find an error in C1? Well, we can copy and paste to find out.

  A B C D
1 George W. Bush W. Bush Bush Bush
2 Jimmy Carter Carter #VALUE! Carter
1 Che Guevara Guevara #VALUE! Guevara

If you are thinking that there is a way to mush this all into a single cell, you are right, but I’ll leave that as an exercise to the reader. Also, we haven’t made the world safe for the single named, like Madonna and Sting. If either of these people are on your list, you should just remove them; they are already on plenty of lists. (NB: If you extend the logic we have used above, you can avoid deleting them.) Also, if you use Che’s real name (Ernesto Guevara de la Serna), you run into real problems, but parsing out extended names in the Hispanic tradition is something best left to humans.

Cleaning it up

Truth is, in many organizations, if you give this to your boss, he or she will be unhappy. Trust me, I know from experience: folks often do not like formulas in their spreadsheets. Especially if they didn’t write them, they often find them to be confusing or capricious. So, it may be worth locking these down as values. Highlight all the formulas that have generated your last name. Copy, and then Edit>Paste Special, selecting “Values.” Now you have a column that looks to all the world like you spent the afternoon typing surnames.

This entry was posted in General and tagged , . Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

10 Comments

  1. Posted 7/12/2006 at 12:42 pm | Permalink

    why didn’t you just do a column split based on spaces

  2. Posted 7/12/2006 at 1:01 pm | Permalink

    You could probably do the same thing by opening the file as text file space delimited. This way you would handle the Middle name easier without formulas.

  3. Posted 7/12/2006 at 1:17 pm | Permalink

    Mostly this was an excuse to introduce the idea of functions and of working with something over a number of columns, and the idea of thinking procedurally about a problem. Of course, splitting by space (either Data > Text to Columns) or importing with a column split would work as well as the formula, but wouldn’t give me that chance…

  4. Posted 7/12/2006 at 4:37 pm | Permalink

    uhm… is this the most genius bit of bloghumor I’ve seen in eons?????? two meanings to this post, cool excel how-to and cool point about….

    ….even better if intentional.

  5. Posted 7/16/2006 at 1:02 pm | Permalink

    If only you didn’t hate Perl :)

  6. alex
    Posted 7/19/2006 at 2:20 pm | Permalink

    I don’t hate Perl. And like Perl, this can be done in a little bit of Python. Or, given that we are within Excel, in Basic. But the idea here is that most readers don’t know any of those. I’ve suspected for a while that Excel might be an interesting way to teach non-programmers how to program.

  7. Posted 7/20/2006 at 3:21 pm | Permalink

    Sorry, didn’t mean to be rude :) The notion of using Excel to start teaching programming is an interesting one. The programmer in me shudders at the notion, but that’s just cultural and it really makes a lot of sense, as it’s relatively readily available (and OpenOffice *is* readily available as an alternative).

    Here’s another interesting example: calculating the Fibonacci sequence (a common learning programming exercise) in Excel is pretty straightforward: simply put a 1 in cells A1 and A2, put the formula =(A1+A2) and copy that cell and continue to paste it down the A column, which will transpose the formula effectively.

    The only problem is that Excel, unless you get into using Visual Basic (and why not, I guess), is sort of a stateless, declarative programming language, which doesn’t easily translate to something like Python or Java.

    I’m not sure how far you could take this, but it would be an interesting way to start an intro to programming gen-ed type class and would leave the students with some useful skills rather than simply being able to write a card shuffling program in Pascal :)

  8. Posted 7/20/2006 at 3:24 pm | Permalink

    Sorry, in the Fib sequence example, it should have said to put the formula =(A1+A2) in cell A3.

  9. Carlos
    Posted 7/28/2006 at 1:48 pm | Permalink

    What happens if, like in many Latin communities, the list includes both surname and mother’s maiden name, in addition to the first name? Or if instead of a middle initial, the full middle name is included? How can we adapt the formula to accomodate this situation? As far as I know, the find function will locate the first occurence only.

  10. alex
    Posted 7/28/2006 at 3:06 pm | Permalink

    Carlos: That’s why I included Che as an example up there. It’s really hard to do it in this case, because culturally, you are able to identify where given names end and surnames begin. I don’t think there is a way to do this without resorting to a dictionary of names, which gets to be pretty complex, and not super-reliable.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>