In case of a slow sync or any addition of a supposedly new item it is necessary to find out if an identical or similar item already exists in the database. Usually this is what UIDs are for. But unfortunately, we cannot rely on a UID since an entry most likely will have two different UIDs when originating from two different clients. Or maybe the same entry has been entered in two different address books which were not able to sync until now. So we have to find identical items. But how similar is identical? Or otherwise, how do we know which items can be safely merged?
To find identical items we have to compare entries on a per-field basis. First of all, there are the two trivial cases: All fields identical and no fields identical. In the first case, one can safely assume that the same entry can be used while in the second case a new entry can safely be created.
We have to define a numerical “uniqueness” of every field to find out which items are identical. A phone number, for example, might be a good indicator for uniqueness. However, if two people share a work phone, this is not enough. But an E-mail address combined with a phone number? Or maybe first name, last name and phone number?
To add to this chaos, items might be more or less unique depending on the user. If you usually contact only one person in a company, a company name or a work address might be unique. Therefore, any algorithm must be user configurable.
The solution is to use a point system. Points are added for every field that is identical in both entries and subtracted for every field that differs. A field that exists in only one of the two entries is ignored. If the total is higher than a configured threshold, the two entries are taken as identical. Of course, the point distribution itself is fully user configurable.
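The point system described above can be sketched in a few lines. This is only an illustration, not the actual implementation: the field names, the per-field point values, and the helper names `similarity` and `is_same` are assumptions chosen for the example.

```python
# Hypothetical configuration: field -> (points if equal, points if different).
# These values are illustrative assumptions; the text says they are user configurable.
WEIGHTS = {
    "first_name": (10, -20),
    "last_name": (10, -40),
    "phone_home": (10, -20),
    "email": (15, -10),
}
POINTS_NEEDED = 25  # threshold above which two entries count as identical


def similarity(a, b, weights=WEIGHTS):
    """Return the comparison points for two entries given as dicts."""
    score = 0
    for field, (equal_points, differ_points) in weights.items():
        value_a, value_b = a.get(field), b.get(field)
        if not value_a or not value_b:
            # Field exists in only one entry (or in neither): ignored.
            continue
        score += equal_points if value_a == value_b else differ_points
    return score


def is_same(a, b, points_needed=POINTS_NEEDED):
    """Entries are identical if the points are more than the threshold."""
    return similarity(a, b) > points_needed


client = {"first_name": "Max", "last_name": "Berger", "phone_home": "089 / 1234"}
server = {"first_name": "Max", "last_name": "Berger", "phone_home": "089 / 1234",
          "phone_work": "089 / 5678"}
print(similarity(client, server), is_same(client, server))
```

Because the work phone exists only in the server entry, it does not contribute to the score; only the three shared fields are counted.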
Let us consider the following configuration:
Table 7.5. Example Setup
Points needed: 25
And the following user entries:
Table 7.6. Example Data (Client)
|1. Entry||2. Entry|
|Phone;Home||089 / 8971xxxx||089 / yyyyyyyy|
Table 7.7. Example Data (Server)
|1. Entry||2. Entry|
|Phone;Work||089 / 289 - zzzzz|
|Phone;Home||089 / yyyyyyyy|
When comparing these entries we get numerical results. The following table tries to visualize this: we draw a matrix, putting the entries originating from the client database on top and those from the server database on the left side. The table contents in the middle are the comparison points:
Table 7.8. Comparison Points
|Max Berger||Test User|
|Max Berger||+10+10+10 = 30 > 25||-20-40 = -60 < 25|
|Another User||-20-40-20 = -80 < 25||-20+10+10 = 0 < 25|
In this case, both “Max Berger” entries are considered identical, while “Test User” and “Another User” are considered different. Both “Max Berger” items are merged, and we now get the following results on the server:
One problem with the point system is that every entry in one database has to be compared with every entry in the other database. In my personal setup with about 100 contact entries, this multiplies to 10,000 comparisons. This is far too much.
The solution: find some kind of preselection. We need a field that, if present, usually does not differ between clients, and that is present in almost every entry. Possible fields are:
- First Name
Unfortunately, a first name often has different spellings. Many people use nicknames instead of the real first name, and might not do so on all clients.
- Birth date
A birth date never changes. Unfortunately, birth dates are usually not the thing people put on their business cards.
- Last Name
There are only two ways a last name changes: either by marriage or when it is simply misspelled. The last name could also differ if it is not set.
So the decision falls on the “Last Name” field: entries are only considered for comparison if their last names are equal. This lets us optimize the database for last name comparison. In my personal setup, this reduces the number of entries to compare against to one or two in most cases, and once up to six. This reduces the number of full comparisons needed to about 150.
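The last-name preselection can be sketched as a simple index lookup. This is a minimal illustration under stated assumptions: entries are plain dicts with a `last_name` key, the helper names `index_by_last_name` and `candidate_pairs` are hypothetical, and entries without a last name are grouped together under an empty key, which the text does not specify.

```python
from collections import defaultdict


def index_by_last_name(entries):
    """Group entries by last name so that lookups avoid a full scan."""
    buckets = defaultdict(list)
    for entry in entries:
        # Assumption: a missing last name is treated as the empty string.
        buckets[entry.get("last_name") or ""].append(entry)
    return buckets


def candidate_pairs(client_entries, server_entries):
    """Yield only those (client, server) pairs whose last names are equal."""
    server_index = index_by_last_name(server_entries)
    for client_entry in client_entries:
        key = client_entry.get("last_name") or ""
        # .get() avoids creating empty buckets for unknown last names.
        for server_entry in server_index.get(key, []):
            yield client_entry, server_entry


client = [{"last_name": "Berger"}, {"last_name": "User"}, {"last_name": "Nobody"}]
server = [{"last_name": "Berger"}, {"last_name": "Berger"}, {"last_name": "User"}]
pairs = list(candidate_pairs(client, server))
print(len(pairs), "full comparisons instead of", len(client) * len(server))
```

Only the pairs produced here would then go through the full point-based comparison, so the cost depends on how many entries share a last name rather than on the product of both database sizes.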