Sunday, June 5, 2011

The Balancing Act

In http://sightsplanner.blogspot.com/2011/05/rating-tours.html we discussed a method of rating the output of the heuristic search, so that tours containing the individually highest-ranked items would be preferred. Having a reliable rating method is the basis for guiding the search and for selecting the best from the candidate solutions.

One method of differentiating the tourist attractions in the Sightsplanner database is their popularity property - a number that describes the relative activity of tourist visits to the attraction. The most active site currently has 2500 daily visits, while some obscure sites may go as low as 1; a typical museum might have 100 visits. The popularity property is aggregated into the final item score, so that of two otherwise identical items, the one with the higher popularity property will receive a higher score.
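As an illustration of how such an aggregation might work, here is a minimal Python sketch. It is not the actual Sightsplanner formula - the logarithmic normalization and the blending weight are our assumptions, chosen only because the raw visit counts span several orders of magnitude:

import math

# Hypothetical sketch: fold the popularity property (1..2500 daily
# visits) into an item's final score. The log scale and the 0.25
# weight are illustrative assumptions, not the actual aggregation.
MAX_POPULARITY = 2500.0

def aggregate_score(match_score, popularity, weight=0.25):
    # Normalize popularity to [0, 1] on a log scale, so the jump from
    # 1 to 100 visits matters more than the jump from 2400 to 2500.
    pop_norm = math.log(popularity) / math.log(MAX_POPULARITY)
    return (1.0 - weight) * match_score + weight * pop_norm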

Although the ability to differentiate between similar-looking (to the machine) churches or museums is highly useful for enhancing the user experience, it also introduces side effects. In a historically attractive town like Tallinn, sites like medieval architecture are far more popular among tourists than sports and outdoor activities - in other words, items in some categories will tend to have a higher score on average than items in others.

From statistics taken from the Sightsplanner database of 835 tourist attractions, dated 01.03.2011, the average values of the popularity property were the following:

[Graph: average values of the popularity property per category]

The categories represent the seven sliders on the Sightsplanner homepage. Even though these averages don't tell us the exact statistical distribution, it would not be unusual if a large number of architecture-related items had a higher popularity than the best sports-related item in the database.

So imagine we push both the "Architecture & City" and "Sports & Outdoor" sliders to the maximum, leave the remaining ones at the minimum, and ask Sightsplanner for a recommendation. Both categories of items should match our preferences with a similar level of confidence, but after adjusting for popularity, the top-ranked items would be mainly, or even exclusively, architecture-related. Applying the methods described in the "Rating the tours" article, the recommender would then accordingly attempt to produce tours that consist of architecture objects. Does this meet the user's expectations?

To answer that question, we will take a step back and look at how the preferences are entered on the recommender homepage.

1. Interpretation of sliders

Events: 0
Architecture & City: 1.0
Museums & Arts: 0.5
Eating out: 0

In this sample of sliders, 0 marks the leftmost position, indicating that there is no interest in that category. The rightmost position translates to the value 1 and the positions in between to fractional values.

In order for the recommender software to produce output that matches the user's preferences, we must, as the first step, unambiguously define what this configuration of sliders means. There are two plausible interpretations:

Interpretation 1: I would like to spend 66% of my time seeing architecture and 33% of my time visiting museums
Interpretation 2: I like architecture twice as much as museums (and hate the rest). Thus architecture sites on average have a match score of S, museums have S/2 and the rest have 0.

If we could pick one of these as the optimization criterion and ignore the other, we would have a straightforward way of searching for the optimal solution. As we saw, introducing the generally useful notion of popularity as an item score component produced the effect where one category could occupy all of the top spots in the item rankings - so using the first interpretation would help to ensure that the categories receive fair representation. The choice of time as the method of determining shares is arbitrary, but seems better suited than counting the items. On the other hand, the second interpretation remains one of the main design concepts of Sightsplanner and cannot be discarded either - in other words, we need to accommodate both interpretations.
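To make the difference concrete, here is a small Python sketch that derives both readings from the same slider values; the data layout and function names are ours, for illustration only:

sliders = {"Events": 0.0, "Architecture & City": 1.0,
           "Museums & Arts": 0.5, "Eating out": 0.0}

def time_shares(sliders):
    # Interpretation 1: slider values as shares of the tour time.
    total = sum(sliders.values())
    return {cat: v / total for cat, v in sliders.items() if v > 0}

def relative_scores(sliders, base_score=1.0):
    # Interpretation 2: slider values as relative match scores.
    top = max(sliders.values())
    return {cat: base_score * v / top for cat, v in sliders.items()}

print(time_shares(sliders))      # architecture ~0.67, museums ~0.33
print(relative_scores(sliders))  # architecture S, museums S/2, rest 0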

The goals of the recommendation have then become:
- I would like each attraction in my tour to be as high-rated as possible
- I would like each category that I've expressed interest in to receive a fair share of the tour

2. Tackling the categories

The individual item rating as an optimization criterion is already addressed by our tour ranking method. The redefinition of goals has upgraded our problem classification into the field of multiple-criteria optimization - we should now also observe whether all the categories present in the user profile are present in the candidate solutions, and favor those where the proportions of categories by time are similar to what was indicated in the profile. We shall call this new criterion the balance criterion, as it represents the balance between categories.

There are two general methods of doing multiple-criteria recommendation. One is aggregating all the criteria into a single value and optimizing for that; the other is to treat the criteria separately, using a specially designed search algorithm that produces a set of Pareto-optimal solutions, which is then presented to the user.

In the current concept of Sightsplanner the recommender is expected to return a single optimal solution. For that purpose, computing the Pareto set does not appear to add much value, as automatically picking a single solution from the set would still require computing a single aggregate score for each of the set members.

We now know that we need to arrive at a single rating value that describes both the average individual item score and the balance of categories. The formula we start with is:

m = sum(m_i) / (n + t_u / t_min)

(Here m_i is the individual item score, n is the number of items and t_u is the unused time. Also, the earlier notation of t_avg has been dropped in favor of t_min, the minimal visit time of a recommended item.)
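In code, the base formula is a direct transcription - a minimal sketch, assuming the item scores are given as a list and both times are in the same unit:

def tour_score(item_scores, t_unused, t_min):
    # m = sum(m_i) / (n + t_u / t_min): the average item score, where
    # unused time counts as t_u / t_min phantom zero-score items.
    n = len(item_scores)
    return sum(item_scores) / (n + t_unused / t_min)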

Observe that the score formula already contains the notion of "unused" time. This is a construct we can exploit further. Consider our target time shares of 33% museums and 66% architecture from the sliders example - if the items in the candidate solution indeed make up these proportions, we can call the time spent on this tour "well utilized". If, on the other hand, the items constituted 80% architecture and 20% museums, we would have an extra 14% of architecture that was not needed. In this case 86% of the time was well utilized.

This fits rather well into our score formula. Instead of looking at the average score of all the items, we only consider those items that contribute to filling the expected category time shares. The rest of the items are discarded, so that the search is no longer rewarded for piling up items from a single top-ranking category. The time that was left over from filling the category shares is added to the otherwise unused time.

The new, Utilized Time Model formula:

m = sum(m*_i) / (n* + (t - t*) / t_min)

Where m*_i denotes the individual item scores of the contributing items, n* is the number of contributing items and t* is the "well utilized" time. When there are more items in a category than required to fill it, the items are prioritized by their individual score - the higher-scoring ones are included first. The items are treated as discrete, so in case there is an item without which a category would remain unfilled, but adding it "overflows" the category, we allow the item to contribute to the score if more than half of its visit time is useful.
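The following Python sketch is one possible transcription of this model. It assumes items are (score, visit_time, category) tuples and that the quotas are given per category in minutes; what happens to the quota when an overflowing item fails the half-rule is left open above, so skipping it and letting a shorter item fill the remainder is our assumption:

def utilized_time_score(items, quotas, t_total, t_min):
    # items: (score, visit_time, category) tuples
    # quotas: category -> expected time share in minutes
    contributing_scores = []
    t_star = 0.0  # well-utilized time, t*
    for category, quota in quotas.items():
        remaining = quota
        in_cat = sorted((it for it in items if it[2] == category),
                        key=lambda it: it[0], reverse=True)
        for score, visit_time, _ in in_cat:
            if remaining <= 0:
                break  # share is filled; further items are discarded
            useful = min(visit_time, remaining)
            if useful > visit_time / 2:
                # Fits entirely, or overflows with more than half
                # of its visit time still useful.
                contributing_scores.append(score)
                t_star += useful
                remaining -= visit_time
            # Otherwise skip: a shorter, lower-scored item may still
            # fill the remainder of the share (our assumption).
    n_star = len(contributing_scores)
    return sum(contributing_scores) / (n_star + (t_total - t_star) / t_min)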

3. Avoiding overdoing it

Sightsplanner computes the expected time shares of categories in minutes, so that a 100% share would be equal to the duration of the entire tour. This approach was chosen because it makes the GRASP search inside the planner easier to guide - until the entire tour is filled with items, some categories will remain unfilled and the heuristic can prioritize the items appropriately while inserting them one by one.

If, on the other hand, we looked at the proportions of the items currently in the half-finished solution, we might think that after inserting a single museum into the solution we have 100% museums, meaning that we have grossly overfilled our 33% expectation. This can confuse the heuristic and produce anomalies (an item that is actually needed may get a weak heuristic score because it incorrectly appears to be overfilling a category).

When using time quotas, meeting the balance criterion exactly is likely to be too demanding. Even if the planner gets the item proportions right, the travel time between attractions makes it impossible to fill the quotas given in minutes. This can be compensated for by reducing the quotas by some predetermined amount - in other words, relaxing the balance criterion.

More importantly, though, it is likely that the recommended items won't fit neatly, leaving categories under- and overfilled. Let's look at a different configuration of sliders:

Events: 0.2
Architecture & City: 1.0
Museums & Arts: 0.2
Eating out: 0.2

A value of 0.2 corresponds to a slider moved 1/5th of the way from its leftmost, "uninterested" position. Let's assume that the duration of the tour is long enough that our Utilized Time Model formula properly rewards the planner for including items from the "small" categories. The shares are so small, however, that they consistently get overfilled by even a single item. Together, these extra minutes assigned to events, museums and eating out take a large portion of time away from the architecture category. It again comes down to interpreting user expectations, but it is not unreasonable to assume that for a tour of short duration, items of low priority should be excluded, instead of being included at the expense of a high-priority category.

To address this, we take a slightly different approach to relaxing the balance criteria. The quotas will be lowered more if the sliders are towards the "uninterested" position and less if they are towards the "interested" position. This creates some buffer time to travel between recommended attractions, as mentioned earlier, but mostly at the expense of categories that the user is less interested in.

We will omit the actual time quota table from Sightsplanner for brevity and just look at how our sliders example would be affected. A slider position of 0.2 is assigned a ratio of 0.5 and the slider with a position of 1.0 has a ratio of 1.0. Assuming that the tour length is 3 hours, we get the following time quotas:

Events: 180 minutes * (0.2/1.6) * 0.5 = 11 minutes
Architecture & City: 180 minutes * (1.0/1.6) * 1.0 = 113 minutes
Museums & Arts: 180 minutes * (0.2/1.6) * 0.5 = 11 minutes
Eating out: 180 minutes * (0.2/1.6) * 0.5 = 11 minutes

So everything except the architecture will be fairly insignificant and the planner won't be rewarded for recommending items in these categories, unless they have very short visit times (following the rule that at least half of the item's time should contribute to something for its score to count, an item needs to be under 22 minutes in this case).
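Sticking to the two ratio points given above (0.5 at slider position 0.2 and 1.0 at position 1.0), the quota computation can be sketched as follows. Since the actual table is omitted here, the linear interpolation for intermediate positions is our assumption:

def relax_ratio(position):
    # Two known points: 0.2 -> 0.5 and 1.0 -> 1.0. Linear
    # interpolation in between is an assumption; the real quota
    # table is omitted from this article.
    return 0.5 + (position - 0.2) * (1.0 - 0.5) / (1.0 - 0.2)

def time_quotas(sliders, tour_minutes):
    total = sum(sliders.values())
    # Round half up, matching 112.5 -> 113 in the example above.
    return {cat: int(tour_minutes * (v / total) * relax_ratio(v) + 0.5)
            for cat, v in sliders.items() if v > 0}

sliders = {"Events": 0.2, "Architecture & City": 1.0,
           "Museums & Arts": 0.2, "Eating out": 0.2}
print(time_quotas(sliders, 180))
# {'Events': 11, 'Architecture & City': 113, 'Museums & Arts': 11,
#  'Eating out': 11}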

4. Other considerations

Our scoring model allows one more simple method of refining the calculation. So far we've exploited the utilized and unused time constructs; the other major component is the individual item scores that contribute towards the average. These could be modified to encourage or discourage specific behavior of the planner. Let's look at one application of modifying the individual item scores.

Links are the parts of the tour that fall between two waypoints (recommended items). Sometimes there is a remotely located, but very interesting attraction that would contribute a significant amount to the overall score. It is possible that the high score outweighs the disadvantage of having to expend a large amount of time to get there, simply because all the alternative attractions are so much worse. In such a case, a recommendation where the user has to, for example, walk 60 minutes from one attraction to another is possible.

However, this would not be the case if the remote attraction's score were lowered precisely because of its remoteness. This is fairly simple to implement: if the link leading to an item is longer than some specified amount of time, the item score is lowered. The planner is no longer rewarded for including the item and might have to look closer. Note that the penalty is specific to that particular link, so the item might still be a healthy choice if we later end up somewhere closer to it.

Specifically, the operator of the recommender can set soft and hard tolerance limits (such as 30 min and 60 min) for the link length - we start lowering the item score once the link reaches the soft limit, and it reaches the minimum at the hard limit. The item score is then:

m*_i = \epsilon + (t_hard - l_i) (m_i - \epsilon) / (t_hard - t_soft)

\epsilon is some small score greater than 0 that makes this item more useful than not visiting anything at all; l_i denotes link length and m_i is the unmodified score.
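In code, the penalty is a linear interpolation, clamped at the limits - a minimal sketch, where the \epsilon value and the default limits are just the examples used above:

EPSILON = 0.05  # small positive floor; the exact value is a tunable

def link_penalized_score(m_i, link_minutes, t_soft=30.0, t_hard=60.0):
    # No penalty below the soft limit.
    if link_minutes <= t_soft:
        return m_i
    # Floor at and beyond the hard limit.
    if link_minutes >= t_hard:
        return EPSILON
    # Linear interpolation between the limits, per the formula above.
    return EPSILON + (t_hard - link_minutes) * (m_i - EPSILON) / (t_hard - t_soft)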