Monday, May 9, 2011

Rating the tours

Sightsplanner is built around the premise that we have an idea of what the user is interested in and are able to rank individual items - more specifically, tourist attractions - from our database accordingly. Once we obtain such rankings, we have learned what the best-matching attractions are and putting the tour together should be a cakewalk.

Let's give it a try. Say we have 3 hours and our interest-matching calculation has provided these items from Tallinn city center:
(1) Oleviste church steeple (1.0), visit time 30 min
(2) Niguliste church (0.8), visit time 60 min
(3) Olde Hansa restaurant (0.7), visit time 100 min
(4) Viru shopping center (0.3), visit time 60 min

The numbers in parentheses (like 1.0) are match scores: the Oleviste steeple appears to match the user's interests best, and besides churches the user also has some interest in dining and shopping.

It is evident that not all of these items can be visited, as the total time would exceed 4 hours, while we only have 3 at our disposal. So we need to take some items and discard the others, making sure we get the most out of the available time - a combinatorial optimization problem.

There are many techniques for solving this class of problems, but they usually have one thing in common - we need to compare one candidate solution to another to find out which one is the best (at least, of those examined so far), so a numerical value is assigned to each - a metric or criterion.

1. The obvious approach

So what happens if we just add the individual item ratings together? Indeed, this looks like a natural way of giving the tour an overall rating: if we visit better attractions, the sum of ratings increases, and this should also favor solutions where the items are packed into the available time as efficiently as possible.

m = sum(m_i)

Where m_i is the score of an item included in the candidate solution.

Looking at our example, the combination of (1), (2) and (4) appears to have the highest m = 1.0 + 0.8 + 0.3 = 2.1 - every time we try to include (3), only one other item fits which gives us a maximum m = 0.7 + 1.0 = 1.7 for solutions including Olde Hansa.

Incidentally, if we use this scoring method, we have modeled our tour-building task as equivalent to what is called the "0-1 knapsack" problem in the literature. The knapsack in the name refers to the anecdotal description of the problem, in which a thief with a limited-size knapsack tries to fill it with the most valuable items possible, while each item also takes up a certain amount of space. The "0-1" part comes from the more formal mathematical description and means that each item may either be taken or not taken (but you're not allowed to cut items up).
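To make this concrete, here is a minimal brute-force sketch in Python (illustrative only, not Sightsplanner's actual code) that enumerates every subset of the example items and keeps the one with the highest sum of scores that still fits into the 180-minute budget:

from itertools import combinations

# (name, score, visit time in minutes) - the example items from above
ITEMS = [
    ("Oleviste church steeple", 1.0, 30),
    ("Niguliste church",        0.8, 60),
    ("Olde Hansa restaurant",   0.7, 100),
    ("Viru shopping center",    0.3, 60),
]
BUDGET = 180  # three hours

def best_by_score_sum(items, budget):
    # Exhaustive 0-1 knapsack: maximize the sum of item scores
    # subject to the total visit time fitting into the budget.
    best, best_m = (), 0.0
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            if sum(t for _, _, t in combo) <= budget:
                m = sum(s for _, s, _ in combo)
                if m > best_m:
                    best, best_m = combo, m
    return best, best_m

combo, m = best_by_score_sum(ITEMS, BUDGET)
print([name for name, _, _ in combo], round(m, 2))
# -> items (1), (2) and (4) with m = 2.1, as calculated above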

At this point we may congratulate ourselves on a job well done and start programming our elegant little scoring model into our recommender. The user, on the other hand, may be left wondering why the program sent him/her shopping while dining had a higher priority. It turns out that this approach has a natural bias towards items with a short visit time, as it is possible to fit more of those into the available time - and this bias is neither intuitive nor acceptable to the user, leading to an unsatisfactory user experience.

In our first example, the situation was somewhat ambiguous, as we had two possible categories of items to choose from and given the limited time, we picked the less time-consuming alternative, while the discarded item wasn't even the highest-scoring one. However, this bias can degrade the quality of recommended tours much more severely. Let's look at three museums:

(A) score 1.0, visit time 100 min
(B) score 0.6, visit time 30 min
(C) score 0.6, visit time 30 min

If we have 100 minutes to use, including (A) limits us to a maximum score of 1.0. Picking (B) and (C) instead, we already reach a score of 1.2 and have 40 minutes left to fill with more items! So we have a high-scoring item (A) that the user will never see in recommended tours. In reality, however, people do visit that attraction - and it is so interesting that they spend a lot more time there on average - so our recommender must have failed somehow.

Here one might argue that it is possible to differentiate the scores more, so that weaker items are discarded anyway. However, this approach is somewhat artificial and does not remove the bias towards shorter visit times among "equal" items. Also, we would like the tour-building phase of the recommendation to work somewhat independently of the calculation of individual item rankings.

2. Compensating for time creates a different kind of bias

Another intuitive approach to rating the tours might be to say, "I would like to spend each minute of the tour at as high-scoring an attraction as possible". From the user experience point of view, this statement looks promising. So, let's see how we can achieve this. If we spend 30 minutes at an item with a score of 1.0 and 60 minutes at an item with a score of 0.4, we could say that on average we spent each minute at a 0.6-rated item. Calculating the overall rating for the candidate solution is again quite simple:

m = sum(m_i * t_i) / t

Where t_i is the visit time of the attraction and t is the total time. Within the calculation of one recommendation, t can actually be dropped, as we always have the same amount of time available for the tour.

Revisiting our museum example, after compensating for the visit times like this, we get:

(A) 1.0 * 100 min = 100
(B), (C) 0.6 * 30 min = 18 each

So we have remedied the earlier flaw: one good item is now superior to several mediocre ones when calculating the total score. Using this new formula to rate the solutions to our first example, our earlier winner of (1), (2) and (4) scores m = 1.0*30 + 0.8*60 + 0.3*60 = 96. Picking (2) and (3) instead gives m = 0.8*60 + 0.7*100 = 118 - a new best solution. This suggests that there is no longer a bias against long visit times; while we have lost the individually highest-scoring item, considering the very limited combinations available in the example this does not necessarily indicate a flaw in the scoring model - if there were a twin item (1a) with the same properties as (1), those two could be inserted to replace (2) and the overall score would increase.
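As a sketch of this second formula (again just illustrative Python), the only change is the scoring function - each item now contributes its score multiplied by its visit time - which reproduces the 96 versus 118 comparison above:

def score_time_weighted(items):
    # m = sum(m_i * t_i); dividing by the fixed total time t is omitted,
    # since t is the same for every candidate solution
    return sum(score * minutes for score, minutes in items)

# candidate solutions as (score, visit time in minutes) pairs
first_winner = [(1.0, 30), (0.8, 60), (0.3, 60)]   # items (1), (2), (4)
new_winner   = [(0.8, 60), (0.7, 100)]             # items (2), (3)
print(score_time_weighted(first_winner))   # 96.0
print(score_time_weighted(new_winner))     # 118.0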

Yet there is a problem with this calculation method also and to understand why, we need to include the search algorithm that actually builds the candidate solutions. In a real-world application, having just 4 matching items is unlikely and would even defeat the purpose of automated recommendation - more likely we would get 40 or 400 matches to the particular user's profile. When that happens, we no longer have computing resources to try out all the possible combinations and need to rely on approximations of what might be the best solution.

Enter the heuristic search. If we look at a state-of-the-art meta-heuristic like GRASP (Sightsplanner also uses a derivative of it), it shares some characteristics with a simple greedy search in that it tends to pick higher-scoring items first - the idea, of course, being that this guides us to a good solution faster.

In the above museum example this would be quite fine, as (A) is likely to be picked first and as long as the search runs, we could try combinations that include (A) and something else. However, the scores may be distributed the other way:

(D) 0.5 * 100 min = 50
(E), (F) 1.0 * 30 min = 30 each

Now, the 100-minute item (D) looks like a more attractive first choice to the search algorithm. Of course, given enough time, the search could eventually find a different combination that includes (E) and (F) and yields a higher total score, but only an exhaustive search could guarantee this. In practice we have to aim for a search technique that converges towards a "good-ish" solution fast.
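This effect on the first greedy choice can be sketched as follows (a simplification of what a GRASP-style constructive phase does, not Sightsplanner's actual search): ranked by their contribution m_i * t_i, the mediocre long item (D) comes before the excellent short items (E) and (F), whereas ranking by the individual score alone would put (E) and (F) first.

MUSEUMS = [
    ("D", 0.5, 100),   # (name, score, visit time in minutes)
    ("E", 1.0, 30),
    ("F", 1.0, 30),
]

# greedy ordering under the time-weighted metric: contribution score * time
by_contribution = sorted(MUSEUMS, key=lambda it: it[1] * it[2], reverse=True)
print([name for name, _, _ in by_contribution])   # ['D', 'E', 'F'] - (D) first

# greedy ordering by the individual item score alone
by_score = sorted(MUSEUMS, key=lambda it: it[1], reverse=True)
print([name for name, _, _ in by_score])           # ['E', 'F', 'D']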

3. Average score looks nice

So, getting rid of the dependence on item visit time has become desirable. Let's revisit the statement of the goal of the search for the best solution - perhaps if we reword it as "I would like each attraction in my tour to be as high-rated as possible", we are not distorting the meaning from the user's perspective. Mathematically, we get quite a difference:

m = avg(m_i)

Our first and second formulas could also be expressed as averages; however, this one is explicitly the average item score. Again applying it to our candidate solutions, we find that the winner by the first formula, the solution consisting of items (1), (2) and (4), gets m = (1.0 + 0.8 + 0.3) / 3 = 0.7, while the winner by the second formula, items (2) and (3), scores m = (0.8 + 0.7) / 2 = 0.75. Checking our museum examples, item (A) obviously always yields a higher average than (B) and (C), while from the search aspect (E) and (F), when using their individual item rankings, remain preferable to item (D).
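A sketch of this third formula applied to the same two candidate solutions (illustrative Python, with items again given as (score, visit time) pairs):

def score_average(items):
    # m = avg(m_i): the plain average of the individual item scores
    return sum(score for score, _ in items) / len(items)

first_winner = [(1.0, 30), (0.8, 60), (0.3, 60)]   # items (1), (2), (4)
new_winner   = [(0.8, 60), (0.7, 100)]             # items (2), (3)
print(round(score_average(first_winner), 2))   # 0.7
print(round(score_average(new_winner), 2))     # 0.75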

The above looks very nice, but there is a catch - so far our model has been too perfect and we have ignored the real-world aspects of having to walk (or drive) from one attraction to another, attractions having opening and closing times, and even the time available to perform the search expiring before something sensible has been produced. So, in a real-world application it is possible to produce tours containing a lot of "waste" time, for example by picking inefficient routes between attractions. Should a candidate solution containing just item (1) appear somehow, it would win by virtue of its unbeatable m = 1.0, even though 2.5 hours of the tour would be completely wasted.

Search algorithms can be guided to always use the available time, but from the aspect of just being able to compare all the conceivable candidate solutions, the problem does not go away. On the other hand, we can attack the issue from the scoring point of view in a fairly straightforward manner.

Let's introduce a variable t_u that denotes the total time not spent at an attraction. We might now calculate some sort of penalty to the overall score that increases as t_u increases and is 0 when t_u is 0 (or low enough). This would put us on the slippery ground of having two criteria - "spend as much time at attractions as possible" and "have the highest possible average". The implications of this are beyond the scope of this article, but considering we get two sometimes diverging scores for each candidate solution, combining them may require some sort of weights (of emphasis, relative importance), which leaves us with the administrative task of tuning these weights to best match our database of items and the expectations for the recommender output.

Instead, let's pretend that we did spend the unused time at some attractions - since they were completely useless, they contribute 0 individual item score each. Our formula then becomes:

m = sum(m_i) / (n + t_u / t_avg)

Where n is the number of items in the candidate solution and t_avg is the amount of time we assign to visiting each imaginary item, which could also be derived from the available item data. We consider the imaginary attractions as discrete items, so only the whole part of t_u / t_avg counts and the remainder is ignored.

Note that we haven't quite avoided using a parameter to aggregate the evaluations of two potentially conflicting criteria as stated earlier. The difference now is that we can predict the behavior of the recommender somewhat - if there is something, anything with a score higher than 0 to insert into that unused time, the recommender will try to do so. The extreme cases where one high-scoring item alone beats a number of very low-scoring ones remain - without collecting statistics on recommendations it is too early to state whether this is an issue that still needs addressing.

For the sake of completeness, here are the scores for the candidate solutions we have found so far, using t_avg = 45 min: items (1), (2) and (4) leave 30 minutes unused (less than one imaginary item), so m = (1.0 + 0.8 + 0.3) / (3 + 0) = 0.7; items (2) and (3) leave 20 minutes unused, so m = (0.8 + 0.7) / (2 + 0) = 0.75; a solution consisting of just item (1) leaves 150 minutes unused (three imaginary items), so m = 1.0 / (1 + 3) = 0.25.
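As a final sketch (again illustrative only), the same calculation in Python, with the whole-item truncation of t_u / t_avg made explicit:

def score_with_idle_penalty(items, total_time=180, t_avg=45):
    # m = sum(m_i) / (n + floor(t_u / t_avg)): the unused time t_u is
    # filled with imaginary zero-score items of length t_avg, and the
    # fractional remainder of t_u / t_avg is ignored
    t_u = total_time - sum(minutes for _, minutes in items)
    imaginary = t_u // t_avg
    return sum(score for score, _ in items) / (len(items) + imaginary)

print(round(score_with_idle_penalty([(1.0, 30), (0.8, 60), (0.3, 60)]), 2))   # 0.7
print(round(score_with_idle_penalty([(0.8, 60), (0.7, 100)]), 2))             # 0.75
print(round(score_with_idle_penalty([(1.0, 30)]), 2))                         # 0.25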
