At the AICC meeting in Sestri Levante (thank you Giunti) Tom King and I gave a presentation on Search as a Mode of Learning. Our goal is to call out search as a key area of interest for the integration of learning systems. We began by providing a high-level description of how search works and looking at several possible patterns for search integration. Of course integration is only part of the picture. The ability to preview content before committing is a related enabler of effective search, and we discussed several patterns for previewing as well.
I follow developments in the search space closely, and you can keep up with these by tracking my Crowdtrust memos tagged with 'search'.
Integration is emerging as a key issue for learning for a number of reasons.
- A rapidly growing number of resources is available to support learning in all of its many guises. No one source has or will have all possible resources, unless perhaps some combination of search engines.
- Acceleration in the pace of business, globalization, and speed in the dissemination of research results all mean that the resource that was relevant yesterday may not be the best reference today.
- Personalization of learning also means that there will seldom be a one-to-one mapping of the best learning resource to a specific learning objective for all people.
- Integration of learning into other processes, from research and development, to analysis, to execution makes it important to have very flexible mechanisms for content integration.
Search and search integration are, to my mind, the best way to address all four of these challenges.
To set the stage for a discussion of search integration, Tom and I described search as having five basic steps.
- A spider (or crawler) goes out and searches the Internet, following links, and sends information back to its home base. The spider collects a variety of information: the normal natural language processing information about word stems and syntax, site and page metadata, in some cases information about presentation structure, and of course link structure. The result is a rich mix of information about each page and its relation to other pages (and in some cases the level of information is even more specific, down to the paragraph on the page or lower). Modern spiders can even work through sites built using Adobe Flash, and work is progressing on audio and video files (images are still tricky).
- Back home, an index is built, organizing and compressing the information from the spiders and sometimes weaving in other information, such as human judgements and categorizations, or how users have reacted to previous searches. The index is optimized for search and for certain types of queries. There are many ways to build and to store this index: relational databases, tree structures, semantic databases (there may even be some people out there using object databases, though I am not sure how this would help in the case of search, and I have a fondness for object databases). One of the goals of the search world should be to promote many different approaches to indexing.
- Queries are used to search the index for specific information. The query could be a simple text string, perhaps using some regular expressions, or if a relational database is being used in the index a SQL query would be built. Semantic indexes are using either XQuery to search the XML representation or are beginning to use SPARQL to go directly after the RDF or OWL. In most cases, the mechanics of the query are hidden from the user, who enters a simple text string or fills out fields from which a query is built.
- The search results are then packaged up and presented to the user. Most often this is as a weighted list, with the most probable responses to the query at the beginning, and lower ranking results strung out below. More sophisticated systems try to categorize search results, and there are many experiments going on with visual representations of search results. Tag clouds are becoming an increasingly common way to display search results, as they make it easy to display several dimensions of relevance and people are getting used to this way of presenting information.
- Search results can then be organized by the user, reweighted, recategorized, commented on or tagged, and shared with other users. In advanced systems, these user actions are fed back to the search system that incorporates them into the index for use in judging future queries.
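The index, query, and presentation steps above can be sketched in a few lines. This is a toy illustration only, assuming an in-memory inverted index and scoring by raw term counts; the page names and texts are hypothetical stand-ins for what a spider would send back, and real engines use far richer signals (link structure, metadata, user feedback).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each term to {page_id: occurrence count} (a simple inverted index)."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term][page_id] = index[term].get(page_id, 0) + 1
    return index

def search(index, query):
    """Score pages by summed term counts and return them best-first."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for page_id, count in index.get(term, {}).items():
            scores[page_id] += count
    # Highest score first; ties are broken alphabetically by page id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical pages standing in for spidered content.
pages = {
    "intro": "Search as a mode of learning",
    "scorm": "SCORM packaging and learning objects for learning systems",
    "feeds": "Syndication feeds with RSS and Atom",
}
index = build_index(pages)
results = search(index, "learning search")
```

The weighted list returned by `search` corresponds to the fourth step; the fifth step (user reorganization fed back into the index) is not modeled here.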
With these five steps in mind - spider, index, query, present results, organize results - we developed a set of patterns for discussing search integration for learning. This approach is based on Gregor Hohpe and Bobby Woolf's useful book Enterprise Integration Patterns and the supporting website. These patterns can be combined in various ways to provide compelling search integration solutions.
Pattern 1: Allow Spidering
This is the simplest and the most effective pattern. Allow spiders to go through your content and send information back to various indexing systems. There are a number of best practices for designing texts for ease of spidering: use clear metadata, make all relevant text available to the spider, label all media not accessible to the spider, and design your links to support search. These are basically similar to search engine optimization (SEO) approaches, although the goal is subtly different. In search for learning the goal is to maximize the return of highly relevant search results for the specific learning context.
A couple of points to note here. The importance of links in modern search makes it important to have links and to think carefully about their design. The SCORM approach in which a SCO (Shareable Content Object) cannot launch another SCO, which in some cases limits one's ability to weave patterns of links between SCOs, should probably be scrapped. Future versions of learning standards need to support rich linking of content and multiple navigation systems, including systems based on search (whether this search is visible to the learner or not). And in some situations, people may not want to expose their content to spiders, even friendly spiders that have been vetted and allowed in. In this case, other search integration patterns will need to be used.
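One common way to let vetted spiders in while keeping others out is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to check such a policy; the spider name, paths, and URLs are hypothetical, and robots.txt is only one possible mechanism, not something the patterns above prescribe.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a learning content site: one vetted spider
# may crawl everything except drafts, all other spiders are kept out.
ROBOTS_TXT = """\
User-agent: friendly-learning-spider
Disallow: /drafts/

User-agent: *
Disallow: /
"""

def can_spider(agent, url):
    """Return True if the policy above lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)

allowed = can_spider("friendly-learning-spider", "https://example.org/courses/unit1.html")
blocked_draft = can_spider("friendly-learning-spider", "https://example.org/drafts/unit9.html")
blocked_other = can_spider("unknown-bot", "https://example.org/courses/unit1.html")
```

Note that robots.txt is advisory: it tells well-behaved spiders what to skip but does not enforce access control.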
Pattern 2: Publish Metadata
This is the pattern that has been adopted by most learning standards and learning systems to date - publish explicit metadata, either in a general format such as Dublin Core or a learning specific format such as the IEEE LTSC LOM (Learning Technology Standards Committee Learning Object Metadata). And to date it has worked poorly. Most of the systems that are supposed to read this metadata choke on it (the learning content management systems tend to do the best job) and the quality of the metadata itself varies from good, to good but irrelevant, to just awful. For most of the learning content I have reviewed the quality of the metadata falls somewhere between awful and appalling.
There is a role for explicit metadata of course. It provides a content creator with a way to communicate specific information to users. And user generated metadata has an important role to play as well, especially when it is used in social bookmarking systems that can aggregate user keywords and comments. But alone it will never be enough to support full search integration and address the four problems noted above: resource proliferation, rapid change, personalization, integration into multiple processes.
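For readers who have not seen explicit metadata, here is a minimal sketch of a Dublin Core record built with Python's standard library. The wrapping `metadata` element and the field values are illustrative assumptions; real deployments usually embed Dublin Core elements inside a larger container format.

```python
import xml.etree.ElementTree as ET

# The Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dublin_core_record(fields):
    """Build a <metadata> element with one Dublin Core child per field."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root

# Illustrative values only.
record = dublin_core_record({
    "title": "Search as a Mode of Learning",
    "subject": "search integration",
    "language": "en",
})
xml_text = ET.tostring(record, encoding="unicode")
```

A record like this is easy to produce, which is part of why quality varies so much: nothing in the format checks that the values are accurate or useful.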
Pattern 3: Publish an Index
So if formal metadata is not enough, and for some reason you do not want to let spiders in (or you don't have direct access to the content and cannot put your spiders in), then what do you do? The simplest approach may be to generate your own spider data and publish it. This is similar to publishing metadata. The difference is that the file produced includes a great deal of additional information (natural language processing data, presentation structure, link structure) and is meant to be included into an index rather than read directly by another human.
It may be helpful to develop some standards around this, or at the very least to recommend some spiders (preferably open source) that people can run against their own content and use to publish indexing information. This is an area that I would encourage the AICC, ADL and even the IMS (to pick the three most commonly referenced organizations involved in learning specifications) to look into, and perhaps provide sample spiders and outputs for reference purposes. But an excess of specification could limit innovation in an area that is developing very quickly, so the best approach at this time is simply to identify index publishing as a pattern and share code and best practices (of course an even better approach is to let many different spiders go through your content, so that it can be included in many different indexes with many different structures and searched by many types of queries).
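As a rough idea of what running your own spider and publishing the output might look like, the sketch below scans one HTML page and emits a JSON record with its title, outgoing links, and term counts. The record layout, field names, and sample page are all assumptions made for illustration; no such format has been standardized.

```python
import json
import re
from collections import Counter
from html.parser import HTMLParser

class PageScanner(HTMLParser):
    """Collect visible text, the <title>, and outgoing links from one page."""
    def __init__(self):
        super().__init__()
        self.text_parts, self.links, self.title = [], [], ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        self.text_parts.append(data)

def index_record(url, html):
    """Return a publishable record: url, title, links, and term counts."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    text = " ".join(scanner.text_parts).lower()
    terms = Counter(re.findall(r"[a-z0-9]+", text))
    return {"url": url, "title": scanner.title.strip(),
            "links": scanner.links, "terms": dict(terms)}

# Hypothetical page content.
html = ("<html><head><title>Unit 1</title></head><body>"
        "<p>Search supports learning. See <a href='unit2.html'>unit 2</a>.</p>"
        "</body></html>")
record = index_record("unit1.html", html)
published = json.dumps([record], indent=2)
```

An indexing system could merge files like `published` into its own index without ever touching the original content.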
Pattern 4: Publish a Syndication Feed (RSS or Atom)
What do you do if your content changes frequently, so that no formal metadata or occasional spidering will do it justice? Fortunately, this is a common situation and there is a common approach to solving it: content syndication using RSS or Atom. There are even proposals available for extending RSS and Atom to cover common metadata formats and they could easily include indexing files (which I am using to mean the files that contain the natural language processing, presentation and link structure data used in search).
In some of the more dynamic and adaptive approaches to learning content that are emerging the learning resource is not a piece of static content but an RSS/Atom feed, or even a combination of feeds, or a sophisticated packaging of search results.
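A minimal Atom feed announcing content changes could be generated as in the sketch below. The feed id, entry ids, timestamps, and links are hypothetical, and a production feed would normally carry more elements (author, summary, and possibly the metadata extensions mentioned above).

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace

def atom(tag):
    """Return the namespace-qualified name of an Atom element."""
    return f"{{{ATOM_NS}}}{tag}"

def build_feed(feed_id, title, updated, entries):
    """Serialize a feed with one <entry> per changed learning resource."""
    feed = ET.Element(atom("feed"))
    ET.SubElement(feed, atom("id")).text = feed_id
    ET.SubElement(feed, atom("title")).text = title
    ET.SubElement(feed, atom("updated")).text = updated
    for item in entries:
        entry = ET.SubElement(feed, atom("entry"))
        ET.SubElement(entry, atom("id")).text = item["id"]
        ET.SubElement(entry, atom("title")).text = item["title"]
        ET.SubElement(entry, atom("updated")).text = item["updated"]
        ET.SubElement(entry, atom("link"), href=item["href"])
    return ET.tostring(feed, encoding="unicode")

# Hypothetical entries for two recently changed resources.
feed_xml = build_feed(
    "urn:example:learning-feed",
    "Updated learning resources",
    "2007-06-05T09:00:00Z",
    [
        {"id": "urn:example:unit1", "title": "Unit 1 (revised)",
         "updated": "2007-06-05T09:00:00Z", "href": "https://example.org/unit1.html"},
        {"id": "urn:example:unit2", "title": "Unit 2 (new)",
         "updated": "2007-06-04T16:30:00Z", "href": "https://example.org/unit2.html"},
    ],
)
```

A search system that polls this feed learns about changes as they happen, rather than waiting for its next crawl.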
Pattern 5: Federated Search
Finally, we get to the most popular proposal, some form of federated search. In federated search there are ways to pass search queries and results from one search system to another. There are several ways to do this, by exposing a conventional API, by using some form of web service, whether RESTful or WSDL/SOAP based, or even through one of the existing forms of database integration. The logic is often "I understand my own content better than anyone else and I know how to search it most effectively, so pass me a query and I will give you the most relevant results." In some limited number of cases this is even true, and federated search should be standardized and supported within the learning industry. Skillsoft in fact has begun to do this and has a search service as part of OLSA (Open Learning Services Architecture). Shota Aki had a very good presentation on this at the AICC meeting in Sestri Levante, see the presentations for Tuesday June 5.
I do not think this is the best integration pattern though. It assumes that (i) the content side search is as powerful as other search engines, and given the rapid pace of search engine development this is unlikely to be true in most cases. It also assumes that (ii) enough information can (or will) be passed in the search query for effective personalization of the search. Again, I find this unlikely. Finally, (iii) this pattern will not allow as much diversity and experimentation, and its broad adoption will slow down innovation around search in the learning industry. The long-term result of this is likely to be that people will avoid search systems that are constrained to learning systems and rely on the open web, where evolution has been much more rapid. Indeed, given that many people default to Google when they need to learn something for work, one could say this has already happened, and that the learning industry is playing catch up.
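The query-passing mechanism behind federated search can be sketched as follows. The two providers, their records, and their score scales are hypothetical, and the merge rule (divide each provider's scores by its own top score) is just one simple choice. This is not a description of OLSA or of any vendor's API.

```python
# Hypothetical provider data: (id, title, provider-specific relevance score).
CATALOG = [("c1", "Introduction to Search", 80),
           ("c2", "Search Engine Basics", 40),
           ("c3", "Project Management", 90)]
LIBRARY = [("b7", "Search Patterns", 0.9),
           ("b9", "Learning Analytics", 0.7)]

def make_provider(records):
    """Wrap static records as a provider that answers substring queries."""
    def provider(query):
        return [{"id": rid, "title": title, "score": score}
                for rid, title, score in records
                if query.lower() in title.lower()]
    return provider

def federated_search(query, providers):
    """Send the query to every provider and merge results by normalized score."""
    merged = []
    for name, provider in providers.items():
        hits = provider(query)
        # Providers score on different scales, so normalize by each one's best hit.
        top = max((hit["score"] for hit in hits), default=0) or 1
        for hit in hits:
            merged.append({"source": name, "id": hit["id"],
                           "title": hit["title"], "score": hit["score"] / top})
    return sorted(merged, key=lambda hit: (-hit["score"], hit["source"], hit["id"]))

providers = {"catalog": make_provider(CATALOG), "library": make_provider(LIBRARY)}
hits = federated_search("search", providers)
ranked_ids = [hit["id"] for hit in hits]
```

The sketch also shows the limits discussed above: the caller sees only what each provider chooses to return, and the only context a provider receives is the query string itself.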
Patterns for Previewing
Integration of search queries and results passing is only one part of search integration. Another important theme is previewing. When I am returned a set of results from a search query I often want to preview any specific result before committing to selecting it. Preview supports much better selection for the learner, who may hesitate to commit to a resource just on the basis of the information produced in the typical search result. This is especially so when there is some cost involved in accessing the resource, whether this be a time cost, a bandwidth cost, or an actual financial charge.
Pattern 1: Include a Sample
A short indicative sample is included in the search result. There are several sub patterns for obtaining this sample, the content provider can indicate this using description metadata of some kind or the spider can send back enough information for the search system to build a sample on its own. Sometimes both approaches are used. In some cases the index is full of rich information and different samples can be constructed for the same piece of content depending on the search.
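When the index holds the full text, a sample can be built per query, so the same content yields different previews for different searches. The function below is a minimal sketch of that idea; the windowing rule, the default width, and the sample text are assumptions for illustration.

```python
def make_sample(text, query, width=60):
    """Return about `width` characters of `text` around the earliest query term found."""
    lowered = text.lower()
    positions = [lowered.find(term) for term in query.lower().split()]
    positions = [p for p in positions if p >= 0]
    if not positions:
        return text[:width].strip()  # no term matched: fall back to the opening
    start = max(min(positions) - width // 2, 0)
    end = min(start + width, len(text))
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix

# Hypothetical indexed text for one learning resource.
text = ("Spiders follow links and collect page metadata. "
        "The index organizes what they find. "
        "Queries retrieve ranked results from the index.")
sample_for_queries = make_sample(text, "queries")
sample_for_spiders = make_sample(text, "spiders")
```

A provider-supplied description could be used instead whenever it exists, with the generated sample as a fallback.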
Pattern 2: Provide Access to a Sample
In some cases it is difficult to include a sample in the search result. There may be IP protection issues, or the media may simply not lend itself to this (video, a location in Second Life, etc.). In this case, access to some form of sample can be provided as part of the search result. Sub patterns are that the preview could be content placed on a specific sample site for this purpose, or direct access could be provided to the content with some restrictions, such as time or the extent of access.
Conclusions and Suggestions
Search is likely to play a growing role in learning, as it is in so many other areas. Everything from navigation on physical and virtual spaces, to business intelligence, to data integration is being rethought in terms of search and the organization of search results. As more and more aspects of the world, or physical and virtual worlds, are described or even describable (advanced search systems are able to generate their own descriptions) search and related technologies become powerful vehicles for accessing the most relevant content in any context. It is not too much to say that search will replace most content centric learning solutions in the not too distant future.
Learning places specific demands on search systems that we are only beginning to understand. The search needs to be constrained by learning objectives, the learner's evolving mental models and the current context - where a person is in a work process or what is happening in a collaborative group. A great deal more work needs to take place here.
Given this, the learning industry has a long way to go before it should begin to mandate specifications or standards for search integration. What is more important is that we take an open approach to search, open in terms of how we design systems and open in terms of how we allow other systems to access them.
The best approach to search integration is the simplest pattern. Allow other systems to spider your content and design your content so that spiders can collect the most relevant information and support the best search results. This pattern allows for the most diversity, and diversity of approaches is what we need now. No system is ideal in all circumstances and the world of search engineering is developing extremely rapidly, with advances in semantics and social technologies changing the rules of the game. If for some reason you cannot allow spidering then use your own spiders (ideally use more than one so that you can experiment internally) and publish files for indexing systems.
There is a need to develop a common way to pass queries and search results. Here Skillsoft has taken the lead with its Open Learning Services Architecture. The best resource I know of to learn about this approach is the presentation that Shota Aki made to the AICC in June, 2007 at Sestri Levante. It can be found among the June 5 presentations here.
Additional investigations into previewing for different types of media will also be valuable. If you have ideas here, please contact Tom King (the AICC blog is probably the best way to do this) or leave a message here.
By opening our systems and content to support better search we may find that we solve other integration problems as well, as search is rapidly evolving solutions to many types of problems.

