Tuesday, July 22, 2008

Dot Com Portals: Smart Searching

Most customers who purchase AquaLogic User Interaction (or Plumtree, or WebCenter, or whatever the portal is known as tomorrow) decide to implement it as an intranet portal. This is the easiest way to produce a collaborative user experience and deploy the product as an employee-facing pilot internally. With the advent of ALUI 6.0 and advanced UI tweaking through adaptive tags, more and more customers are pushing the portal outside the firewall. Suntrust Bank (http://www.suntrust.com/), City Of Eugene, Oregon (http://www.eugene-or.gov/portal/server.pt) and Safeco Insurance (http://www.safeco.com/) are just some examples of public-facing "dot com" websites that are powered by the AquaLogic portal.

AquaLogic does a solid job of searching content out of the box, but an issue quickly arises when contemplating how to properly search a public-facing dot com site with the portal's search engine. Most public-facing dot com websites employ a good amount of Publisher-based content. By coupling structured data entry templates with well-defined presentation templates, a customer can build a large number of dot com web pages through a series of content items and published content portlets. Since the portal natively indexes Publisher content items and returns results that link directly to the item itself, a standard search will not suffice for a traditional dot com website. The user would enter a search term, click on a result and immediately find themselves viewing a single content item outside of the framework of the website (no header, footer, LiquidSkin navigation, etc.). An alternate crawling option must be explored if we want to direct the user to the actual web page in which the content item resides.

The easiest way to go about this is to point a web crawler at the home page of the portal and crawl in all pages. This would generally be an acceptable solution if site navigation didn't get in the way. All the pages would be full-text indexed, capturing all the content on each page. Unfortunately this includes ALL the content: the site's navigational links would be included in the full-text index of each page. Therefore, if someone were searching for "Customer Service" on a dot com portal site and a link to the customer service page was in the navigation on every page, the user would be presented with a result set of every page within the entire site. Most search terms would return proper results, but searches run against terms that are featured in the header, footer or other navigational pieces of the site would return vastly skewed results. How do we properly crawl a dot com portal website without crawling the navigation? I came up with the following solution while working for a recent client.
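To make the skew concrete, here is a toy sketch in Python. It is not the portal's search engine; the page names, page text and the `search` helper are all invented for illustration. It simply shows how a naive full-text match behaves when the same navigation text is indexed with every page.

```python
# Toy illustration only: the pages, text and search function are made up.
# Every page shares the same navigation text, as on a typical dot com site.
NAV = "Home Products Customer Service Contact Us"

PAGES = {
    "home": "Welcome to our site",
    "products": "Browse our insurance products",
    "customer-service": "Reach Customer Service by phone or email",
}


def search(term, include_nav):
    """Return the sorted names of pages whose indexed text contains the term."""
    term = term.lower()
    hits = []
    for name, body in PAGES.items():
        # With include_nav=True the navigation is indexed along with the body.
        text = f"{NAV} {body}" if include_nav else body
        if term in text.lower():
            hits.append(name)
    return sorted(hits)


with_nav = search("customer service", include_nav=True)
without_nav = search("customer service", include_nav=False)
print(with_nav)     # every page matches because of the shared navigation
print(without_nav)  # only the page that is actually about customer service
```

With the navigation indexed, the term matches every page; with it stripped, only the relevant page comes back, which is the behavior we are after.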

Without getting down and dirty and coding a very intricate custom crawler, I came up with a way to avoid navigation using the portal's standard web crawling functionality. The real key to making this work is to create an alternate experience definition that has the portal's header, footer and any LiquidSkin navigation removed. Simply copy the existing public-facing dot com website experience definition, rename it with a "Search Experience" designation and delete the header and footer portlets. The next step is to create a new experience rule that points users within a specific IP address range to the Search Experience definition. In setting up the rule, you should use the IP address(es) of your portal's Automation Server(s). With these two experience pieces in place, a web crawler running in a job on your Automation Server will crawl against the alternate experience definition and not index any text found in the portal's navigation. Because the indexed URLs are those of the pages themselves, end users who search your dot com website can click on results that take them directly to the web page in which the content resides, rendered with the normal experience definition.
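The experience rule itself is configured in the portal's administrative UI, not in code, but the decision it makes can be sketched in a few lines of Python. Everything below is hypothetical: the address range, the experience definition names and the `pick_experience` function are placeholders that only illustrate the "match the requester's IP, then choose a definition" logic.

```python
import ipaddress

# Hypothetical range covering the Automation Server(s); use your own addresses.
CRAWLER_NETWORKS = [ipaddress.ip_network("10.0.5.0/28")]

# Hypothetical experience definition names.
SEARCH_EXPERIENCE = "Dot Com Search Experience"   # no header, footer or navigation
DEFAULT_EXPERIENCE = "Dot Com Experience"         # full site chrome


def pick_experience(client_ip):
    """Mimic the rule: requests from the crawler's addresses get the stripped pages."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in network for network in CRAWLER_NETWORKS):
        return SEARCH_EXPERIENCE
    return DEFAULT_EXPERIENCE


print(pick_experience("10.0.5.7"))     # crawler request from the Automation Server
print(pick_experience("203.0.113.9"))  # ordinary visitor
```

Keeping the range as narrow as possible matters here: any real user whose address falls inside it would also see the site without navigation.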

The "catch" in using this alternate experience definition method is that you cannot crawl the dot com website directly from the home page. Without navigation in place, the crawler would not be able to follow the links on your site and properly crawl every page. A workaround for this issue is to create a static HTML file somewhere on your portal's imageserver. This file just contains straight links to each community page in your site. This may seem cumbersome at first, but it is fairly easy to run a SQL query against the portal database and return a list of object IDs for each page. From this list you can use search and replace in a text editor to create the proper link format. Once the HTML file is built correctly and placed in an accessible location, simply point your portal's dot com website crawler at the HTML page and have it follow one level of links. For my recent client we currently maintain a static list of all pages in the site, but we plan on coding an automated page in .NET that queries the database to build the list of page links on the fly. Such an effort is not very intensive, as the SQL query is already written. Server-side load is not a concern either, since the custom .NET page will only be accessed once a day when the crawl runs.
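As an alternative to search and replace in a text editor, the link file can be generated with a short script. The sketch below is in Python and makes several assumptions: the page IDs are already in hand (for example, exported from the SQL query), and `LINK_TEMPLATE` is a placeholder, not the portal's real page URL format, so substitute whatever link format your portal uses. The `build_link_page` name is likewise invented for this example.

```python
from html import escape

# Placeholder URL pattern (an assumption, not the portal's actual link format).
LINK_TEMPLATE = "http://www.example.com/portal/server.pt?pageid={page_id}"


def build_link_page(page_ids, template=LINK_TEMPLATE):
    """Return a bare HTML page containing one plain link per community page ID."""
    links = []
    for pid in page_ids:
        # Escape the URL so characters such as & are valid inside the attribute.
        url = escape(template.format(page_id=pid), quote=True)
        links.append(f'<li><a href="{url}">Page {pid}</a></li>')
    return "<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>\n"


# Hypothetical IDs standing in for the rows returned by the SQL query.
html_page = build_link_page([201, 202, 305])
print(html_page)
```

Save the output to a file on the imageserver and point the crawler at it with a link depth of one, as described above.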

The end result of this alternate method of searching is a complete, fully indexed repository of content that allows users to search against all pages on the website while avoiding any "throwaway" hits that would be generated by the site's navigation. If you decide to go this route in applying search to your public-facing dot com portal and have any questions, don't hesitate to ask. Additionally, if you would like the SQL query used to pull links to all the community pages within the portal, drop me an email and I can get that to you.

Happy "Smart" Searching.


Geoff said...

"With the advent of ALUI 6.0 and adaptive layouts" adaptive layouts didn't come out till 6.5

Jordan Rose said...

Nice find Geoff. I meant adaptive tags (PT tags) instead of adaptive layouts. I made the edit to my post.
