Tuesday, July 22, 2008
Aqualogic does a solid job of searching content out-of-the-box but an issue quickly arises when contemplating how to properly search a public-facing dot com site with the portal's search engine. Most public-facing dot com websites employ a good amount of Publisher-based content.
Coupling structured data entry templates with well-defined presentation templates, a customer can build a large number of dot com web pages through a series of content items and published content portlets. Since the portal natively indexes Publisher content items and returns results that link directly to the item itself, a standard search will not suffice for a traditional dot com website. The user would enter a search term, click on a result and immediately find themselves viewing a single content item outside of the framework of the website (no header, footer, LiquidSkin navigation, etc.). An alternate crawling option must be explored if we want to direct the user to the actual web page in which the content item resides.
The easiest way to go about this is to point a web crawler at the home page of the portal and crawl every page. This would generally be an acceptable solution if site navigation didn't get in the way. All the pages would be full-text indexed, capturing all the content on each page. Unfortunately this includes ALL the content. The site's navigational links would be included in the full-text index of each page. Therefore if someone were searching for "Customer Service" on a dot com portal site and a link to the customer service page appeared in the navigation on every page, the user would be presented with a result set containing every page in the entire site. Most search terms would return proper results, but searches for terms that are featured in the header, footer or other navigational pieces of the site would return vastly skewed results. How do we properly crawl a dot com portal website without crawling the navigation? I came up with the following solution while working for a recent client.
Rather than coding an intricate custom crawler, I found a way to avoid navigation using the portal's standard web crawling functionality. The real key to making this work is to create an alternate experience definition that has the portal's header, footer and any LiquidSkin navigation removed. Simply copy the existing public-facing dot com website experience definition, rename it with a "Search Experience" designation and delete the header and footer portlets. The next step is to create a new experience rule that points users within a specific IP address range to the Search Experience definition. In setting up the rule, you should use the IP address(es) of your portal's Automation Server(s). With these two experience pieces in place, a web crawler running in a job on your Automation Server will crawl against the alternate experience definition and will not index any text found in the portal's navigation. This method also allows end users to search your dot com website and click on results that take them directly to the web page in which the content resides.
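To make the logic of the experience rule concrete, here is a minimal Python sketch of the decision it encodes. It is an illustration only, not portal code: the rule itself is configured in the portal's administration UI. The address range and the "Public Experience" name are made up for the example.

```python
import ipaddress

# Hypothetical address range covering the Automation Server(s); use your own.
CRAWLER_NETWORKS = [ipaddress.ip_network("10.0.5.0/28")]


def pick_experience(client_ip, crawler_networks=CRAWLER_NETWORKS):
    """Return the name of the experience definition a request should see."""
    addr = ipaddress.ip_address(client_ip)
    # Requests from the crawler's address range get the navigation-free copy.
    if any(addr in net for net in crawler_networks):
        return "Search Experience"
    # Everyone else keeps the normal site with header, footer and navigation.
    return "Public Experience"
```

Because the rule keys on the source address, real visitors never see the stripped-down experience; only requests originating from the crawl job do.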
The "catch" in using this alternate experience definition method is that you cannot crawl the dot com website directly from the home page. Without navigation in place, the crawler would not be able to follow the links on your site and properly crawl every page. A workaround for this issue is to create a static HTML file somewhere on your portal's imageserver. This file simply contains direct links to each community page in your site. This may seem cumbersome at first, but it is fairly easy to run a SQL query against the portal database and return a list of object IDs for each page. From this list you can use search and replace in a text editor to create the proper link format. Once the HTML file is built correctly and placed in an accessible location, simply point your portal's dot com website crawler at the HTML page and have it follow one level of links. For my recent client we currently have a static list of all pages in the site, but we plan on coding an automated page in .NET that queries the database to build the list of page links on the fly. Such an effort is not very intensive, as the SQL query is already written. Server-side load is not a concern either, since the custom .NET page will only be accessed once a day when the crawl runs.
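If search and replace in a text editor gets tedious, the same link page can be generated with a few lines of script. The Python sketch below assumes you already have the list of page object IDs from your SQL query. The URL template is a hypothetical placeholder, not the portal's real link format, so substitute whatever link format your portal uses.

```python
from html import escape

# Hypothetical URL pattern; replace with your portal's actual page link format.
PAGE_URL_TEMPLATE = "http://portal.example.com/page?objID={object_id}"


def build_link_page(object_ids, url_template=PAGE_URL_TEMPLATE):
    """Return a bare HTML document with one plain link per page object ID."""
    links = []
    for object_id in object_ids:
        url = url_template.format(object_id=object_id)
        # Escape the URL so characters such as '&' stay valid inside href.
        links.append('<a href="{0}">Page {1}</a><br>'.format(
            escape(url, quote=True), object_id))
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>\n"


if __name__ == "__main__":
    # Example IDs only; in practice these come from the database query.
    print(build_link_page([101, 205, 317]))
```

Save the output to a file on the imageserver and point the crawler at it with a link depth of one, as described above.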
The end result of this alternate method of searching is a complete, fully indexed repository of content that allows users to search against all pages on the website while avoiding any "throwaway" hits that would be generated by the site's navigation. If you decide to go this route in applying search to your public-facing dot com portal and have any questions, don't hesitate to ask. Additionally, if you would like the SQL query used to pull links to all the community pages within the portal, drop me an email and I can get that to you.
Happy "Smart" Searching.
Monday, July 7, 2008
An example of a deleted admin object causing other issues in the portal occurred at my client site last fall. An administrator had deleted an old Publisher portlet that was in the process of being replaced by another portlet. A few days after this portlet was deleted, users started noticing that certain links in other Publisher portlets would result in a gateway error. After some detailed troubleshooting we determined that multiple Publisher links throughout the portal were gatewayed through the removed portlet. Since the portlet with the referenced ID no longer existed, the links were failing and the gateway error was being returned. In short, we needed to restore the deleted portlet to allow the gatewayed links to function properly.
An attempt was made to export and import the object through a PTE file from the test environment portal. This method did not work as the old portlet was imported but with a different object ID. The only "supported" way of restoring the old object with the proper ID is to restore the entire portal database. This is really not an option for certain portals that feature a highly active user base where content is changing hour-by-hour. A full restore of a database from two days earlier could result in several unwanted repercussions, including:
- Lost content that would have to be re-crawled or re-imported
- Lost portlet and community preferences that had been changed in the past two days
- Non-synched portal users created in the last 48 hours that would need to be re-created
- A whole slew of other issues that make this option one to avoid if at all possible
In this emergency situation I decided to utilize the rarely used PTOBJECTCOUNTERS table in the portal database. This table simply functions as a counter mechanism for objects of different classes. When a new administrative object is added to the portal, the NEXTOBJECTID field of the PTOBJECTCOUNTERS table is referenced to determine the next object ID to use for the specified object class. This table is in place to prevent object ID conflict issues throughout the portal.
To fix the Publisher issue in my client's portal, I used SQL Enterprise Manager to change the NEXTOBJECTID value of the record associated with CLASSID of 43 (the class ID for portlets in the portal) from the existing value (which I copied to re-use after the fix) to the ID number of the removed portlet. I could determine the ID of the deleted portlet object by looking at the gateway URL referenced in the Publisher links that were not functioning properly. Once the ID number was changed, I re-imported the PTE file from the test environment. The portlet object was created with the proper ID and immediately all broken Publisher links in the portal were working without issue. I then changed the NEXTOBJECTID value back to the original value to ensure that all new portlets would be assigned a unique object ID.
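The save, swap, re-import and restore sequence can be sketched in Python as follows. This is an illustration under loud assumptions: it runs against an in-memory SQLite stand-in rather than a real portal database, the two-column table layout is inferred only from the CLASSID and NEXTOBJECTID fields mentioned above, and the reimport callback is a hypothetical placeholder for the manual PTE import. The IDs 312 and 500 are invented.

```python
import sqlite3

PORTLET_CLASS_ID = 43  # class ID for portlets, as noted above


def restore_with_original_id(conn, deleted_object_id, reimport):
    """Point the portlet counter at deleted_object_id, run reimport, then
    put the counter back to the value it had before."""
    cur = conn.cursor()
    cur.execute("SELECT NEXTOBJECTID FROM PTOBJECTCOUNTERS WHERE CLASSID = ?",
                (PORTLET_CLASS_ID,))
    saved_next_id = cur.fetchone()[0]  # copy the existing value for re-use
    cur.execute("UPDATE PTOBJECTCOUNTERS SET NEXTOBJECTID = ? WHERE CLASSID = ?",
                (deleted_object_id, PORTLET_CLASS_ID))
    conn.commit()
    try:
        reimport()  # hypothetical stand-in for re-importing the PTE file
    finally:
        # Always restore the counter so new portlets get unique IDs again.
        cur.execute("UPDATE PTOBJECTCOUNTERS SET NEXTOBJECTID = ? WHERE CLASSID = ?",
                    (saved_next_id, PORTLET_CLASS_ID))
        conn.commit()
    return saved_next_id


def demo():
    """Run the sequence against an in-memory stand-in table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE PTOBJECTCOUNTERS "
                 "(CLASSID INTEGER PRIMARY KEY, NEXTOBJECTID INTEGER)")
    conn.execute("INSERT INTO PTOBJECTCOUNTERS VALUES (?, ?)",
                 (PORTLET_CLASS_ID, 500))
    assigned = []

    def fake_reimport():
        # Mimics an import picking up whatever ID the counter currently holds.
        row = conn.execute(
            "SELECT NEXTOBJECTID FROM PTOBJECTCOUNTERS WHERE CLASSID = ?",
            (PORTLET_CLASS_ID,)).fetchone()
        assigned.append(row[0])

    restore_with_original_id(conn, 312, fake_reimport)
    final = conn.execute(
        "SELECT NEXTOBJECTID FROM PTOBJECTCOUNTERS WHERE CLASSID = ?",
        (PORTLET_CLASS_ID,)).fetchone()[0]
    return assigned[0], final
```

The try/finally is the important design choice here: even if the import fails, the counter returns to its saved value, which guards against the ID conflicts the table exists to prevent.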
This is just an example of a way the PTOBJECTCOUNTERS table can allow an administrator to correct object ID issues in the portal. PTOBJECTCOUNTERS is a simple table in design but an integral table in the makeup of the ALUI portal platform and can be a lifesaver at times.