‘Natural’ searching of database-sourced websites

 

Database-sourced websites provide powerful facilities for searching according to complex criteria. However, generalised keyword-style searches – Google-style – are not straightforward to implement.

SQL queries which are used to interrogate a database operate in quite a different way from website search engines.

These notes set out a method for implementing such general keyword-search facilities in database-sourced websites.

(Note: to make sense of the following, you will probably need reasonable familiarity with SQL queries and regular expressions).

The problem

  1. SQL queries facilitate general substring searches, but not ‘whole word’ searches
  2. Multiple-word searches across different fields are difficult

‘Whole word’ searching

SQL select statements allow substring searches along the lines
    Select * From tbl Where fld Like '*salt*'.
However, if users were looking for references to salt and this returned results referring to, say, psalter, users might be a little surprised, not to say confused.

To find complete words only, the query term could be surrounded by spaces, as in
    Select * From tbl Where fld Like '* salt *'.
However, it wouldn’ t find ‘salt’ at the end of a sentence, where it is followed by a full stop rather than a space - which would not be acceptable. The same applies to all punctuation. Similarly, if the search term appears at the beginning or at the end of the field, it will not be matched.

Multiple-word searches

While theorists agonise about boolean ‘and/or’ searches, Google has clearly established that for straightforward searching, the most useful search results are those including all search terms.

Within a single database field, this can be achieved by
    Select * From tbl Where fld Like '*salt*' And fld Like '*pepper*'

The trouble comes in covering more than one database field; if ‘salt’ is in one field, and ‘pepper’ in a different field, it would take an impractically complex SQL query to find the correct results for an arbitrary number of search terms across an arbitrary number of fields.

Whole-word searching

If punctuation undermines the strategy of surrounding search terms with spaces, the obvious solution is to convert all punctuation to spaces, after which surrounding search terms with spaces will work.

The trouble is that the matching against search terms is done within the SQL query, inside the database. This limits one’s options, as neither client-side nor server-side scripts can help; JavaScript & VBScript are out of the picture. Standard SQL offers only the simplest wild-card matching, and provides no regular-expression matching (otherwise full-words could be matched by the simple regular expression "([^a-zA-Z]|^)salt([^a-zA-Z]|$)"). (Some databases, such as Oracle and DB2, do have proprietary extensions for regular expression handling).

Search field

The solution is to have a separate database field which can hold a reformatted version of the field to be searched, with all punctuation replaced by spaces. I would sooner have avoided messing with the database, and have implemented search facilities purely in the website, but I couldn't find a way. Like this, it’s not enormously elegant, but it works.

Populating this ‘search’ field is done by an update query which replaces all punctuation by spaces. This is a task beyond SQL queries: a quick dip into some programming is required.

The query which populates the search field will be along the lines of
    Update tbl Set srchfld = " " & stripPunctuation(fld) & " "

This stripPunctuation function can be implemented in VisualBasic (I was doing this in Access) as follows:

Public Function stripPunctuation(str)
  str = Replace(str, ".", " ")
  str = Replace(str, ",", " ")
  str = Replace(str, ":", " ")
  str = Replace(str, ";", " ")
  str = Replace(str, "-", " ") ' hyphen
  str = Replace(str, "–", " ") ' n-dash
  str = Replace(str, "—", " ") ' m-dash
  str = Replace(str, "?", " ")
  str = Replace(str, "…", " ")
  str = Replace(str, "(", " ")
  str = Replace(str, ")", " ")
  str = Replace(str, "[", " ")
  str = Replace(str, "]", " ")
  str = Replace(str, " '", " ") ' strip quotes around words but leave
  str = Replace(str, "' ", " ") '   internal quotes eg O'Patrick
  str = Replace(str, "  ", " ") ' two spaces down to one
  stripPunctuation = str
End Function

Or, perhaps more elegantly and flexibly, by implementing a version of the Cold Fusion ReplaceList function as follows:

Public Function stripPunctuation(str)
  str = ReplaceList(str, ".,:,;,-,–,—,?,_,(,),[,], ',' ,  ", _
                         " , , , , , , , , , , , ,  ,  , ")
  str = Replace(str, ",", " ") ' do this separately as 
  stripPunctuation = str       ' comma is list separator
End Function

Public Function replaceList(str As String, 
                            list1 As String, 
                            list2 As String) As String
  ' within 'str', replace all terms in comma-separated 
  ' list 'list1' with matching term in 'list2'
  Dim l1() As String, l2() As String, i As Integer
  l1 = Split(list1, ",")
  l2 = Split(list2, ",")
  For i = 0 To UBound(l1) - 1
    str = Replace(str, l1(i), l2(i))
  Next i
  replaceList = str
End Function

To take an example, if we had a database field containing the text
     ‘Form_factor: small, power requirement (at low consumption): 200w
the stripPunctuation function would replace it with
     ‘Form factor small power requirement at low consumption 200w’.

This would then enable the space-surrounded search term to operate correctly, where it couldn’t against the original text.

This function could be run as often as required, depending on how often data changes, using the update query above.

The only remaining problem is to implement the searching within this transformed search field. Remember, the idea is that users can enter an arbitrary number of search terms, and have results returned where all search terms are matched.

Criteria builder

For converting a list of search terms into a SQL query, I used some of the power of JavaScript to build the SQL search criteria I want. I’ve switched to JavaScript for this because this is done on the web server, so I can opt for JavaScript’s more powerful regular expression handling.

You may have to think about this. What we need to do is to transform keywords such as ‘salt pepper’, or perhaps ‘salt, pepper’ into SQL query criteria such as ‘Like '* salt *' And like '* pepper *'’. JavaScript regular expressions can do this very cleverly, though in the code here I replace the Windows ‘*’ wildcard with the SQL standard ‘%’ wildcard (a more sophisticated version could consider terms prefixed by ‘-’ as exclusion from the search, and terms in quotes as a single phrase).

function buildAndConditions(terms, field)
{

    // globally replace white-space and comma with 'And' term
    return(field + 
           " Like '% " + 
           terms.replace(/[\s,]/g, " %' And " + field + " Like '% ") + 
           " %'");
}

This will replace any whitespace between individual search terms with the text ‘ %' And fld Like '% ’.

If a user entered a single search term ‘salt’, the replace method would do nothing to the search term, and the resulting criteria would be ‘fld Like '% salt %'’. If the user entered a pair of search terms, like ‘salt pepper’ (or ‘salt, pepper’), the replace method would replace the whitespace with ‘ %' And fld Like '% ’, resulting in criteria of ‘fld Like '% salt %' And fld Like '% pepper %'’ (where the underlined section was substituted by the replace method). With more search terms, this creates appropriate ‘And’ criteria between them all.

Using this JavaScript function, within my template web page I can build a SQL query

    strSQL = "SELECT ID, fld FROM tbl WHERE " & _
              buildAndConditions(Request("srchTerm")),"srchfld")

– in this case, using the ASP Request function to obtain the search terms from an HTML form or a URL parameter. The JavaScript has to be run server-side, of course (runat="server" in IIS).

If this query was invoked with a search term of ‘salt, pepper’, it would generate the SQL query

    SELECT ID, fld 
    FROM tbl 
    WHERE srchfld Like '* salt *' 
      AND srchfld Like '* pepper *'

It may have been a hard journey (at least it was for me), but this approach gives us Google-style whole-word searching on multiple search terms, within database-sourced websites.

Multiple search terms across multiple fields

Having incorporated a ‘search’ field into the database, it is now a simple step to enable searching across multiple fields. All that is required is to combine all fields which need to be searched across into a single search field.

To do this, the update query which populates the search field becomes

Update tbl Set srchfld = " " & stripPunctuation(fld1) & " " & stripPunctuation(fld2) & " "

Further refinements - accented characters

For this project, I also had to enable searches for accented characters. Many databases will provide case-insensitive searches, but most will not match accented against un-accented characters. So for a user to search for detente or détente, they would have to know how the word had originally been entered into the database.

The ReplaceList function used above can also be used to convert accented characters to their un-accented equivalents. In this case, it has to be used to convert accented characters to their un-accented equivalents both in the database ‘search’ field, and in the web page construcing the search query. In this way, it doesn’t matter whether the database contains accented characters or not, or whether the user’s search terms include accented caracters or not – all the matching is done against un-accented equivalents.