Paging

1. Design Patterns for Paging

1.

Design Patterns for Paging

Various

The basic question started as a query how to obtain and display, client side, a number of linked pages from a database.

Wouldn't it be easier to simply paginate the XML before it gets to the translator? ie. throw windows of XML data at the translator and translate for each page?

Steve Muench adds:

I've been out to several customers who've tried to do this "paging" in one way or another with XSLT. The least efficient way I've observed is a customer who was querying 25,000 rows of database data into XML (using the Oracle XML SQL Utility) and then using top-level stylesheet parameters and <xsl:if> elements in their stylesheet to "filter" the data to emit only rows N through M of those 25000 onto the browser. (Where M-N is usually in the 10-15 range).

Yikes.

Much more efficient is to, as you suggest, query-up/pass-in only the 15 rows in the current "window" you want to format and not to use the in-memory filtering capability of XSLT as a replacement for the fast indexes that a database might be able to use to quickly return just the rows you want to "see" on the current page.

Terris" <terris@terris.com> adds:

Easier said than done. Databases can not directly perform

select * from A 
return rows 50-60

This is a simple example. Most queries contain joins. Queries can be expensive. The query cost can prohibit executing the same query over and over again, scrolling through the results for the desired subset. And anyway the result set might change between executions. You would be surprised how many sites actually do this though.

You either consume disk or memory, take your pick. If you go the disk route (disk is much cheaper than memory), you either use the database or the file system. For example, you could store XML data sets in files.

One database-based solution is to create a temporary table that holds the query results. This temporary table has an additional "row number" column which is indexed.

There are several problems with this solution...

1. It consumes database resources
2. If you use a "real" temporary table you have to associate a database connection with the session .. don't even try it. So you have no choice but to create real tables on the fly.
3. Tables have to be cleaned up after a session times out
4. Need to make sure the same session can only have one active temporary table at a time, otherwise one session can consume an excessive amount of disk
5. Database engines were not designed for this -- engines like the schema to stay constant, and creating and dropping tables every few minutes is disastrous

Instead of using temporary tables, create one "query results" table that is also indexed on session ID. You still have to garbage collect the rows at some point, but managing rows is easier than managing entire tables.

A database solution is appealing because you can leverage indexes. A database can be accessed from any machine on a server farm, so sessions don't have to be tied to particular servers.

The devil is always in the details. Where does the row number come from? Unfortunately, you usually can't use the select INTO clause to populate the query results table. This is because you can't tell SQL to generate an incrementing row number based on the ORDER BY clause. Some dialects of SQL might actually support this. Without such support you have to write code that iterates through the result set, adds an incrementing row number column, and inserts the row into the query results table. This can be very slow for large result sets.

It seems natural to use XML to represent temporary result sets, but generally this approach doesn't scale.

XML files are fundamentally slow because the file system can't index XML documents. Stores like eXcelon can index XML but all of the undesirable features of temporary tables described earlier will creep up.

If your data sets are small, you could get away with managing temporary XML files in the file system. You could try to forego files and cache documents in memory. Row-based access is fast when the document has an ID attribute whose value indicates the row number (r1, r2, r3, etc.). XML parsers automatically index ID attributes for fast access. However, this solution doesn't work on server farms.

In conclusion...

The short answer is this is not easy. The longer answer is that the best solution seems to be to store query results in a permanent table that is specifically indexed for page-based access. The best implementation depends on the size of the result sets. Database-based solutions scale up to very large result sets and scale out to web farms.

You can even pre-populate the query results table with your most popular queries. This out-perfoms all possible implementations and that's an understatement. Of course the assumption is that the underlying data doesn't change frequently, which may or may not be correct.

I wish I could say that raw XML technologies somehow solve this problem. There are limitations to all tools. XML could be a facilitator, however, You could build a component that manages result sets in a database and returns "pages" as XML. NOW you're talking.

Larry Mason warns:

Now throw in user sorting capability and your row number concept is shot. Pages are in the eye of the consumer not the producer.

Kevin Williams:

I see this mistake everywhere I look, and it's been a problem since before the days of XML. The cardinal rule when marshalling data should be "eliminate unnecessary information as early in the process as possible" - typically in the relational database layer itself. Otherwise, you're just chewing up bandwidth and memory, and forcing platforms not well suited to data manipulation (such as XSLT) to do the work instead...

Francis Norton offers:

I've had good results from a two step process, with variable length items.

Step 1 is mark up the printable elements by simply adding newPage (boolean), pageNumber and LineNumber attributes to the printable elements, using a recursive template with page and line parameters, which knows how to size the elements.

Step 2 is to merge the data into a format stylesheet, which then becomes fairly simple - match a page template for-each element where newPage = true(), then for-each element with that pageNumber, apply-templates in layout mode to find a item template which prints and positions the element correctly.

Joshua Allen goes on:

But ... databases *can* directly perform

select * from A
return 50 rows where x > the_last_x_in_my_previous_resultset

With SQL Server you can use the "TOP" operator or "SET ROWCOUNT" to specify how many rows are returned. I am certain that DB2 has this capability as well, and I think Oracle goes even beyond this ("bookmarks" or something? It has been a long time...)

Of course, your example of using a column indexed specifically for paging is a great strategy. But if you already are returning the data in some non-random order, then you can also use the above idea to return pages. Using the above avoids the need to scroll through the resultset each time, and allows page cutoffs to change as data gets inserted. Materialized views can alleviate the pain of joins, and heavy-read situations are not friendly to joins anyway.

If you can tolerate a small degree of data staleness, it would also be possible to pull back the whole *entire* resultset as XML and cache the individual "pages" as separate XML files on the filesystem. (I mean, I have yet to see a system that can update the user's web-page when the underlying data in the database changes, so you always have *some* degree of staleness). Recognizing this and the relation to caching leads to all sorts of interesting options. One of the worst possible choices for paging is to scroll through the *entire* rowset *every* time a user makes a request. You may have to do this in some situations, but probably far fewer than most people realize.

Just some additional info to the original idea of paging via XSLT -- sometimes people want to do this so that they can target devices with multiple form factors (i.e. use 30 lines per page if the browser is Netscape, use 11 lines per page if it's a Palm VII). One word of caution in this approach is that usability is so challenging on the small devices that usability concerns usually far outweigh the programmability concerns that lead people to consider this strategy. You might make it work, but many people have experienced the only way to get their systems tolerable on small devices is custom- build UI for each device (for example, on a web browser you can show city and state together for input, but on a WML device you are better to have one screen that lets you choose a letter range for the first letter of the city, then the next screen with more detail, etc.; very difficult to XSLT-ize). So don't be too disappointed if reality foils the well-laid plans :-)

Mike Kay provides a possible (reasonable) break point:

If you've been following other threads you'll have seen the advice that if you're getting the data from the database it's best to generate one XML document for each displayed page of data.

However, at the level of 1000 records, handling a single document is quite feasible and despite what others have said, if you're doing the processing in the client I think it's quite a reasonable option, because it reduces the number of hits on the server. You'll find an example using this sort of architecture on page 608 of my book*: it's not doing paging in the way you describe, but a similar kind of interactive navigation within a dataset.

Steve Muench closes with an Oracle comment:

As a final posting, I'll note that the way with Oracle to achieve this is to use the "TOP-N" query feature in Oracle8i.

If you are ordering your data by a value that makes each row distinct (e.g. the primary key or any other unique key), then to retrieve the "next 10" rows after a certain key that was "last" on the current page, you can do:

SELECT userid, firstname, lastname, phone
  FROM ( SELECT userid, firstname, lastname, phone
           FROM all_emps
          WHERE lastname like '%JONES%'  /* This was the orig query critera */
           AND  userid > 'abjones'       /* Return matches beyond 'abjones' */
        ORDER BY userid                  /* Order the inner query by userid */
       )
  WHERE ROWNUM <= 10  /* Get the "top-10", i.e. first 10, rows */

I cover this topic of stateless paging in my book on pages 643-655.

| The "big" advantage I get by paging and sorting at the browser(/client)
| end is that I do not have to hit the web_server/db_server again for the
| same (sorted/paged) data.If I have to save some db hits, I may have to
| cache the data using sessions.

The tradeoff depends on the probability that the users will visit all of the rows. If the user is likely to visit all the rows, then one big hit is probably better and browser-side XSLT can be helpful.

If the user is likely to conclude after paging through a couple of pages that they *should* have refined their query criteria a little better (imagine a Google.com search where you typically don't go beyond the first couple of pages of "most relevant" hits), then downloading all the hits might be a bad idea. Depends on the app.