Saturday, March 7, 2009

Histogrammed data in a database.

This is a recurring issue; How to efficiently update a large set of histogrammed data in a database? Suppose you have the following table in a database:
CREATE TABLE histo
(
  x INT NOT NULL,
  y INT NOT NULL,
  value INT NOT NULL
)
For instance, the data could represent a frequency distribution of values, that are classified according to the (x,y)-property. I.e., value is histogrammed on (x,y). I have often run into situations where the coordinates (x,y) can vary between 1 and 20000 so the table will possibly contain 200002=4·108 rows. To update a value to 1234, for (x,y)=(1,2), the following SQL statement can be executed:
UPDATE histo SET value=1234 WHERE x=1 AND y=2
To update a set of values, however, is not easily accomplished with a single UPDATE. You can set value on a set of (x,y) pairs with one UPDATE but not different values for different (x,y)'s. The following is valid SQL:
UPDATE histo SET value=1234 WHERE x=1 AND y IN(1,2)
The value column be set to 1234 for (x,y)=(1,1),(1,2). Suppose you want to set value to 1234 for (x,y)=(1,1) and value to 5678 for (x,y)=(1,2). So, you would have to execute the following two SQL statements:
UPDATE histo SET value=1234 WHERE x=1 AND y=1
UPDATE histo SET value=5678 WHERE x=1 AND y=2
Ok, you can perhaps live with executing an extra statement or two. But when the number of updates needed increase, the burden on the database grows as well. It is not possible to do these two updates in only one statement.