StatusNet schema-x branch in progress….

While you’re all waiting for the StatusNet 0.9.6 patch release, here’s an update on our StatusNet 1.0.x infrastructure projects…

To make upgrades and plugin installation go more smoothly, we’re overhauling the schema API to let us use a flexible, database-independent table definition language for the entire database. For now I’ve been adapting the definition syntax from Drupal’s schema API though we’re not (currently) actually using any of their code. (It looks like the GPL licensing should be clear, so it’s a matter of whether the code would be cleanly sharable.)

Work is progressing on the schema-x branch on my StatusNet git clone, and will be folded into mainline 1.0.x once it’s reasonably stable.

So how does all this work, you ask?

Let’s take a peek at the foreign_subscription table. This is used to record known subscription/friends relationships on other services, such as when you’ve connected a Twitter or Facebook account to your StatusNet setup.

It’s a fairly simple relational database table, which connects two records from another table (foreign_user), identified by a service ID key and a per-service user ID for each connected record. The table definition, using the Drupal-style definition arrays, looks like this:

$schema['foreign_subscription'] = array(
    'fields' => array(
        'service' => array('type' => 'int', 'not null' => true, 'description' => 'service where relationship happens'),
        'subscriber' => array('type' => 'int', 'size' => 'big', 'not null' => true, 'description' => 'subscriber on foreign service'),
        'subscribed' => array('type' => 'int', 'size' => 'big', 'not null' => true, 'description' => 'subscribed user'),
        'created' => array('type' => 'datetime', 'not null' => true, 'description' => 'date this record was created'),
    ),
    'primary key' => array('service', 'subscriber', 'subscribed'),
    'foreign keys' => array(
        'foreign_subscription_service_fkey' => array('foreign_service', array('service' => 'id')),
        'foreign_subscription_subscriber_fkey' => array('foreign_user', array('subscriber' => 'id', 'service' => 'service')),
        'foreign_subscription_subscribed_fkey' => array('foreign_user', array('subscribed' => 'id', 'service' => 'service')),
    ),
    'indexes' => array(
        'foreign_subscription_subscriber_idx' => array('service', 'subscriber'),
        'foreign_subscription_subscribed_idx' => array('service', 'subscribed'),
    ),
);

That DB-independent schema definition gets filtered and converted to the particular data types that each database engine will support.

Using the dumpschema.php test script, we can have the Schema system peek at my 0.9.x-installed MySQL database and pull the actual current table and index definitions:

$ php scripts/dumpschema.php foreign_subscription
'foreign_subscription' => array(
    'fields' => array(
        'service' => array('type' => 'int', 'not null' => true, 'description' => 'service where relationship happens'),
        'subscriber' => array('type' => 'int', 'not null' => true, 'description' => 'subscriber on foreign service'),
        'subscribed' => array('type' => 'int', 'not null' => true, 'description' => 'subscribed user'),
        'created' => array('type' => 'datetime', 'not null' => true, 'description' => 'date this record was created')
    ),
    'primary key' => array('service', 'subscriber', 'subscribed'),
    'indexes' => array(
        'foreign_subscription_subscriber_idx' => array('subscriber'),
        'foreign_subscription_subscribed_idx' => array('subscribed')
    )
)

To be able to update existing database tables, it needs to calculate the differences between what’s in the DB now, and what it’s supposed to look like.

We can pass the --diff option to dumpschema.php to visualize this:

$ php scripts/dumpschema.php --diff foreign_subscription
'foreign_subscription' => array(
    'fields' => array(
        'OLD subscriber' => array('type' => 'int', 'not null' => true, 'description' => 'subscriber on foreign service'),
        'NEW subscriber' => array('type' => 'bigint', 'not null' => true, 'description' => 'subscriber on foreign service'),
        'OLD subscribed' => array('type' => 'int', 'not null' => true, 'description' => 'subscribed user'),
        'NEW subscribed' => array('type' => 'bigint', 'not null' => true, 'description' => 'subscribed user')
    ),
    'indexes' => array(
        'OLD foreign_subscription_subscriber_idx' => array('subscriber'),
        'NEW foreign_subscription_subscriber_idx' => array('service', 'subscriber'),
        'OLD foreign_subscription_subscribed_idx' => array('subscribed'),
        'NEW foreign_subscription_subscribed_idx' => array('service', 'subscribed')
    )
)

We can see that the foreign_subscription table’s definition has been updated with some fixes: the user ID field types have been changed to match the referenced foreign_user table, and indexes have been tweaked to take into account that foreign user IDs need to be looked up within the context of a service ID.

Neat, right? But not all that useful yet. What we really want is for it to be able to produce SQL statements to modify the table to the state it should be:

$ php scripts/dumpschema.php --update foreign_subscription
-- 
-- foreign_subscription
-- 
DROP INDEX foreign_subscription_subscriber_idx ON foreign_subscription;
DROP INDEX foreign_subscription_subscribed_idx ON foreign_subscription;
ALTER TABLE foreign_subscription MODIFY COLUMN subscriber bigint not null comment 'subscriber on foreign service',
MODIFY COLUMN subscribed bigint not null comment 'subscribed user';
CREATE INDEX foreign_subscription_subscriber_idx ON foreign_subscription (service,subscriber);
CREATE INDEX foreign_subscription_subscribed_idx ON foreign_subscription (service,subscribed);

These are the SQL statements that would be run during a ‘checkschema’ operation, modifying the existing table to match our new definition. Each database engine can override bits of the SQL generation; there are a number of differences between how MySQL and PostgreSQL do things for instance.

If we’re super-brave, we can go further and enable foreign key support, which makes the relationships between tables explicit and lets the database raise errors if we try to do something that would create inconsistent data. Our statusnet.sql table definitions file has had ‘REFERENCES’ clauses for a long time, but the way we declared them MySQL doesn’t actually enforce them — they’re really just for documentation purposes. Our PostgreSQL definitions have had actual functioning foreign keys, but haven’t gotten as much testing as the MySQL version.

Enabling foreign key support for MySQL can help developers and testers to confirm correct behavior, without forcing them to also be on in production where the performance premium might be too high. (In an ideal world though I’d love to have them running in production too!)

To turn on this option, add to config.php:

  $config['db']['mysql_foreign_keys'] = true;

With foreign keys enabled, the schema system will happily try to add the key defs:

$ php scripts/dumpschema.php --update foreign_subscription
-- 
-- foreign_subscription
-- 
DROP INDEX foreign_subscription_subscriber_idx ON foreign_subscription;
DROP INDEX foreign_subscription_subscribed_idx ON foreign_subscription;
ALTER TABLE foreign_subscription MODIFY COLUMN subscriber bigint not null comment 'subscriber on foreign service',
MODIFY COLUMN subscribed bigint not null comment 'subscribed user',
ADD CONSTRAINT foreign_subscription_service_fkey FOREIGN KEY (service) REFERENCES foreign_service (id),
ADD CONSTRAINT foreign_subscription_subscriber_fkey FOREIGN KEY (subscriber,service) REFERENCES foreign_user (id,service),
ADD CONSTRAINT foreign_subscription_subscribed_fkey FOREIGN KEY (subscribed,service) REFERENCES foreign_user (id,service);
CREATE INDEX foreign_subscription_subscriber_idx ON foreign_subscription (service,subscriber);
CREATE INDEX foreign_subscription_subscribed_idx ON foreign_subscription (service,subscribed);

Note that things probably won’t all work 100% right with foreign keys on MySQL yet; for one thing they don’t work on MyISAM tables, which are needed to use the full-text search for notice and profile data. :)

But Brion, how do these tables get created in the first place with no SQL file?

The installer has been updated to use the Schema interface. The core (non-plugin) table definitions are listed in db/core.php, which are pulled to run through Schema for creation.

This has necessitated a little bit of code rearrangement to get everything Schema needs to work up and running mid-install, but it seems to more or less work. Note that there may be some minor issues sill keeping the installer from 100% working at the moment.

We saw how the logic works, but when do upgrades actually happen for core tables?

That hasn’t gone in just yet… my current plan is to fold in those checks with the rest of the checkschema event, so we’ll handle core and plugin tables in the same swath of stuff.

But won’t that be slow to check all those tables?

It would indeed be pretty slow to do a full check of what every table looks like on every hit. However, we should be able to eliminate that and still get nice auto-updating behavior.

This hasn’t yet been implemented, but I plan to add a ‘schema_version’ table listing all the known tables in the system and a checksum of their schema definition array as of when we last updated them. Now, instead of pulling metadata for dozens of tables (several queries each), we only need to do one quick lookup on the ‘schema_version’ table — which itself can be cached.

If a checksum is missing or doesn’t match the array we’ve got now, then we know we need to go check that table out in detail.

So Brion, how does this change how plugins add tables?

Existing plugins using the 0.9.x schema interface should actually still work: your old TableDef/ColumnDef stuff will be automatically converted into the new array layout when you pass it in to Schema->ensureTable().

But you’ll make your life nicer if you switch fully to the new system, since it’ll let us drop a lot of duplicated code that plugins have to do in 0.9…

If the internal API stays stable (and it may not ;), you would have your table’s class extend Managed_Data_Object, an extended version of our cachable Memcache_Data_Object which knows how to build its DB_DataObject internal metadata and Memcache_Data_Object cache keys from one of our table definition arrays. Add a getSchema static method to return the new-style array, and run that through Schema::ensureTable() in your onCheckSchema handler.

More of the common logic will get moved into the Plugin class, so once it’s finished you should be able to simply declare which table names/classes to add without messing with Schema manually.

2 thoughts on “StatusNet schema-x branch in progress….”

Aryeh Gregor says:

October 21, 2010 at 10:07 am

Awesome. I wish we could could have this for MediaWiki too. Someday . . .
brion says:

October 25, 2010 at 1:02 pm

Yeah, Chad was saying this would be great for the installer & updaters; there was some earlier work along similar lines in MW’s new-installer branch but it got put aside as it wasn’t quite working out.

I’ll be seeing what I can merge in :D

Comments are closed.