TagSchema

Contents

[edit] Tagging and Folksonomy Schema Concepts

This page and related pages describe various issues surrounding the Web 2.0 tagging and folksonomy concepts, including discussions on schema architecture, common data access patterns, and replication/scale-out guidelines.

[edit] Core Concepts and Definitions

[edit] Tagging

"Tagging" is a form of content and data categorization that is evolving with modern web applications. It is characterized by free-form text (words or small phrases) that is "attached" or "tagged" to an item. The item could be anything -- a blog post, a bookmark, a product, another user, etc. A tag schema is a database schema which attempts to facilitate the storage and retrieval of tag data, and to allow for increasingly complex data analysis of the relationships formed by linked tags and items.

One of the advantages that tagging offers over rigid categorization of content is exactly that: it is free-form and allows users to categorize the content and data the way they want, as opposed to the way the content "owners" want. This enables the user-driven aspect of content production in much the way that the wiki editing style has enabled user contributions to content.

This advantage however also has its downsides, in that the free-form nature of tagging can lead to an explosion in the amount of data used to categorize content. Tags must be successfully de-duplicated and maintained in order to best represent the categories of data.

[edit] Folksonomy

People needed a way of describing the evolving science of connecting users of an application with each other via direct or indirect linking. Thus, the word "folksonomy" came to represent adding the "user" dimension to data analysis and database schemas. Folksonomy refers to the analysis of a user's connections to other things in an application ecosystem, whether those things are other users, items, tags, or whatever.

This user dimension allows the application developer to provide multiple ways of viewing and analysing the data stored in their databases. It allows the database developer to answer questions such as:

[edit] Participation, Connection and Interaction

Tagging and folksonomy encourage and facilitate multiple users editing and categorization the same content. This leads to a couple things:

Tags becoming the new categorization/filing of this content because:

So, the common concept here is “linking”, which is the "R" in RDBMS.

[edit] Schema Considerations

[edit] Various Schema Architectures

There are a number of schema designs that have been popularized. The three main ones are usually called "MySQLicious", "Scuttle", and "Toxi". These sections describe these schema architectures. Please note that in all the examples below, the bookmark "item" is used as the "tagged" entity. This is simply for illustrative purposes. The item could just as easily be a blog post, product, rss feed item, or anything else.

[edit] MySQLicious

The MySQLcious tag schema design is entirely denormalized, with a single fact table that comprises that schema. Tags are stored in a delimited list in a single field, and either textual LIKE expressions or FULLTEXT indexing is used in order to search for matched tags.

[edit] Sample Schema

Here is a sample MySQLicious schema. Note that certain fields existing in the actual MySQLicious schema have been removed that do not specifically pertain to tagging...

CREATE TABLE bookmarks (
bookmark_id INT UNSIGNED NOT NULL AUTO_INCREMENT
, url VARCHAR(255) NOT NULL
, tags TEXT NULL
, PRIMARY KEY (bookmark_id)
, FULLTEXT INDEX (tags)
) ENGINE=MyISAM;

Note that the MyISAM storage engine is used so that FULLTEXT indexing can be used to query for bookmarks containing matching textual expressions.

[edit] Advantages

The only real advantage to the MySQLicious schema design is its relative simplicity. Only one table is used and queried. For very small databases, this schema might be attractive because of its simplicity

[edit] Disadvantages

The schema has three distinct disadvantages:

  1. Because it is denormalized, it encourages duplication of a lot of data and certain queries to determine relationships between users (were a user_id dimension to be added to the table) would be difficult to code and slow to perform
  2. FULLTEXT indexing is really the only way to get acceptable performance out of the database schema. Because of this, the choice of storage engines would be limited to MyISAM, which introduces concurrency issues in heavier-write applications and inhibits the ability to enforce foreign key relationships at the database level.
  3. The schema is not flexible. Having only one table which describes every entity prevents the database developer from being able to easily and efficiently query for information on specific entities.

[edit] Scuttle

The Scuttle solution uses a single fact table for bookmarks and a relationship table for tags. It typically looks like the following:

[edit] Sample Schema

CREATE TABLE bookmarks (
bookmark_id INT UNSIGNED NOT NULL AUTO_INCREMENT
, url VARCHAR(255) NOT NULL
, PRIMARY KEY (bookmark_id)
) ENGINE=InnoDB;
CREATE TABLE bookmark_tags (
record_id INT UNSIGNED NOT NULL AUTO_INCREMENT 
, bookmark_id INT UNSIGNED NOT NULL # Foreign key to bookmarks table
, tag TEXT NOT NULL
, PRIMARY KEY (record_id)
, INDEX (tag)
) ENGINE=InnoDB;

The bookmark_tags table also might use the MyISAM storage engine in order to take advantage of FULLTEXT indexing (while losing any foreign key enforcement):

CREATE TABLE bookmark_tags (
record_id INT UNSIGNED NOT NULL AUTO_INCREMENT 
, bookmark_id INT UNSIGNED NOT NULL
, tag TEXT NOT NULL
, PRIMARY KEY (record_id)
, FULLTEXT INDEX (tag)
) ENGINE=MyISAM;

[edit] Advantages

The Scuttle solution is certainly more normalized than the MySQLicious solution, however it still does not prevent duplication of tag textual data in the bookmark_tags table. Foreign keys can be used to enforce the relationship between bookmark and the bookmark's set of tags. Because only one tag is stored per row in the bookmarks table, equality conditions can be used, unlike MySQLicious, which leads to greater performance advantages, comparable or better than the FULLTEXT indexing required by MySQLicious.

[edit] Disadvantages

The Scuttle solution is still not properly normalized, and has the following flaws:

  1. A "surrogate" key is used in the bookmark_tags table (record_id) instead of using the naturally-occuring primary key tuple on bookmark_id and tag_text
  2. Storage space (and thus memory usage) is increased since the textual tag data is used in the tags table instead of an ID of the tag text itself. This leads to fewer index records being able to fit into a single index block, therefore reducing the efficiency and speed of index accesses
  3. There is no easy way to accurately deduce statistical information (e.g. counts of bookmarks for each tag) with the existing schema because the one to many relationship mis-represents the true many to many relationship that exists between bookmarks and tags

[edit] Toxi

The final solution popularized on the Internet is known as the "Toxi" solution. It typically looks like the following, which represents a three-table many-to-many mapping between bookmarks and tags:

[edit] Sample Schema

CREATE TABLE bookmarks (
bookmark_id INT UNSIGNED NOT NULL AUTO_INCREMENT
, url VARCHAR(255) NOT NULL
, PRIMARY KEY (bookmark_id)
) ENGINE=InnoDB;
CREATE TABLE bookmark_tags (
record_id INT UNSIGNED NOT NULL AUTO_INCREMENT 
, bookmark_id INT UNSIGNED NOT NULL # Foreign key to bookmarks table
, tag_id INT UNSIGNED NOT NULL # Foreign key to tags table
, PRIMARY KEY (record_id)
, INDEX (bookmark_id, tag_id)
) ENGINE=InnoDB;
CREATE TABLE tags (
tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT 
, tag_text TEXT NOT NULL
, PRIMARY KEY (tag_id)
, INDEX (tag_text)
) ENGINE=InnoDB;

Of all the popularized solutions, this schema closest matches the recommended schema design. It does, however, have a couple flaws.

[edit] Advantages

The Toxi solution takes the previous two designs further along the normalization road by correctly separating the many to many relationship into a mapping table (bookmark_tags). Doing so eliminates the duplicated information (particularly the textual tag data) that plagues the other two designs.

This normalized solution allows summary or statistical data to be stored for each of the entities themselves, as well as other information concerning the bookmark-tag relationship (such as when those key values were stored). This is perhaps the best use of space within main memory of the DB server, allowing for fast lookups on the tag text itself, and matching of the relationship based on quick integer keys.

[edit] Disadvantages

The main disadvantage to the Toxi schema (as is the case with most properly normalized schema) is that there are now more tables to maintain and administer. In addition to this, there are a couple flaws in the popularized Toxi design that our recommended solution aims to remedy:

  1. Again, a surrogate key is used in the mapping table (record_id). This is not needed and slows down performance, particularly in the case of the InnoDB storage engine.
  2. A UNIQUE INDEX is not used on the tag_text in the tags fact table. This is needed to eliminate duplicate tag text values.
  3. An index is not provided in the mapping table (bookmark_tags) which would allow efficient querying on the tag_id itself, without being coupled with the bookmark_id. Any query grouping on tag_id that did not supply a constant bookmark_id (or range of bookmark_id values) would not be able to use the index on (bookmark_id, tag_id) in order to efficiently process the data.

[edit] Recommended Architecture

The recommended solution for an efficient, normalized tagging schema is a refinement on the Toxi solution. The storage engine used in the recommended solution is the InnoDB storage engine, because of its benefits of higher concurrency limits, referential integrity, and a clustered data organization, which makes primary key lookups very fast. Primary key lookups are the dominant data access pattern in tagging schema, which makes InnoDB an ideal storage engine, especially for the master server in a replicated setup (see more below)

[edit] Schema

The schema involves three tables which represent the many to many relationship between the item and the tag. Here, we deviate from the bookmark paradigm and use a general "item" semantics to indicate that the type of item is irrelevant to the schema design.

CREATE TABLE Items (
item_id INT UNSIGNED NOT NULL AUTO_INCREMENT
, item_name VARCHAR(255) NOT NULL
/* Many more attributes of the item... */
, PRIMARY KEY (item_id)
) ENGINE=InnoDB;
CREATE TABLE Item2Tag (
item_id INT UNSIGNED NOT NULL 
, tag_id INT UNSIGNED NOT NULL 
, PRIMARY KEY (item_id, tag_id)
, INDEX (tag_id)
, FOREIGN KEY fk_Item (item_id) REFERENCES Items (item_id)
, FOREIGN KEY fk_Tag (tag_id) REFERENCES Tags (tag_id)
) ENGINE=InnoDB;
CREATE TABLE Tags (
tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT 
, tag_text TEXT NOT NULL
, PRIMARY KEY (tag_id)
,  UNIQUE INDEX (tag_text)
) ENGINE=InnoDB;

[edit] Sample Schema with User Dimension

Additionally, when the application has a need for a user dimension (folksonomy), the schema can be expanded like so:

CREATE TABLE Users (
user_id INT UNSIGNED NOT NULL AUTO_INCREMENT
, user_name VARCHAR(255) NOT NULL
/* Many more attributes of the user... */
, PRIMARY KEY user_id)
) ENGINE=InnoDB;
CREATE TABLE Items (
item_id INT UNSIGNED NOT NULL AUTO_INCREMENT
, item_name VARCHAR(255) NOT NULL
/* Many more attributes of the item... */
, PRIMARY KEY (item_id)
) ENGINE=InnoDB;
CREATE TABLE Tags (
tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT 
, tag_text TEXT NOT NULL
, PRIMARY KEY (tag_id)
,  UNIQUE INDEX (tag_text)
) ENGINE=InnoDB;
CREATE TABLE UserItemTag (
user_id INT UNSIGNED NOT NULL
, item_id INT UNSIGNED NOT NULL 
, tag_id INT UNSIGNED NOT NULL 
, PRIMARY KEY (user_id, item_id, tag_id)
, INDEX (item_id)
, INDEX (tag_id)
, FOREIGN KEY fk_User (user_id) REFERENCES Users (user_id)
, FOREIGN KEY fk_Item (item_id) REFERENCES Items (item_id)
, FOREIGN KEY fk_Tag (tag_id) REFERENCES Tags (tag_id)
) ENGINE=InnoDB;

[edit] Sample Schema with Summary Tables

If an application requires aggregate data access patterns -- such as COUNTing the number of tags per item, per user/item combination, etc -- then we recommend denormalizing the schema to use aggregate or summary tables which store specific counters. This technique is used in order to bypass the current InnoDB-specific limitation of poorly performing aggregate queries. Below, we use the MEMORY storage engine for the statistic tables since this engine provides excellent speed for primary key access due to its hash-based indexing. The data for MEMORY tables is lost upon system shutdown; however since all the information in the summary tables can be rebuilt from the data in the main tables, there is no need for concern. For systems where memory requirements are an issue, simply change the MEMORY storage engine to InnoDB.

CREATE TABLE Items (
item_id INT UNSIGNED NOT NULL AUTO_INCREMENT
, item_name VARCHAR(255) NOT NULL
/* Many more attributes of the item... */
, PRIMARY KEY (item_id)
) ENGINE=InnoDB;
CREATE TABLE Item2Tag (
item_id INT UNSIGNED NOT NULL 
, tag_id INT UNSIGNED NOT NULL 
, PRIMARY KEY (item_id, tag_id)
, INDEX (tag_id)
, FOREIGN KEY fk_Item (item_id) REFERENCES Items (item_id)
, FOREIGN KEY fk_Tag (tag_id) REFERENCES Tags (tag_id)
) ENGINE=InnoDB;
CREATE TABLE Tags (
tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT 
, tag_text TEXT NOT NULL
, PRIMARY KEY (tag_id)
,  UNIQUE INDEX (tag_text)
) ENGINE=InnoDB;
CREATE TABLE TagStat (
tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT 
, num_items INT UNSIGNED NOT NULL DEFAULT 1
/* For each distinct type of item tracked with tags, add a field... */
, PRIMARY KEY (tag_id)
, FOREIGN KEY fk_Tag (tag_id) REFERENCES Tags (tag_id)
) ENGINE=MEMORY;

[edit] Common Data Access Patterns

The following section describes recommended methods for querying the tag schema for various common data access patterns.

[edit] Getting the "Tag Cloud"

A tag cloud is a way of representing the number of items tagged with a phrase. Tag density typically represented by larger fonts or different colors. Here is the typical way of querying for the data needed to represent a tag cloud:

SELECT tag_text
, COUNT(*) as num_items
FROM Item2Tag i2t
INNER JOIN Tags t
ON i2t.tag_id = t.tag_id
GROUP BY tag_text;

However, this particular query has scalability issues when used with InnoDB tables, due to problems with COUNT(*) queries using that engine. We recommend instead using summary tables, as described above, and using the following query instead to produce the same results:

SELECT tag_text, ts.num_items
FROM Item2Tag i2t
INNER JOIN Tags t
ON i2t.tag_id = t.tag_id
INNER JOIN TagStat ts
ON t.tag_id = ts.tag_id
GROUP BY tag_text;

[edit] Items Having Any of a Set of Tags

A very common access pattern is one in which the purpose of the query is to find any item which is tagged with any of a set of tags. For instance, in the example below, we show a typical query to obtain all items which have been tagged with "beach", "cloud", OR "waves". This query is fairly simple, as the IN() construct accomplishes the OR condition:

SELECT i2t.item_id
FROM Item2Tag i2t
INNER JOIN Tag t
ON i2t.tag_id = t.tag_id
WHERE t.tag_text IN ('beach','cloud','waves')
GROUP BY i2t.item_id;

[edit] Items Having All of A Set of Tags

Another common access pattern is a variation on the above where the purpose of the query is to find any item which is tagged with all of a set of tags. For instance, in the example below, we show a typical query to obtain all items which have been tagged with "beach", "cloud", AND "waves". The first example is the type of query typically given to accomplish this query type, however it has serious efficiency issues, which are resolved in the second example query.

SELECT i2t.item_id
FROM Item2Tag i2t
INNER JOIN Tag t
ON i2t.tag_id = t.tag_id
WHERE t.tag_text IN ('beach','cloud','waves')
GROUP BY i2t.item_id
HAVING COUNT(DISTINCT i2p.tag_id) = 3;

The query accomplishes its work by ensuring that the items included in the final output have all three unique tags being queried for. The HAVING COUNT(DISTINCT i2t.tag_id) = 3 expression is what filters the output correctly.

However, the HAVING COUNT() expression as well as the required GROUP BY expression force MySQL to use a temporary table and filesort in order to generate the query results. This is inefficient, and the same query results can be accomplished via repeated self-joins, as below demonstrates:

SELECT i2t3.item_id
FROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3
INNER JOIN Item2Tag i2t1
ON t1.tag_id = i2t1.tag_id
INNER JOIN Item2Tag i2t2
ON i2t1.item_id = i2t2.item_id
AND i2t2.tag_id = t2.tag_id
INNER JOIN Item2Tag i2t3
ON i2t2.item_id = i2t3.item_id
AND i2t3.tag_id = t3.tag_id
WHERE t1.tag_text = 'beach'
AND t2.tag_text = 'cloud'
AND t3.tag_text = 'waves';

The key to understanding the above query is following the join from one instance of Item2Tag table to the next. Essentially, the self-joining process starts with a set of items attached to the "beach" tag. Then, that reduced set of item IDs is joined to the set of item IDs attached to the "cloud" tag, and then to the set of item IDs attached to the "waves" tag. This self-joining process iteratively reduces the final result set without the need to GROUP or count distinct output records.

The CROSS JOINs shown in the example is used to get three distinct tag IDs from the Tags table. These three records are used to reduce the joined datasets.

[edit] Example of Generating Above Query (PHP)

The above query most often will be found when a searcher enters in a number of tags into a search box and the server-side script must build a query to find the appropriate items containing all three tags. The example below (in PHP) shows how this query might be generated. The example assumes a comma-delimited string of tags and does no error checking, for brevity.

<?php
$search_tags = $_POST['tags']; /* Tag values entered in form */
$search_tags = explode(',', $search_tags);
$sql = "";
$num_tags = count($search_tags);
if ($num_tags > 0 && $num_tags <= 32) {
  $sql_select = "SELECT i2t" . ($num_tags - 1) . ".item_id ";
  $sql_from = " FROM ";
  $sql_where = " WHERE ";
  $sql_joins = "";
  for ($i=0;$i<$num_tags;++$i) {
    if ($i==0) {
      $sql_from .= " Tags t0 ";
      $sql_where .= " t0.tag_text = '" . $search_tags[0] . "'";
      $sql_joins .= " INNER JOIN Item2Tag i2t0 ON t0.tag_id = i2t0.tag_id ";
    }
    else {
      $sql_from .= " CROSS JOIN Tags t" . $i;
      $sql_where .= " AND t" . $i . ".tag_text = '" . $search_tags[$i] . "'";
      $sql_joins .= " INNER JOIN Item2Tag i2t" . $i . " ON i2t" . ($i - 1) . ".item_id = i2t" . $i . ".item_id " . 
                    " AND i2t" . $i . ".tag_id = t" . $i . ".tag_id ";
    }
  }
  $sql = $sql_select . $sql_from . $sql_joins . $sql_where;
}
$results = mysql_query($sql, $some_connection);
?>

[edit] Finding Related Tags Via An Item

[edit] Finding Related Items of All of a Set of Tags (INTERSECTION)

[edit] Finding Unrelated Items (MINUS)

[edit] Finding COUNTs of Related Items or Tags

[edit] Using Summary Tables

[edit] Replication Guidelines for Scaling Out

[edit] The Roles of the Master Server

[edit] The Roles of the Slave Servers

[edit] Using Alternate Storage Engines

[edit] Further Reading

Retrieved from "http://forge.mysql.com/wiki/TagSchema"

This page has been accessed 27,898 times. This page was last modified 21:02, 3 September 2006.

Find

Browse
MySQLForge
Main Page
Current events
Recent changes
Random page
Help
Edit
Edit this page
Editing help
This page
Discuss this page
Post a comment
Printable version
Context
Page history
What links here
Related changes
My pages
Special pages
New pages
File list
Statistics
Bug reports
More...