Traversing a tree stored as an adjacency list using a Recursive CTE built in SQLAlchemy

This is part of a series on combining PostgreSQL Materialized Views, Recursive CTEs, and SQLAlchemy:

  1. Using SQLAlchemy to create and manage PostgreSQL Materialized Views
  2. Traversing a tree stored as an adjacency list using a Recursive CTE built in SQLAlchemy (this post)

The problem:

Recently, while working on the RockClimbing.com codebase, I encountered several situations where data needed to be pre-calculated/cached because on-the-fly calculations were too slow.

For example, RockClimbing.com has over 100,000 climbing routes scattered across 35,000 locations. The locations are essentially a tree structure, ranging in specificity from general area to specific wall: “North America” > “California” > “Yosemite” > “El Capitan” > “North Face”.

As you browse different areas, it’d be nice to see the total number of climbing routes within each area. For example, if you’re viewing North America, on each state we’d like to display the total number of climbing routes in that state.

In other words, for each node, find all child nodes and then sum the routes attached to those child nodes.

There’s a number of ways to map a tree structure to SQL tables. For simplicity, I store the locations as an adjacency list. Adjacency lists make it easy to insert and reorder nodes, although traversing the tree (especially downwards) can be slow because you have to walk through all the nodes in order to identify the children. An index helps, but it’s still not fast, especially when you need to traverse multiple levels deep.

Trying to solve it with Python:

My initial prototype looked something like this:
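
Roughly, it was a recursive count along these lines (a simplified sketch: plain Python objects stand in for the SQLAlchemy models, and the attribute names are illustrative):

```python
# Sketch of the initial prototype. Plain objects stand in for the real
# models here so the example runs standalone; in the real code,
# `children` is a SQLAlchemy relationship on the Location model.
class Location:
    def __init__(self, route_count=0):
        self.route_count = route_count  # routes attached directly here
        self.children = []              # child locations (adjacency list)

def count_routes(location):
    """Sum the routes on this location plus all of its descendants."""
    total = location.route_count
    for child in location.children:
        total += count_routes(child)
    return total

# e.g. Yosemite > El Capitan > North Face
yosemite = Location(route_count=1)
el_capitan = Location(route_count=2)
north_face = Location(route_count=3)
yosemite.children.append(el_capitan)
el_capitan.children.append(north_face)
print(count_routes(yosemite))  # 6
```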

This worked fine with my normal test dataset of 100 locations, but became too slow once I increased the test dataset to 50,000 locations.

Switching to a while loop would improve things slightly since Python doesn’t support tail recursion. But the bigger problem is that accessing location.children is expensive. My location model is configured with lazy='dynamic', which forces SQLAlchemy to issue a new query for every location. A better solution is pre-fetching all the locations and then iterating through them in memory, but that’s still a lot of data being unnecessarily transferred between the database and the app.

Another attempt using a Recursive CTE:

Instead of doing all this data transfer, if I can write a database query that traverses the tree, then I could handle most of the filtering work in the database and only return the results. Plus the database C code will be much faster than interpreted Python code.

Traditionally, traversing adjacency lists was impossible with normal SQL and required writing custom database functions. Thankfully, we’re using PostgreSQL which supports Recursive CTEs (a quick introduction) that make it a good deal easier:
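
The idea looks roughly like the following sketch. It’s shown here against SQLite (which also supports WITH RECURSIVE) so it runs standalone; the table and column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE location (id INTEGER PRIMARY KEY, parent_id INTEGER);
    CREATE TABLE route (id INTEGER PRIMARY KEY, location_id INTEGER);
    -- a tiny tree: 1 is the root, 2 is its child, 3 and 4 are under 2
    INSERT INTO location VALUES (1, NULL), (2, 1), (3, 2), (4, 2);
    INSERT INTO route VALUES (1, 3), (2, 3), (3, 4), (4, 1);
""")

# Walk the tree downward from location 1, then count the attached routes.
row = conn.execute("""
    WITH RECURSIVE descendants(id) AS (
        SELECT id FROM location WHERE id = ?        -- anchor: starting node
        UNION ALL
        SELECT l.id FROM location l
        JOIN descendants d ON l.parent_id = d.id    -- recursive step
    )
    SELECT count(*) FROM route
    WHERE location_id IN (SELECT id FROM descendants)
""", (1,)).fetchone()
print(row[0])  # 4
```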

I tested this with a dataset of 35,000 locations and 100,000 routes. For the most expensive queries near the top of the tree, it took about 200-250ms of database time to walk the tree and calculate the cumulative route count. That’s a huge speedup over doing all the work in Python, but still not fast enough to dynamically generate on every page load, especially since some pages display the route counts for multiple locations, so they’d need to run the query multiple times.

Pre-calculating the result and storing it:

The nice thing about this problem is we don’t require the results to always be 100% up to date. Only a handful of new routes are added every day, and nothing bad happens if someone browsing the website sees a count that is a few hours out of date.

So a better solution is to pre-calculate the result and cache it somewhere.

I decided against adding a count_recursive_routes column to the location table because it felt hacky–I’d rather separate the calculated/cached data from the original data. I expect the number of things I want to pre-calculate to grow over time, and I don’t want to keep stacking extra columns on the table. Using an in-memory datastore like Redis is another common solution, but so far we haven’t needed it, and I was hesitant to add the extra complexity to our stack just for this.

Instead, I decided to use a PostgreSQL Materialized View. If you’re not familiar with materialized views, my previous blog post provides a good overview.

Generalizing the Recursive CTE across the entire table:

The recursive CTE solution above calculates the recursive route count for a single location, but for our materialized view, we need to generalize the query so it returns route counts for all locations. On the surface, that sounds easy, but it turned out to be much harder. I finally turned to StackOverflow where Erwin Brandstetter suggested solving it using a PostgreSQL ARRAY:
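
The general shape of the ARRAY-based approach looks roughly like this (a sketch, not the verbatim StackOverflow answer; table and column names are assumptions):

```sql
WITH RECURSIVE tree AS (
    -- each row carries the full chain of ancestor ids in an ARRAY
    SELECT id, ARRAY[id] AS ancestors
    FROM   location
    WHERE  parent_id IS NULL
    UNION ALL
    SELECT l.id, t.ancestors || l.id
    FROM   location l
    JOIN   tree t ON l.parent_id = t.id
)
-- a route at location X counts toward X and every ancestor of X
SELECT a.location_id, count(r.id) AS route_count
FROM   tree t
CROSS  JOIN LATERAL unnest(t.ancestors) AS a(location_id)
LEFT   JOIN route r ON r.location_id = t.id
GROUP  BY a.location_id;
```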

This query took approximately 18 minutes to calculate results for the entire test dataset of 35,000 locations and 100,000 routes. This is plenty fast for something that only needs to run once a day as a background job.

Converting the Recursive CTE to a SQLAlchemy selectable:

In order to use the materialized view with SQLAlchemy, I needed to convert the raw SQL query into a SQLAlchemy selectable and pass it to our custom create_mat_view() function.

Converting the recursive CTE turned out to be more straightforward than I expected. I incrementally built the various components, then combined them into a single query. As mentioned previously, if you’re using Flask-SQLAlchemy, the query needs to be built using the SQLAlchemy core select() function, rather than the more typical session.query().

Anytime you’re translating a complicated SQL query to SQLAlchemy, it’s much easier if you print the individual components to make sure you’re assembling the query correctly. In the below code, I left in a few of these print statements that I used for debugging. You’ll notice that sometimes, to see the actual output, you need to tell SQLAlchemy to compile the statement using postgresql.dialect().
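
A sketch of how the pieces fit together with SQLAlchemy core (the table layout and label names are assumptions; the final print shows the debugging technique just described):

```python
from sqlalchemy import (Column, ForeignKey, Integer, MetaData, Table,
                        func, select)
from sqlalchemy.dialects import postgresql

metadata = MetaData()
location = Table('location', metadata,
                 Column('id', Integer, primary_key=True),
                 Column('parent_id', Integer, ForeignKey('location.id')))
route = Table('route', metadata,
              Column('id', Integer, primary_key=True),
              Column('location_id', Integer, ForeignKey('location.id')))

# Anchor member: every location seeds its own subtree.
tree = (select(location.c.id.label('root_id'),
               location.c.id.label('id'))
        .cte('tree', recursive=True))

# Recursive member: extend each subtree downward one level at a time.
tree = tree.union_all(
    select(tree.c.root_id, location.c.id)
    .where(location.c.parent_id == tree.c.id)
)

# Final select: cumulative route count for every location.
stmt = (select(tree.c.root_id.label('location_id'),
               func.count(route.c.id).label('route_count'))
        .select_from(tree.outerjoin(route,
                                    route.c.location_id == tree.c.id))
        .group_by(tree.c.root_id))

# Debug print: compile with the PostgreSQL dialect to see the real SQL.
print(stmt.compile(dialect=postgresql.dialect()))
```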

The final print debug statement prints a SQL query that exactly matches the definition of the materialized view (excluding the CREATE MATERIALIZED VIEW portion). Now we just need to pass this to the create_mat_view() function:

Querying:

You can read about how to query a materialized view using SQLAlchemy in my previous blog post.

Wrapup:

The more I use SQLAlchemy, the more impressed I am with it (and Mike Bayer, the primary developer behind it). Not many ORMs support creating materialized views using a Recursive CTE, the PostgreSQL-specific ARRAY datatype, and more.

Using SQLAlchemy to create and manage PostgreSQL Materialized Views

This is part of a series on combining PostgreSQL Materialized Views, Recursive CTEs, and SQLAlchemy:

  1. Using SQLAlchemy to create and manage PostgreSQL materialized views (this post)
  2. Traversing a tree stored as an adjacency list using a Recursive CTE built in SQLAlchemy
  3. Working example code on GitHub

Overview of materialized views in SQL:

If you’re not familiar with SQL views, a view is essentially a saved query whose results behave like a table. Normally those results only exist for the duration of a single containing query, but a materialized view stores (materializes) the results so they can be used in later queries.

Materialized views offer a number of benefits:

  • Fast and simple retrieval. Retrieving data from a materialized view is just like writing queries against a normal table:
  • Unlike normal views, materialized views (at least in PostgreSQL) support traditional database indexes for even faster retrieval:
  • Simple data maintenance. Recalculating all the data is as simple as:
  • Zero-downtime refresh. PostgreSQL 9.4 onward supports refreshing a materialized view concurrently in the background without impacting other queries:
  • Doesn’t add complexity to your infrastructure because it’s built into your database and requires no additional services.
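
In PostgreSQL, the commands for the bullets above look roughly like this (view and column names are assumptions):

```sql
-- Retrieval works just like querying a normal table
SELECT * FROM gear_item_mv WHERE gear_item_id = 42;

-- Materialized views support regular indexes
CREATE UNIQUE INDEX uq_gear_item_mv ON gear_item_mv (gear_item_id);

-- Recalculate all the data
REFRESH MATERIALIZED VIEW gear_item_mv;

-- PostgreSQL 9.4+: refresh without blocking readers
REFRESH MATERIALIZED VIEW CONCURRENTLY gear_item_mv;
```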

Drawbacks of materialized views:

  • Any data you want to store in a materialized view has to be the result of another SQL query–you can’t generate the data in your app and then store it in the materialized view.
  • Like any cache, data can go stale. You need to remember to manually issue a REFRESH command to your database. Either your app can update the materialized view after it inserts/updates the source data, or if that causes too many updates, just run a cron job that refreshes the materialized view on a regular schedule. You can also use a database trigger, although you’ll need to create a custom plpgsql function.
  • You can’t update a single row–only the entire view.

Overall, materialized views in PostgreSQL are a very convenient way to add caching to many applications.

Ways to teach SQLAlchemy about a materialized view:

In order to write ORM queries against my materialized views, I needed to tell SQLAlchemy about them. Querying a materialized view is very similar to querying a normal SQL table, so the easiest way to do this is to manually create the materialized view in your database. Then, within your app, create a normal SQLAlchemy Table mapping that corresponds to the columns in your materialized view.

While this makes it possible to query the materialized view, it also means that you’ll hit problems when creating/dropping all your tables because SQLAlchemy sees the materialized view as a normal SQL table. While uncommon in production, it’s a very common thing to do in a test suite or your local dev instance. Additionally, using a traditional SQLAlchemy table mapping makes it difficult to refresh the materialized view from within your app. I already manage all my database tables and metadata with SQLAlchemy, and wanted the convenience of fully managing my materialized views as well.

Adding custom DDL commands to SQLAlchemy for materialized views:

SQLAlchemy doesn’t have a built-in command for creating views, let alone materialized views. But it is possible to subclass DDLElement and create your own custom DDL commands. After several iterations (and help from Mike Bayer), here’s what I ended up with:
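
In outline, the recipe looks something like this (a simplified sketch: the full version also wires a create_mat_view() helper into the metadata’s after_create/before_drop events so create_all()/drop_all() manage the view):

```python
from sqlalchemy import column, select
from sqlalchemy.dialects import postgresql
from sqlalchemy.ext import compiler
from sqlalchemy.schema import DDLElement

class CreateMaterializedView(DDLElement):
    """Custom CREATE MATERIALIZED VIEW ... AS SELECT ... command."""
    def __init__(self, name, selectable):
        self.name = name
        self.selectable = selectable

class DropMaterializedView(DDLElement):
    def __init__(self, name):
        self.name = name

@compiler.compiles(CreateMaterializedView)
def _create_materialized_view(element, ddl_compiler, **kw):
    # Plain string interpolation -- only safe because nothing here is
    # user input (see the security note).
    return 'CREATE MATERIALIZED VIEW %s AS %s' % (
        element.name,
        ddl_compiler.sql_compiler.process(element.selectable,
                                          literal_binds=True),
    )

@compiler.compiles(DropMaterializedView)
def _drop_materialized_view(element, ddl_compiler, **kw):
    return 'DROP MATERIALIZED VIEW IF EXISTS %s' % element.name

# Quick sanity check: compile the DDL against the PostgreSQL dialect
ddl = CreateMaterializedView('demo_mv', select(column('answer')))
print(ddl.compile(dialect=postgresql.dialect()))
```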

Security Note: Using string interpolation to assemble SQL commands always makes me nervous about accidentally creating a SQL injection vulnerability. None of these commands include user input, so we’re fine. However, be careful not to accidentally re-use this code somewhere else where you are accepting user input.

Creating a materialized view using SQLAlchemy:

This lets us create a materialized view by passing a SELECT statement generated with SQLAlchemy core to our custom create_mat_view() function.

Here’s an example of how you’d call it–this calculates the total number of reviews and average review rating for each GearItem:
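
Something along these lines (a sketch: the model and column names, like rating, are assumptions):

```python
from sqlalchemy import (Column, ForeignKey, Integer, Numeric, String,
                        func, select)
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class GearItem(Base):
    __tablename__ = 'gear_item'
    id = Column(Integer, primary_key=True)
    name = Column(String)

class GearReview(Base):
    __tablename__ = 'gear_review'
    id = Column(Integer, primary_key=True)
    gear_item_id = Column(Integer, ForeignKey('gear_item.id'))
    rating = Column(Numeric)

# Total reviews and average rating, per gear item.
selectable = (
    select(
        GearItem.id.label('gear_item_id'),
        func.count(GearReview.id).label('review_count'),
        func.avg(GearReview.rating).label('review_rating'),
    )
    .select_from(GearItem.__table__.outerjoin(GearReview.__table__))
    .group_by(GearItem.id)
)

# then: GearItemMV = create_mat_view('gear_item_mv', selectable)
print(selectable.compile(dialect=postgresql.dialect()))
```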

The above code will emit the following DDL statement to the database:
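
Roughly along these lines (the exact formatting and naming will vary; column names are assumptions):

```sql
CREATE MATERIALIZED VIEW gear_item_mv AS
SELECT gear_item.id AS gear_item_id,
       count(gear_review.id) AS review_count,
       avg(gear_review.rating) AS review_rating
FROM gear_item
LEFT OUTER JOIN gear_review
     ON gear_item.id = gear_review.gear_item_id
GROUP BY gear_item.id
```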

Sidenote: I’m using the factory pattern to create my Flask app, so I can only pass a db.select() query to create_mat_view(). Trying to use the normal db.session.query() from the SQLAlchemy ORM won’t work because it creates a circular dependency: the Flask-SQLAlchemy sessionmaker() requires an app context, but the db object (and table mappings) are processed by Python before the app is instantiated. So when SQLAlchemy tries to set up the table mappings, calling db.session() invokes the Flask-SQLAlchemy sessionmaker(), which throws an exception because there’s no app context. Using db.select() from SQLAlchemy Core avoids this issue because it doesn’t create a session. Within the selectable itself, it’s fine to use ORM constructs such as object references like GearReview. This works because SQLAlchemy doesn’t have to create a session in order to resolve what a GearReview is.

Querying a materialized view using SQLAlchemy:

Since our recipe handles both creating the materialized view and mapping it to an ORM object, querying is very simple:

Alternatively, if you’re using Flask-SQLAlchemy: GearItemMV.query.get(row_id)

Additionally, this example has a 1:1 mapping between the GearItem objects and the pre-calculated data about them stored in GearItemMV, which makes for convenient queries like this:

While convenient, this results in brittle code because we have direct queries against the materialized view littered throughout our codebase. A better way is to map the materialized view data as attributes on the original objects, decoupling the attribute call from the underlying implementation of how the attribute is calculated/cached. This avoids having to change a bunch of controller/view code if we switch caching mechanisms.

Mapping the cached result as an attribute on the original object:

SQLAlchemy supports several ways to map the materialized view output onto the original object.

Initially I tried using a column_property():
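
Roughly like this (a sketch: the correlated scalar subquery is attached to the class after both models exist; column names are assumptions):

```python
from sqlalchemy import Column, ForeignKey, Integer, select
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import column_property, declarative_base

Base = declarative_base()

class GearItem(Base):
    __tablename__ = 'gear_item'
    id = Column(Integer, primary_key=True)

class GearItemMV(Base):
    __tablename__ = 'gear_item_mv'
    gear_item_id = Column(Integer, ForeignKey('gear_item.id'),
                          primary_key=True)
    review_count = Column(Integer)

# Attached after both classes exist -- a correlated scalar subquery
# against the materialized view.
GearItem.review_count = column_property(
    select(GearItemMV.review_count)
    .where(GearItemMV.gear_item_id == GearItem.id)
    .scalar_subquery()
)

print(select(GearItem).compile(dialect=postgresql.dialect()))
```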

Because column_properties are evaluated at import time, this has to be defined after the GearItemMV definition; it cannot be declared inline with the GearItem model definition.

There were three main problems with this approach:

  1. Under the covers, column_property() uses correlated subqueries. Correlated subqueries are slow because the query planner can’t optimize them–it effectively has to run a complete subquery for every row returned by the parent query. So unlike traditional set-based SQL where retrieving additional rows is O(log n), the work required to retrieve column_properties is O(n*m), where n is the number of objects and m is the number of column_properties on the objects.
    For example, let’s say my GearItem class has two column properties review_count and review_rating, and I want to retrieve thirty objects from the database. Effectively the database is processing the outer query to return thirty items, plus an additional query for each correlated subquery for each item. You won’t notice this in the SQL query logs because only one query is sent to the database, but if you actually run EXPLAIN on the query, you’ll see that the one query ballooned into doing the work of 61 queries!
  2. SQLAlchemy can’t take advantage of the pseudo-caching ability of the session since this query doesn’t use session.query.get(). Normally, once an object has been loaded within a session, querying it again using db.session.query(object).get(pk_id) will retrieve the existing object from the SQLAlchemy session without re-querying the database. This only happens using the special get() function which first checks the session’s identity map for the primary key.
  3. Column properties are calculated as soon as an object is loaded. So whenever I load a GearItem, even if I never access the review data, I’m still triggering the expensive correlated subquery.

After a little more research, I switched to mapping the attributes onto the parent object using a SQLAlchemy @hybrid_property. Like a normal Python property, it’s only evaluated on access, plus it provides additional benefits.

Here was my initial attempt:

It works fine for querying individual items. It also caches the GearItemMV object in the session, so if I retrieve review_count and then later retrieve review_rating it will only emit one database call.

However, it doesn’t work for querying at the class level (e.g., db.session.query(GearItem).filter(GearItem.review_count < 4)). The SQLAlchemy docs suggest handling this using hybrid_property.expression(), but my attempts felt needlessly overcomplicated and inelegant.

After some more thought, I realized that due to the 1:1 relationship between GearItem rows and GearItemMV rows, I could let the ORM handle the mapping as if it were a traditional foreign-key based relationship. This matched how I was already thinking about the original_table : materialized_view relationship, so it should make it easier to write queries.

The only problem is that materialized views can’t have actual foreign keys in the database.

It turns out that SQLAlchemy lets us define table-to-table relationships that effectively function as foreign-key relationships but only within SQLAlchemy. They are never actually emitted to the database as foreign key constraints:
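
A sketch of that mapping (model and column names are assumptions): the ForeignKey exists only in the SQLAlchemy metadata, since the custom CREATE MATERIALIZED VIEW DDL never emits a real constraint.

```python
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import (configure_mappers, declarative_base,
                            relationship)

Base = declarative_base()

class GearItem(Base):
    __tablename__ = 'gear_item'
    id = Column(Integer, primary_key=True)
    # 1:1 link to the materialized view row. viewonly=True because the
    # view can never be written through the ORM.
    mv = relationship('GearItemMV', uselist=False, viewonly=True)

class GearItemMV(Base):
    __tablename__ = 'gear_item_mv'
    # This ForeignKey lives only in the SQLAlchemy metadata; it is
    # never emitted to the database.
    gear_item_id = Column(Integer, ForeignKey('gear_item.id'),
                          primary_key=True)
    review_count = Column(Integer)

configure_mappers()  # resolve the string-based relationship now
```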

This results in a much cleaner @hybrid_property:
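
Something like the following sketch (names are assumptions): the instance level just walks the relationship, while the class level hands back the mapped view column so it can be used in filters.

```python
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class GearItemMV(Base):
    __tablename__ = 'gear_item_mv'
    gear_item_id = Column(Integer, ForeignKey('gear_item.id'),
                          primary_key=True)
    review_count = Column(Integer)

class GearItem(Base):
    __tablename__ = 'gear_item'
    id = Column(Integer, primary_key=True)
    mv = relationship(GearItemMV, uselist=False, viewonly=True)

    @hybrid_property
    def review_count(self):
        # instance level: walk the relationship (one query, then
        # cached in the session)
        return self.mv.review_count if self.mv is not None else 0

    @review_count.expression
    def review_count(cls):
        # class level: lets you write e.g.
        # query(GearItem).join(GearItem.mv).filter(GearItem.review_count < 4)
        return GearItemMV.review_count
```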

Refreshing a materialized view using SQLAlchemy:

Refreshing a materialized view is straightforward:
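
In raw SQL (the view name is whatever you created it as):

```sql
REFRESH MATERIALIZED VIEW gear_item_mv;
```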

Unfortunately, this locks the materialized view until the refresh completes. PostgreSQL 9.4 added CONCURRENTLY to let reads continue at the expense of a longer refresh time:
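
The concurrent version is the same command with one extra keyword:

```sql
REFRESH MATERIALIZED VIEW CONCURRENTLY gear_item_mv;
```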

My refresh function:

For convenience, I turned this into a class method on my materialized views:
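
A sketch of what that class method can look like (simplified: in the real app the session would be db.session, and the caller commits). The demo session just prints the SQL so the example runs without a database:

```python
from sqlalchemy import Column, Integer, text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class MaterializedView(Base):
    """Shared base class for all materialized views (a sketch)."""
    __abstract__ = True

    @classmethod
    def refresh(cls, session, concurrently=True):
        session.execute(text('REFRESH MATERIALIZED VIEW {}{}'.format(
            'CONCURRENTLY ' if concurrently else '',
            cls.__tablename__,
        )))

class GearItemMV(MaterializedView):
    __tablename__ = 'gear_item_mv'
    gear_item_id = Column(Integer, primary_key=True)

class _DemoSession:
    """Stand-in for db.session so this sketch runs standalone."""
    def execute(self, stmt):
        print(stmt)

GearItemMV.refresh(_DemoSession())
# prints: REFRESH MATERIALIZED VIEW CONCURRENTLY gear_item_mv
```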

All my materialized views inherit from this MaterializedView instead of the normal declarative base class. So now I can call GearItemMV.refresh() to refresh that specific view.

Similarly, I also added a function for refreshing all views:

You’ll need to tweak this if your materialized views depend on one another, or if your database includes normal (non-materialized) views. Alternatively, you can refresh all your materialized views using a custom plpgsql function.

If you’re following this blog post step-by-step and try these refresh functions, PostgreSQL will complain that refreshing concurrently requires a unique index.

Indexing a materialized view using SQLAlchemy:

Initially I tried creating an index using the normal SQLAlchemy command:
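
Something like this (names are assumptions), which works fine for regular tables:

```python
from sqlalchemy import Column, Index, Integer, MetaData, Table

metadata = MetaData()
gear_item_mv = Table('gear_item_mv', metadata,
                     Column('gear_item_id', Integer, primary_key=True))

# unique, because REFRESH ... CONCURRENTLY requires a unique index
idx = Index('uq_gear_item_mv_gear_item_id',
            gear_item_mv.c.gear_item_id, unique=True)
```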

However, that threw an exception because SQLAlchemy tries to create the index as part of the db.metadata creation process, but our materialized view recipe doesn’t actually create the materialized view until after the db.metadata creation finishes. So we need to tell SQLAlchemy to delay creating indexes on materialized views until after the materialized view is created:

Note that rather than creating a custom event to trigger when the materialized view is created, I instead hooked into the after_create event for db.metadata. This is the same event that my create materialized view command hooks onto, so theoretically the create index could happen before the materialized view is created. However, I asked on the SQLAlchemy mailing list, and Mike Bayer said the events are executed in deterministic order based on when they were registered as event listeners. Because the create index call is registered after the create materialized view call, everything should work fine.

A few other notes regarding indexes for materialized views:

  • IF NOT EXISTS can’t be included in the CREATE INDEX command for materialized views.
  • Creating an index uses a SHARE lock, so it doesn’t lock out reads, only writes. Since materialized views are by definition read-only, it’s pointless to create an index using CONCURRENTLY (which uses a SHARE UPDATE EXCLUSIVE lock that allows writes).
  • Using ALTER TABLE ADD INDEX syntax isn’t supported for materialized views since they aren’t tables.

Putting it all together:

I created a simple GitHub repo with working examples of the complete code. There’s both a standalone SQLAlchemy example, and a Flask-based example.

Adding a Recursive CTE:

If you enjoyed this, you might like my next blog post where I walk through how to incorporate a Recursive CTE into your materialized view.

Best practices for sharing Ansible Roles across multiple projects

Ansible is a great tool with a lot of flexibility. It’s generally the easiest configuration management tool for new users to start with due to the batteries-included philosophy, straightforward DSL, and daemonless push model.

However, as your infrastructure goals become more complex, the flexibility means it’s less obvious how things should be structured.

For example, it’s very common to reuse the same role across multiple projects, and I’ve talked with many people who handle this by copy/pasting the role in each project. Anytime they make a change to the role, they have to remember to manually update all the projects that have copies of that role, which is tedious and error-prone.

There is a better way.™ A couple of little-known Ansible features can be combined to easily share a single role across multiple projects without duplicating code.

To do this, I stick all my shared roles in a single master folder that gets shared across all my projects. This avoids the tediousness of manually copy/pasting and updating multiple copies of the same role. If you want more granularity, this technique also supports organizing groups of roles into dedicated folders–perhaps one for roles used in work projects and one for personal projects.

Then I modify each project’s ansible.cfg to tell Ansible to look for roles in that master folder in addition to the local project folder.

Sample ansible.cfg:
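
A minimal version looks something like this (the master folder location is whatever you chose):

```ini
[defaults]
# the shared master folder; multiple paths can be separated with colons
roles_path = ~/ansible-shared-roles
```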

Ansible first searches the local project for a role, then searches the roles_path. You can specify multiple paths by separating them with colons.

By default,  ansible-galaxy install username.rolename will install the role to the roles_path configured in ansible.cfg, so that’s pretty much all you need to do.

Occasionally I want to install the role into the specific project and not the master folder. For example, to avoid version conflicts when two roles have role dependencies that require different versions of the same role. In that case, you can use the -p ROLES_PATH or --roles-path=ROLES_PATH option:
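
For example (role name and path are placeholders):

```sh
# install this one role into the current project
# instead of the shared roles_path
ansible-galaxy install --roles-path=./roles username.rolename
```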

Alternatively, in your project’s requirements.yml, you can manually specify where you want a role to be installed:

If you want to customize things further, there’s currently some discussion about Ansible 2.0 adding support for multiple ansible.cfg files which would let you easily set roles_path at varying levels of specificity. Ansible will read ANSIBLE_CONFIG, ansible.cfg in the current working directory, .ansible.cfg in the home directory or /etc/ansible/ansible.cfg, whichever it finds first.

If you want to see more examples of how I use Ansible, check out my roles on Ansible Galaxy.

How to display current virtualenv in your ZSH Prezto theme

It’s been a few years since I switched from Bash to ZSH on my personal laptop, and I’ve never regretted it. Out of the box, ZSH isn’t that great, but after customizing it using Prezto it’s an amazing timesaver in a lot of little ways.

Oh-My-ZSH is the most well-known ZSH customization framework, but it’s gotten pretty bloated over the years with 200+ plugins, 140+ themes, etc. With that many plugins, some of them are bound to conflict from time to time, and keeping everything straight can be a bit of a mess. It’s also not uncommon for OMZ users who enable a lot of plugins to report shell startup times > 3 seconds, which is annoying.

Prezto started as a fork of Oh-My-ZSH focused on improving performance, and now is an entirely separate project. The Prezto maintainer is a bit gruff and much more inclined to say ‘no’ than ‘yes’ to pull requests, but I see this as a benefit because it minimizes bloat. 95% of the modules that I wanted were already built-in, and it was easy enough to fork the project to add a custom module for the Atom text editor.

Prezto is well-documented, as each module/plugin has its own Readme. Plus there are plenty of blog posts around discussing how to get started with Prezto. Personally, I found it best to just read the Readme for each module, decide which ones I wanted, and enable them. I use GNU Stow to symlink the .zshrc, .zlogin, and .zpreztorc to my ~/.dotfiles/zsh/ folder and manage them as part of my dotfiles repo.

You can see screenshots of all the default Prezto themes here: http://mikebuss.com/2014/04/07/customizing-prezto/

I prefer the Sorin theme, as it strikes a good balance of displaying enough information to be useful without showing so much that it’s distracting. For example, rather than showing the full path, the prompt only displays the first character of the parent folders. This saves space and focuses the eye on the name of the current folder, without leaving you wondering whether you’re in first_app/requirements/ or second_app/requirements/. Behind the scenes, there are some nice touches such as retrieving Git information asynchronously in the background so it doesn’t slow down the display of the command prompt.

However, the one thing I perpetually missed in the Sorin theme was the active virtualenv – I was always typing which python. So this afternoon I finally sat down and figured out how to customize the theme to display it.

Now, whenever I activate a virtualenv, my prompt changes to:

and it just disappears if there’s no activated virtualenv.

It’s quite easy if you want to customize your own Prezto theme–just copy the code changes in this commit: https://github.com/jeffwidman/prezto/commit/9d83811fc7359141a6da232355ce8f066b8a3e82

The only tricky part is if you want the virtualenv name on the right hand side, add $python_info[virtualenv] to RPROMPT in line 83, not line 143. I’m not an expert on ZSH scripting, but as best I can tell, this is because Sorin’s theme retrieves the git info asynchronously in the background, and then overwrites the default RPROMPT in 143 with the output from line 83 when the git info returns.

If you want to see the rest of my Prezto customizations, check out my Prezto fork (main change was adding a custom module for Atom.io) as well as my dotfiles repo where I manage my ZSH and Prezto config file customizations.

 

Configuring a single Nginx + uWSGI server to serve multiple Flask apps

Someone on the Flask mailing list asked how to serve multiple Flask apps via uWSGI + Nginx. Anytime you’re working with uWSGI there are multiple ways to do things, but here’s how I do things for RockClimbing.com. I spent several days reading the uWSGI docs + various blog posts around the net, so this should be reasonably correct.

This example shows how to serve multiple Flask apps where each app has its own domain name. If you want to mount multiple Flask apps under a single domain name, see this example in the Flask docs (the pull request hasn’t been merged as of this writing).

In general, use the uWSGI Emperor, even if you’re only running a single Flask app. The Emperor is a master process that watches over the app(s) to make sure everything is running correctly. Each child app is called a ‘vassal’. If for some reason an app/vassal crashes, then the Emperor will reload it. A number of older blog posts recommend managing your uWSGI apps with supervisord… use the Emperor instead because it offers extra benefits like automatically reloading the app/vassal if you make changes to the vassal’s uWSGI config file.

Create a very basic uWSGI Emperor config called /etc/uwsgi/emperor.ini:
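
A minimal sketch (paths are whatever you use):

```ini
[uwsgi]
# every .ini file in this folder becomes a vassal
emperor = /etc/uwsgi/vassals
```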

If you want to pass a particular option to all vassals, you can specify the option in the emperor.ini file using the vassal-set parameter.

Create a simple vassal config file in /etc/uwsgi/vassals/app_name.ini. By default, the config file name will be used for the vassal process name. I manage my vassals using Ansible, so this config has several Jinja2 variables that look like {{ variable }}. Just manually replace those with what you need.
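
A sketch of such a vassal config, with the Jinja2 variables already replaced by example values (all paths, names, and users are placeholders; the options themselves are discussed below):

```ini
[uwsgi]
# bind to a unix socket -- Nginx must point at this same socket
socket = /run/uwsgi/app_name.sock
chmod-socket = 660

# drop privileges from root to a dedicated user/group
uid = app_user
gid = app_group

# per-app virtualenv to avoid package conflicts
virtualenv = /srv/app_name/venv

# cd here before importing the module
chdir = /srv/app_name
module = wsgi          # wsgi.py defines `app`
callable = app

processes = 4
thunder-lock = true
```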

You can bind the app to either a TCP socket or a Unix socket. Just make sure Nginx is passing requests to the same socket that the vassal is listening on.

For security, specify a non-root Linux user/group for the vassal to run under. Typically you’ll run each vassal as a separate user/group, and then run the Emperor as root so it can start each app process and then drop privileges before actually serving requests.

Since we’re walking through how to run multiple Flask apps, you’ll want to run each under a separate virtualenv to avoid package conflicts. For example, if one Flask app requires sqlalchemy 0.8 and another requires sqlalchemy 0.9, they’ll need to be in separate virtualenvs. uWSGI makes it easy to specify which virtualenv to run the vassal under by passing the virtualenv parameter.

For Flask, typically the callable is app, and the module is the filename where app is defined. You’ll also need to tell uWSGI to cd to the path of the Flask app before trying to import module.

Lastly, if you do any googling about how to scale Flask + uWSGI, you’ll hit a blog post by David Cramer where he found it better to run multiple uWSGI instances and have Nginx handle the load balancing. The thundering herd problem that David experienced is better solved by setting thunder-lock = true in your vassal config (or globally for all vassals in your emperor.ini config). It’s better to let uWSGI handle the load balancing rather than Nginx because Nginx doesn’t know which uWSGI processes are busy and just round-robins through them when it sends requests. If instead you let uWSGI handle the load balancing, it will intelligently pass requests to the processes that are free.

There’s also a number of options commented out–those are simply reminders to myself that those options exist, but I don’t currently use them.

At this point, test that the Emperor starts and correctly loads the vassals by running uwsgi --emperor /etc/uwsgi/emperor.ini. In production, I use systemd to manage the Emperor–the uWSGI docs have an excellent example systemd config file.

Next you need to configure Nginx to pass requests to the proper socket. Here’s an extremely simple Nginx config showing the server and location blocks. I run something a bit more complex in production, but this is easier to understand:
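
Something along these lines (domain and socket path are placeholders; the socket must match the vassal config):

```nginx
server {
    listen 80;
    server_name example.com;

    location / {
        include uwsgi_params;
        # must match the socket the vassal is listening on
        uwsgi_pass unix:/run/uwsgi/app_name.sock;
    }
}
```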

Feel free to ask questions or suggest improvements in the comments.

Tutorial for tuning MySQL / MariaDB for a Xenforo or WordPress VPS

Back when I first started managing my own servers, I had no idea how to tune MySQL. I would read various forum posts where people talked about the necessity of ‘proper tuning’, but when I looked at the official docs, I struggled to differentiate between what really needed tuning versus what were arcane options that only mattered in edge cases.

So here’s the blog post I wish someone else had written.

It’s a longish read, but that’s primarily because I explained why a particular change is suggested… the actual changes are very straightforward and quick. You won’t be a MySQL guru afterwards, but it’ll be good enough for 90% of website admins. And from here you’ll know enough that the official docs will start to make sense.

This is aimed at admins running websites with < 5M pageviews a month off a single server–generally blogs or forums. For example, I run a server for a friend that hosts a handful of forums getting ~3M pageviews a month on an 8GB RAM, 6-core Linode using these tuning recommendations, and the average pageview generally spends < 10ms in MySQL query time.

First, most of these settings will be set in a global my.cnf file somewhere on your server. Typically /etc/my.cnf or /etc/mysql/my.cnf.

Tuning MySQL starts with understanding the table engine. Basically this is how MySQL manages the data underneath the covers, and is typically invisible to the app developer or SQL analyst.

In the early days, MySQL shipped with a table engine called MyISAM, which is blazingly fast for single-user workloads and allowed MySQL to win many database benchmarks that only tested single-user performance. Unfortunately, MyISAM has two huge flaws:

First, it implemented table-level locking for all queries, meaning that if one user had a query running against a particular MyISAM table, no other queries could access that table until the first one finished. This doesn’t matter if you just have a single user running queries, but it causes problems as soon as there are more than a handful of concurrent website visitors.

Second, and more important, MyISAM wasn’t fully ACID compliant. It could (and did) cause data corruption in the event of a random power failure. Most admins never hit this problem, but those that did were royally screwed.

A few years later, Innobase (later acquired by Oracle) developed an alternative table engine called InnoDB. InnoDB is fully ACID compliant, so generally the data corruption issue was fixed. That alone makes it my preferred table engine.

Additionally, InnoDB switched most queries to row-level locking, meaning that only the specific row that a query was looking at would be locked. This allowed multiple queries to run simultaneously against the same table, at the expense of slightly slower throughput on any single query. For websites, InnoDB is almost always a better choice than MyISAM.

Then there are some built-in tables that MySQL uses to manage itself–these are MyISAM by default and should not be changed. Only convert MyISAM tables in your application’s database. Additionally, there are some special MySQL table engines like MEMORY that should generally not be swapped over to InnoDB. If an application creates a table using one of these alternative engines, there’s probably a very good reason for it.

Unfortunately, InnoDB didn’t support full-text indices until MySQL 5.6 (MariaDB 10). Many Linux distributions still ship with MySQL 5.5 as the default, so any tables that include full-text indices can’t be migrated to InnoDB until MySQL is upgraded.

Sidenote for Xenforo admins: This is why the xf_search_index table defaults to MyISAM–it uses a full-text index. If you swap it to InnoDB, be sure to also disable the corresponding full-text search option.

You can set the default table type for new MySQL tables using the default_storage_engine parameter. I strongly recommend setting it to InnoDB. Additionally, it’s easy to swap existing tables from MyISAM over to InnoDB using the SQL command  ALTER TABLE. You will need to stop the web application from accessing the table while it’s being converted–typically I just stop the webserver, as even the largest tables for most blogs/forums generally don’t take more than 20 minutes to swap if you’ve got fast SSDs.
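As a sketch–the database and table names here (myapp, some_table) are placeholders for your own–you can list the application tables still on MyISAM and then convert them one at a time:

```sql
-- Find application tables still using MyISAM
-- ('myapp' is a hypothetical database name)
SELECT table_name, engine,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'myapp'
  AND engine = 'MyISAM';

-- Convert a table to InnoDB (the table is locked for the duration)
ALTER TABLE myapp.some_table ENGINE = InnoDB;
```

Run the SELECT again afterwards to confirm the list is empty.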

When you alter a table to InnoDB, make sure to specify the row_format. The options are compressed, dynamic, and the default compact. dynamic is preferred; compact remains the default only for legacy reasons. If you are bottlenecked on RAM/disk but not CPU, use compressed for tables where compression saves a measurable amount of space (i.e., lots of text columns). For Xenforo admins, the xf_posts table is a good candidate.

In order to use the dynamic row format, you need to change the innodb_file_format from Antelope to Barracuda.
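In my.cnf, that looks like the following (MySQL 5.5/5.6-era settings; note that the Barracuda row formats also require file-per-table tablespaces):

```ini
[mysqld]
# Barracuda is required for the dynamic and compressed row formats
innodb_file_format = Barracuda
# Store each table in its own .ibd file; required for Barracuda formats
innodb_file_per_table = 1
```

With those set, an `ALTER TABLE ... ENGINE=InnoDB ROW_FORMAT=DYNAMIC` will pick up the new format.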

If possible, try to completely migrate all MyISAM tables in your application databases over to InnoDB. That way, you only need to tune InnoDB cache settings and don’t have to worry about MyISAM cache settings.

However, if you still have tables in your application database that for one reason or another are required to stay on MyISAM, then try to set key_buffer_size slightly larger than the combined size of your MyISAM indices (the key buffer caches index blocks, not row data). If all your application tables are InnoDB, then drop key_buffer_size down to 64M. Don’t drop it all the way to 0, because the internal MySQL tables will still use it since they’re MyISAM (and shouldn’t be changed).

For InnoDB, the most important variable to change is innodb_buffer_pool_size. Basically, this is how much RAM MySQL can allocate to the InnoDB cache. For safety, the default is insanely low to make sure that a default MySQL install won’t kill your dev server. However, in production you want this number as high as possible without running out of RAM. Ideally, it’s big enough to fit all your InnoDB tables in RAM. Often that isn’t possible, especially on a VPS where typically the webserver, the db, and Linux’s filesystem cache are all competing for limited RAM. In that case, try to set the InnoDB buffer pool to at least several hundred megabytes.

Storytime: A few months after I bought RockClimbing.com, we migrated to a new host and the site promptly fell over every 10 minutes. I’d rented a pretty beefy dedicated box, and the stats showed we had plenty of unused RAM/CPU capacity. I’d never run my own server before, so it took some digging to realize that the underlying issue was the InnoDB buffer pool was still at 64M. As soon as I increased it to a couple of gigs, the problems disappeared.

For the typical forum or blog, a handful of posts get the majority of the traffic. As long as those posts fit in the InnoDB buffer pool, you’ll typically be fine. For example, I know of a forum with ~30M posts and a total InnoDB table size of ~20GB, but a buffer pool of only ~1GB. The site works just fine because the vast majority of traffic hits the most recent 100K posts.

If you’ve got an InnoDB buffer pool larger than 1GB, then you’ll also need to adjust innodb_buffer_pool_instances. Adding more buffer pool instances adds a little more concurrency, although it’s a tradeoff between having many small pools and a few larger ones. The rule of thumb is a minimum of 1GB per buffer pool instance. However, MariaDB (explained shortly) defaults to an insane 4 or 8 buffer pool instances once you bump innodb_buffer_pool_size over 1GB, so be sure to manually set it lower. The benefits of more buffer pools drop off pretty quickly–even on rockclimbing.com (my largest site), where the buffer pool is 12GB, I only use 4 buffer pool instances.

Lastly, if your version of MySQL/MariaDB supports it, enable innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup. Basically these pre-warm your cache by saving the list of cached pages to disk whenever you shut down MySQL and reloading them whenever you start MySQL. That way, when you reboot your server, MySQL isn’t super slow for the first few pageviews.

Once you swap everything over to InnoDB and then tune the InnoDB Buffer pool, you’re pretty much good for 98% of websites.
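Pulling the settings above together, here’s a rough starting point for my.cnf on a dedicated 8GB box. All the numbers are illustrative–size the buffer pool to your own RAM budget, not mine:

```ini
[mysqld]
# Default engine for newly created tables
default_storage_engine = InnoDB

# The InnoDB cache: as large as you can afford without swapping
innodb_buffer_pool_size = 4G
# Rule of thumb: at least 1GB of buffer pool per instance
innodb_buffer_pool_instances = 4

# MyISAM key cache: small, since only MySQL's internal tables use it
key_buffer_size = 64M

# Pre-warm the buffer pool across restarts (MySQL 5.6+ / MariaDB 10+)
innodb_buffer_pool_dump_at_shutdown = 1
innodb_buffer_pool_load_at_startup = 1
```

Restart MySQL after editing, and watch memory usage for a few days before pushing the buffer pool higher.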

What is MariaDB? Another history lesson: MySQL started as open source, then was acquired by Oracle. While Oracle can’t legally make the database closed source, they’ve been trying to close off as much of it as possible. Monty Widenius, the original creator of MySQL, responded by starting a MySQL replacement project called MariaDB. It adds a number of nice features, along with a small speed increase.

It’s binary compatible with MySQL, so it shouldn’t cause any issues with your app. It’s also easy to install, just use the pre-packaged RPMs they provide. I’ve been using it on CentOS 7 as a drop-in replacement for MySQL with zero issues.

You’ll want to switch to MariaDB 10, as that’s the equivalent of MySQL 5.6 and adds full-text indices to InnoDB. MariaDB 10 also adds some nifty features like regex-based search-and-replace. Very handy for updating URLs after migrating an old vBulletin forum over to Xenforo.

One MySQL tuning tip that you’ll find in many blog posts is to increase the size of the MySQL query cache. However, the benefits are hotly debated–many people say it actually slows things down, because you often lose more time checking the cache and finding it doesn’t have what you want than if you’d just bypassed the cache altogether. I disable it most of the time.
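If you want to disable it explicitly rather than rely on the defaults, the relevant my.cnf lines are:

```ini
[mysqld]
# 0 turns the query cache off; zeroing the size also frees its RAM
query_cache_type = 0
query_cache_size = 0
```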

Keep yourself out of MySQL encoding hell! I’ve been there, it’s not fun. Set the default character set on the MySQL server–UTF-8 is probably what you want. Beware that MySQL’s utf8 charset is actually pseudo-UTF-8 (it only stores up to 3 bytes per character); real UTF-8 is utf8mb4. WordPress switched to utf8mb4 in WP 4.2; Xenforo is still on utf8. You’ll also want to set the default collation, likely to utf8mb4_unicode_ci.
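In my.cnf, the server-wide defaults look like this:

```ini
[mysqld]
# Real 4-byte UTF-8, not MySQL's legacy 3-byte 'utf8'
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
```

Note this only sets the default for newly created databases and tables–existing tables keep their charset until you explicitly convert them.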

Sidenote: I hate how sloppy MySQL’s engineering is. For example, MyISAM not actually being ACID compliant and losing data. UTF-8 not actually being UTF-8. Drives me nuts. If I have the choice, I much prefer PostgreSQL. They may be slower to release features, but at least they get it right when they do release something.

Lastly, a security tip: if you’re running MySQL on the same box as your webserver, make sure MySQL only listens on localhost and rejects all other connections. For remote access, just SSH in, then access MySQL. Most GUIs, including MySQL Workbench (my favorite), support this type of connection.
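A minimal sketch of the setup–bind MySQL to localhost in my.cnf, then tunnel in over SSH when you need GUI access (the hostname below is a placeholder):

```ini
[mysqld]
# Only accept connections from the local machine
bind-address = 127.0.0.1
```

From your laptop, something like `ssh -L 3307:127.0.0.1:3306 user@yourserver` forwards a local port over SSH, and you point your GUI at 127.0.0.1:3307; MySQL Workbench also has a built-in “Standard TCP/IP over SSH” connection type that does the same thing for you.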

The emotional side of being an entrepreneur and the values that drive me day-to-day…

I’ve done a fair number of interviews lately, mostly about Facebook Analytics and EdgeRank.

So it was a refreshing change when Dale from TrekDek asked for an interview about the emotional side of being an entrepreneur and the values that drive me day-to-day.

Read the full interview.

Speaking at TEDx Zagreb in a few hours…

Here’s the rather random backstory about how I found myself in Croatia preparing to speak on “business networking.”


Eighteen months ago, Ryan Stephens graciously asked me to do a guest post on his blog about my approach to networking. I don’t pretend that I’m an amazing networker who can solve all your problems. Generally I’m upfront with people that, “I can’t solve your problems, but I might know someone who can help you.” 😉

Click here for the original guest post.

A number of people tweeted the post, and I received some attention for it. However, I was completely shocked when, a few weeks later, I received an e-mail from Nenad Maljkovic, saying that he saw a tweet linking to my blog post, and liked it… and would I be interested in speaking in Zagreb at TEDx?

It’s not every day you get invited halfway around the world to speak based on a random guest post on a friend’s blog.

Thx Ryan, Nenad, Zjelko, and my other hosts here–I have truly appreciated your hospitality.

It is a bit of a dream come true to give a TED talk, as two years ago I titled an intern application “Next Life Goal: Speak at TED”.

How to Hire And Work With a Virtual Assistant

I used to rely heavily on a virtual assistant, and got a lot of questions about the process. This blog post tries to answer them. These days I spend most of my time writing code and very little time managing emails/meetings, so when my last VA graduated college and moved on to a real job, I didn’t look for a new one.

My explicit instructions for hiring a virtual assistant, including copies of my templates are below.

It all works–when my virtual assistant Katie went on vacation for the entire summer, it took thirty minutes of my time to train someone else–fifteen minutes to explain things the new VA didn’t understand from my templates, and fifteen minutes to give her an e-mail on my Google Apps account. And she’d never worked as an assistant before!

To be clear, a virtual assistant isn’t the same as an outsourced Executive Admin. A friend of mine is the EA for the CTO of a public company, and when I asked about her job, she said a good EA does far more than simply schedule meetings:

It is part of my job to sit in on the CTO’s staff meetings, and to know the direction and goals of our company…. I am the CTO’s partner –and it is my job to make sure that the meetings I allow onto his calendar fit into the bigger picture of what we’re doing.

What I’ve learned from having a virtual assistant:

  • How to be very explicit with my instructions
  • I know better what should be outsourced. A one-time task is almost always faster to do myself, but if it’s a repetitive task, it’s probably worth teaching her.
  • I get a heckuva lot more done–she not only saves time, she removes annoyance–that mental friction that comes from having to do tasks that I downright hate (like scheduling meetings).
  • She not only takes care of things for me, she does them better and faster than I ever could. Face it–just as you’re uniquely talented at some things, you’re uniquely flawed in others.
  • How to teach my employees to teach themselves–it’s rewarding when my VA says she’s learned a ton from working with me!


How I setup my virtual assistant system:

First–what tasks are you going to have them do? How are you going to communicate with them?

I recommend before you start communicating with an actual VA that you take some time and roughly draft out your guidelines for how to manage your calendar, meetings, travel details, and contact and account information. For examples, see the end of this blog post.

Second–who to hire?

  • For a personal assistant, trust is PARAMOUNT. Katie has full access to my Paypal account, credit cards, calendar, e-mail, etc. She could REALLY mess up my life…
  • Reliable college students are often a great fit because you only need to pay them more than the minimum wage they’d earn washing dishes… $8-$15 an hour depending on whether they’re an independent contractor or part-time employee.
  • (Despite the naysayers in the comments below, the two college students I’ve hired have said they learned more by working for me than any other boss they’d ever had… plus they valued the flexibility and part-time nature of the job.)

 

Here are the 10 documents I use. (I just stick them on my personal wiki under a single folder titled “Reference: Assistants”.)

  1. General Information About Being My Assistant–Start Here [Self-explanatory]
  2. How to Add Events to My Calendar [Covers my five calendar categories and reminders]
  3. How to Schedule My Meetings [Addresses the people side of scheduling meetings]
  4. How to Add Someone to Jeff’s Contacts [Self-explanatory]
  5. How to Process My E-mail–Ignore For Now [For now, I find it simplest to handle my own e-mail]
  6. How to Use my Wiki [Explains my folder architecture]
  7. Press Kit [My assistant is responsible for getting press passes to events I want to attend–these links prove that I’m eligible]
  8. Travel Information [Self-explanatory]
  9. Jeff’s Contact Information [Lists contact information for me, my family, and my housemates]
  10. My Accounts [Lists my low-level usernames and passwords for different accounts across the web]

General Information About Being My Assistant–Start Here [Self-explanatory]

(as my VA, feel free to add stuff here as you think of it–ultimately, I want to have a VA document that is clear/self-explanatory/concise…)

Very First Things E-mail: [I copy and paste this into the first e-mail I send to FirstName@jeffwidman.com]

  1. I created a mail account for you – FIRSTNAME@jeffwidman.com.
  2. That account also provides access to your own Google Calendar and Google Docs on the jeffwidman.com domain
  3. I added that e-mail address as an administrator on my wiki.
  4. Go to wiki.jeffwidman.com and login to the wiki using your login instructions.
  5. Find the folder called “r: Assistants” and read the document “General Information About Being My Assistant–Start Here.”
  6. Read everything else.
  7. Go add yourself as a contact in my address book–be sure to include your mobile phone number.
  8. Setup a meeting with me in the next few days–schedule it on my calendar using the wiki instructions.
  9. (Come prepared with an agenda of things you don’t understand from reading the wiki.)
  10. Over the phone, I will give you the secure version of my password–please be VERY careful with this.
  11. We’re off and rolling!

———————–

Start by reading these articles: http://delicious.com/jeffwidman/virtual-assistant

My vision of a successful VA relationship:

– You not only handle my administrative tasks, you do so better than I ever could have.
– I tell you what I want, and you figure out how to make it happen.
– You handle my calendar, schedule meetings, and occasionally other tasks.
– You do not prevent people from contacting me, but instead force them to clarify what they want rather than “chit-chat”.
– We both teach each other to collaborate better–suggest technology, interpersonal skills, marketing advice, etc.
– Never make it look like I’m soooo important that I need an assistant. (You’re freeing me to focus on what I do best.)

Task information:

– Deadlines are important. ‘Nuf said. (Let me know if a deadline is unrealistic.)
– Unless specified, assume time zone is Pacific Standard Zone.
– When you first start working, I NEED confirmations that you received the task (“On it–will be done at X pm” is enough).
– When completed tasks don’t require sending me anything, all I need is an e-mail that says “XXtask done”

Communicating with me:

– Skype: to clarify questions, quick status update, or confirm you received a task
– Phone: Don’t worry about calling me at a bad time. If I don’t want to answer the phone, I won’t.
– E-mail: I generally assign tasks via e-mail–easier to track over time.
– Urgent questions–use: Skype chat, phone, SMS text message. NOT e-mail.
– Non-urgent questions: just e-mail/SMS/skype chat if simple. Call if complex.
(How to leave voice mails: http://delicious.com/jeffwidman/voicemail)

Communicating with others:

– Never masquerade as Jeff.
– If someone wants to talk with me, that’s fine–see the page “How-to Schedule My Meetings”
– Be honest, be tactful, and be yourself.

Decisions:

– You are empowered to make decisions under $50. (Please notify me what you did.)

Billing:

Paypal is preferred. Let me know if you want something different.

E-mail Signature:

(Note: I’m always open to suggestions. I currently list my Google Voice #, up to you whether to change to your personal #. You are welcome to include your personal website and twitter handle as a way of advertising your services.)


Full Name

Remote Assistant for Jeff Widman

(XXX) XXX-XXXX  | YourE-mail@jeffwidman.com
YourWebsite.com   | twitter.com/YourUserName


How to Add Events to My Calendar [Covers my five calendar categories and reminders]

Be very clear about time zones.

Generally I will forward you e-mails for events to add to my calendar.

  • Unless I make additional notes, assume it’s for my day-to-day calendar.
  • E-mail me a simple “done” so I know you took care of it.
  • Do not create all-day appointments on my day-to-day calendar–either block out the specific time I’m busy (eg, 8am-6:30pm if evening free), or put in followup calendar for reminders. (Otherwise it screws up free/busy information that I share with others).
  • If duration unspecified–use your best judgment…
  • Reminders–I never use popups. E-mail preferred. Text message reminders to important meetings. Look over my calendar defaults to understand my preferences. Use your best judgment…
  • If I’m meeting someone, include the location and their phone number in the title. eg, “Meet Tim (123-123-1234) @ Location”
  • Do not use “Meet X” for community events–“meet” is a hot button for me meaning I need to be there.
  • If it’s a physical location that I don’t regularly visit, please put the physical address in the location so I can quickly Google map it. (If e-mail says “Jason’s house” just put that in the location–I’ll know where to go.)
  • If I e-mail you a link to an event, please put the link in the notes–often these events go under community, and I’ll attend if my schedule’s free (and want to check out the link ahead of time).

Currently, these are my calendars:

Day-to-Day

For day-to-day stuff that I need to attend.

This is the only calendar I share with my family/key friends, so if I’m busy, it needs to be on here. If I may/may not attend, it’ll probably go under community events.

Default alerts: e-mail day before and sms before any scheduled calls and meetings.

(If it’s important, change to e-mail 3 days before, and text message 10 hours before face-to-face meetings or 20 minutes before telephone chats.)

Community Events

Things I may or may not attend. Generally not important, but I want to know about. Basic settings: E-mail alerts 7 days before (I’ll make a decision then)

Followup

This is my tickler calendar to remind me of stuff I need to follow up on at a later date, or decisions I’m postponing. Never put actual events I’m attending here. Always phrase appointments with verbs (i.e., “decide on…”, “call X about Y”, “schedule…”).

Default reminders: e-mail 3 days in advance and a text message the day of. (Goal: get my e-mail inbox under control, so I don’t have to get texted about stuff)

Birthday-Anniversary

Where I track family/key friend’s birthdays & anniversaries. These should be all-day events in Google Calendar. Default reminders: 7 day e-mail (I’ll decide whether to get a gift) and e-mail that day (so I can call/e-mail them).

Mentors

This is where I stick recurring monthly appointments to call specific mentors in my life. Reminders: e-mail the day of. Try to schedule these appointments at least two days apart.



How to Schedule My Meetings [Addresses the people side of scheduling meetings]

Rule #1–use your best judgment, even in spite of these rules…

(I trust you, and I’ll let you know if I disagree.)

Rhythm:

  • I start my working day between 8 and 9am
  • I generally quit work between 6pm-1am
  • I work 6-10 hours per day
  • I go to bed around 10:30pm–although it’s not uncommon for me to pull an all-nighter.
  • I work best in three to five hour chunks–thus I prefer to batch a bunch of meetings/calls together.
  • As a strong extrovert, I’m normally quite energetic after a face-to-face breakfast/lunch meeting, and like to have the next few hours free to work off that energy
  • I prefer to keep my evenings (after 5:30pm) unscheduled
  • My life is full, but my schedule is currently very flexible (I like to attend lunch-time frisbee twice a week)
  • Optimum schedule: Lunch T/R, breakfast any weekday
  • You have full access to my calendar, so schedule wherever works best–we’ll refine over time
  • (I work hard to keep my calendar updated all the time. If something needs to be rescheduled, I’ll let you know.)

Responding to Meeting Requests:

  • Currently, most meeting requests from others are people reaching out and looking to connect via phone…
  • I love talking with them, but want to make sure they’re serious about the call.
  • Ask which topics they would like to discuss, and put in the calendar description notes.
  • Here’s a sample e-mail I’ve used before:

Name,

This is YourFirstName, Jeff’s virtual assistant–I manage his calendar and schedule meetings.

Jeff said he appreciated your reaching out to him–he would love to chat with you.
He normally finds a 20-30 minute phone call most efficient.

However, Jeff doesn’t want to waste your time, so he asked that you e-mail me 3 potential discussion topics/questions.

Please send me your phone number, and two good days/times to call you this week (include time zone). I’ll set an appointment on Jeff’s calendar and get back to you.

By the way, I checked out your blog–very nice! [totally optional–only if true]

Thanks!

YourFirstName

Scheduling meeting requests I’ve confirmed:

  • If they respond, email Jeff the confirmed meeting time and location–BCC’ing often works well.
  • If they ignore you, send a 2nd e-mail three days later and ask if they saw the first one–often they just forgot to respond. Ask me if you don’t hear back after two e-mails.
  • When you communicate with anyone, be clear you’re merely facilitating a meeting–you’re not trying to make Jeff seem busy/impressive. Never be pushy.
  • Here’s a great sample e-mail:

Options:

1. How about <meeting place> at <time>?  (You initiate the meeting place.)

2. When would you like to get together?  (Let him set the date.)

Face-to-face:

  • See the “How to add events to my calendar” wiki page
  • ***Important*** Make the title “Meet w/NAME (PHONE #)”
  • Give them my  Google Voice number
  • If we’ve never met, they can see a picture of me here: http://www.jeffwidman.com/blog/about/
  • While most lunch meetings last 1 hour, I prefer to build 1.5 hours into my calendar just in case things are going really well.
  • I’m not rich, but I don’t want to meet in McDonald’s either–and I try to always pick up the check. (I find having them suggest a place solves the problem.)
  • I prefer lunch meetings within a fifteen-minute drive of where I work (shorter is better–currently I work from home). This way I’m gone from work for a maximum of two hours.

Restaurant Preferences:

  • I enjoy all types of food–ask if they have a favorite.
  • (Sushi, and other semi-adventurous foods are always fun.)
  • I prefer to have them suggest a place–unless they’re from out of town, then ask me.
  • In my Google Docs, under Reference, there’s a spreadsheet of Bay Area Restaurants–please add to it when I visit a new place.
  • If they don’t know, call Jason XXXX or Andrea XXXXX (in my address book), tell ’em you’re my Assistant, and ask for ideas–they’re both foodies and know me/my style.

How to Add Someone to Jeff’s Contacts [Self-explanatory]


How to Process My E-mail–Ignore For Now [For now, I find it simplest to handle my own e-mail]


How to Use my Wiki [Explains my folder architecture]


Press Kit [I (used to) write irregularly for VentureBeat & Mashable. My assistant is responsible for getting press passes to events I want to attend–these links prove that I’m eligible]

http://delicious.com/jeffwidman/articles Examples of articles Jeff has written, useful when applying for press passes.

http://delicious.com/jeffwidman/press-kit for random stuff. (Website with articles/content by or that mention Jeff.)

http://delicious.com/jeffwidman/interviews for when people have questions about interviewing me.


Travel Information [Self-explanatory]

(Right now, I travel infrequently, so we’ll handle plane tickets, hotel rooms, etc on a case-by-case basis.)

All travel related emails–like itineraries, hotel rooms, conference registration numbers, etc–should be labeled with the “Travel Details” label–that label maps to a folder on Jeff’s iPod and phone so that Jeff can easily pull them up on his phone when he needs them.

Whenever you buy plane tickets, please forward the itinerary information to “plans@tripit.com” (Unless I specify the trip is to surprise someone–my Tripit account makes my travel plans publicly available.)

Tripit (www.tripit.com) account information can be found in Lastpass.


Jeff’s Contact Information [Lists contact information for me, my family, and my housemates]

Public Contact information-Can give to anyone:

E-mail: jeff@jeffwidman.com

Google Voice Number: (XXX) XXX-XXXX

Private – Do not give to anyone:

Skype:
GChat:

Jeff’s personal cell phone:
Dad’s cell phone:
Parent’s home phone:
Sister cell phone:

My home address:

Housemate #1 Cell Phone:
Housemate #2 Cell Phone

My parent’s home address:

Time where I live:
[I used an embeddable clock widget that shows the time and date where I live. Handy if your VA lives in a different time zone.]



My Accounts

I used to list my low-level usernames and passwords for different accounts across the web, now I just stick it all in a shared Lastpass folder.

I’m officially a college graduate!

Hey all–I officially graduated college in the middle of September. Thx for all the support, encouragement, and friendly ribbing… 🙂

PS: Out of all the time I’ve spent learning how to negotiate, this was by far the most useful 90 minutes: Stanford Podcast with Joel Peterson.