Jeff Widman

July 27, 2024

Traversing a tree stored as an adjacency list using a Recursive CTE built in SQLAlchemy

This is part of a series on combining PostgreSQL Materialized Views, Recursive CTEs, and SQLAlchemy:

Using SQLAlchemy to create and manage PostgreSQL Materialized Views
Traversing a tree stored as an adjacency list using a Recursive CTE built in SQLAlchemy (this post)

The problem:

Recently, while working on the RockClimbing.com codebase, I encountered several situations where data needed to be pre-calculated/cached because on-the-fly calculations were too slow.

For example, RockClimbing.com has over 100,000 climbing routes scattered across 35,000 locations.The locations are essentially a tree structure, ranging in specificity from general area to a specific wall: “North America” > “California” > “Yosemite” > “El Capitan” > “North Face”.

As you browse different areas, it’d be nice to see the total number of climbing routes within each area. For example, if you’re viewing North America, on each state we’d like to display the total number of climbing routes in that state.

In other words, for each node, find all child nodes and then sum the routes attached to those child nodes.

There’s a number of ways to map a tree structure to SQL tables. For simplicity, I store the locations as an adjacency list. Adjacency lists make it easy to insert and reorder nodes, although traversing the tree (especially downwards) can be slow because you have to walk through all the nodes in order to identify the children. An index helps, but it’s still not fast, especially when you need to traverse multiple levels deep.

Trying to solve it with Python:

My initial prototype looked something like this:

def recursive_child_locations(location):
	child_l = []
	if len(location.children) > 0: 
		for child in location.children:
			child_l.extend(recursive_child_locations(location))
	return child_l

recursive_child_route_count = db.session.query(Routes).filter(Routes.location. \
                                                          in_(location, recursive_child_locations(self))).count()

def recursive_child_locations(location):

child_l = []

if len(location.children) > 0:

for child in location.children:

child_l.extend(recursive_child_locations(location))

return child_l

recursive_child_route_count = db.session.query(Routes).filter(Routes.location. \

in_(location, recursive_child_locations(self))).count()

This worked fine with my normal test dataset of 100 locations, but became too slow once I increased the test dataset to 50,000 locations.

Switching to a while loop would improve things slightly since Python doesn’t support tail recursion. But the bigger problem is that accessing location.children is expensive. My location model is configured with lazy='dynamic' which forces SQLAlchemy to issue a new query for every location. A better solution is pre-fetching all the locations and then iterating through them in memory, but that’s still a lot of data being unnecessarily transferred between the database and the app.

Another attempt using a Recursive CTE:

Instead of doing all this data transfer, if I can write a database query that traverses the tree, then I could handle most of the filtering work in the database and only return the results. Plus the database C code will be much faster than interpreted Python code.

Traditionally, traversing adjacency lists was impossible with normal SQL and required writing custom database functions. Thankfully, we’re using PostgreSQL which supports Recursive CTEs (a quick introduction) that make it a good deal easier:

--Find the recursive child locations under the current node
WITH RECURSIVE children(id, parent_id) AS (
    --base case
    SELECT location.id AS id, location.parent_id AS parent_id
    FROM location
    WHERE location.id = 37  -- setting this to id includes current object
                             -- setting to parent_id includes only children
    --combine with recursive part
    UNION ALL 
    SELECT location.id AS location_id
         , location.parent_id AS location_parent_id
    FROM location, children
    WHERE children.id = location.parent_id
)
TABLE children;

--Now calculate count of routes attached to these locations
SELECT count(route.id) AS count_recursive_child_items_for_single_loc
FROM route, children
WHERE route.location_id = children.id;

--Find the recursive child locations under the current node

WITH RECURSIVE children(id, parent_id) AS (

--base case

SELECT location.id AS id, location.parent_id AS parent_id

FROM location

WHERE location.id = 37 -- setting this to id includes current object

-- setting to parent_id includes only children

--combine with recursive part

UNION ALL

SELECT location.id AS location_id

, location.parent_id AS location_parent_id

FROM location, children

WHERE children.id = location.parent_id

)

TABLE children;

--Now calculate count of routes attached to these locations

SELECT count(route.id) AS count_recursive_child_items_for_single_loc

FROM route, children

WHERE route.location_id = children.id;

I tested this using a test dataset to 35,000 locations and 100,000 routes. For the most expensive queries near the top of the tree it took about 200ms-250ms of database time to walk the tree and then calculate the cumulative route count. That’s a huge speedup over doing all the work in Python, but still not fast enough to dynamically generate it on every page load, especially since some pages display the route counts for multiple locations, so they’d need to run the query multiple times.

Pre-calculating the result and storing it:

The nice thing about this problem is we don’t require the results to always be 100% up to date. Only a handful of new routes are added every day, and nothing bad happens if someone browsing the website sees a count that is a few hours out of date.

So a better solution is to pre-calculate the result and cache it somewhere.

I decided against adding a count_recursive_routes column to the location table because it felt hacky–I’d rather separate the calculated/cached data from the original data. I expect the number of things I want to pre-calculate to grow over time, and I don’t want to keep stacking extra columns on the table. Using an in-memory datastore like Redis is another common solution, but so far we haven’t needed it, and I was hesitant to add the extra complexity to our stack just for this.

Instead, I decided to use a PostgreSQL Materialized View. If you’re not familiar with materialized views, my previous blog post provides a good overview.

Generalizing the Recursive CTE across the entire table:

The recursive CTE solution above calculates the recursive route count for a single location, but for our materialized view, we need to generalize the query so it returns route counts for all locations. On the surface, that sounds easy, but it turned out to be much harder. I finally turned to StackOverflow where Erwin Brandstetter suggested solving it using a PostgreSQL ARRAY:

WITH RECURSIVE tree AS (
   SELECT id, parent_id, ARRAY[id] AS path
   FROM   location
   WHERE  parent_id IS NULL

   UNION ALL 
   SELECT c.id, c.parent_id, path || c.id
   FROM   tree t
   JOIN   location c ON c.parent_id = t.id
   )
, tree_ct AS (
   SELECT t.id, t.path, COALESCE(i.item_ct, 0) AS item_ct
   FROM   tree t
   LEFT   JOIN  (
      SELECT location_id AS id, count(*) AS item_ct
      FROM   route
      GROUP  BY 1
      ) i USING (id)
   )
SELECT t.id
     , t.item_ct       AS count_direct_child_items
     , sum(t1.item_ct) AS count_recursive_child_items 
FROM   tree_ct t
LEFT   JOIN tree_ct t1 ON t1.path[1:array_upper(t.path, 1)] = t.path
GROUP  BY t.id, t.item_ct;

WITH RECURSIVE tree AS (

SELECT id, parent_id, ARRAY[id] AS path

FROM location

WHERE parent_id IS NULL

UNION ALL

SELECT c.id, c.parent_id, path || c.id

FROM tree t

JOIN location c ON c.parent_id = t.id

)

, tree_ct AS (

SELECT t.id, t.path, COALESCE(i.item_ct, 0) AS item_ct

FROM tree t

LEFT JOIN (

SELECT location_id AS id, count(*) AS item_ct

FROM route

GROUP BY 1

) i USING (id)

)

SELECT t.id

, t.item_ct AS count_direct_child_items

, sum(t1.item_ct) AS count_recursive_child_items

FROM tree_ct t

LEFT JOIN tree_ct t1 ON t1.path[1:array_upper(t.path, 1)] = t.path

GROUP BY t.id, t.item_ct;

This query took approximately 18 minutes to calculate results for the entire test dataset of 35,000 locations and 100,000 routes. This is plenty fast for something that only needs to run once a day as a background job.

Converting the Recursive CTE to a SQLAlchemy selectable:

In order to use the materialized view with SQLAlchemy, I needed to convert the raw SQL query into a SQLAlchemy selectable and pass it to our custom create_mat_view() function.

Converting the recursive CTE turned out to be more straightforward than I expected. I incrementally built the various components, then combined them into a single query. As mentioned previously, if you’re using Flask-SQLAlchemy, the query needs to be built using the SQLAlchemy core select() function, rather than the more typical session.query().

Anytime you’re translating a complicated SQL query to SQLAlchemy, it’s much easier if you print the individual components to make sure you’re assembling the query correctly. In the below code, I left in a few of these print statements that I used for debugging. You’ll notice that sometimes to see the actual output you need to tell SQLAlchemy to compile the statement using the postgresql.dialect().

_tree_cte = db.select([
                  Location.id, Location.parent_id,
                  # TODO swap postgresql.array to db.array once SQLalchemy 1.1 releases
                  postgresql.array([Location.id]).label('path')
                  ]).where(Location.parent_id==None
                  ).cte(name='tree', recursive=True)
  _tree_cte = _tree_cte.union_all(db.select([
                  Location.id, Location.parent_id,
                  # TODO swap postgresql.array to db.array once SQLalchemy 1.1 releases
                  (_tree_cte.c.path + postgresql.array([Location.id])
                                                          ).label('path'),
                  ]).select_from(db.join(_tree_cte, Location,
                      _tree_cte.c.id==Location.parent_id)))
  # print(db.select([_tree_cte]).compile(dialect=postgresql.dialect())) # DEBUG
  direct_count = db.select([Route.category_id.label('id'),
                          db.func.count(Route.id).label('item_ct'),
                  ]).group_by(Route.category_id).alias('direct_count')
  _tree_count_cte = db.select([_tree_cte.c.id, _tree_cte.c.path,
                  db.func.coalesce(direct_count.c.item_ct, 0
                                                      ).label('item_ct'),
                  ]).select_from(
                      db.join(_tree_cte, direct_count,
                              _tree_cte.c.id==direct_count.c.id, isouter=True)
                  ).cte(name='tree_ct')
  # print(db.select([_tree_count_cte]).compile(dialect=postgresql.dialect())) # DEBUG
  _aliased_tree_count_cte = _tree_count_cte.alias('t1')


gc_mv_query = db.select([
                  _tree_count_cte.c.id.label('id'),
                  _tree_count_cte.c.item_ct.label('count_direct_child_items'),
                  db.func.sum(_aliased_tree_count_cte.c.item_ct
                                  ).label('count_recursive_child_items'),
                  ]).select_from(
                      db.join(_tree_count_cte, _aliased_tree_count_cte,
                          _tree_count_cte.c.path==
                              _aliased_tree_count_cte.c.path[1: 
                              db.func.array_upper(_tree_count_cte.c.path, 1)], 
                          isouter=True)
                  ).group_by(_tree_count_cte.c.id, _tree_count_cte.c.item_ct)
  # print(gc_mv_query.compile(dialect=postgresql.dialect())) # DEBUG

_tree_cte = db.select([

Location.id, Location.parent_id,

# TODO swap postgresql.array to db.array once SQLalchemy 1.1 releases

postgresql.array([Location.id]).label('path')

]).where(Location.parent_id==None

).cte(name='tree', recursive=True)

_tree_cte = _tree_cte.union_all(db.select([

Location.id, Location.parent_id,

# TODO swap postgresql.array to db.array once SQLalchemy 1.1 releases

(_tree_cte.c.path + postgresql.array([Location.id])

).label('path'),

]).select_from(db.join(_tree_cte, Location,

_tree_cte.c.id==Location.parent_id)))

# print(db.select([_tree_cte]).compile(dialect=postgresql.dialect())) # DEBUG

direct_count = db.select([Route.category_id.label('id'),

db.func.count(Route.id).label('item_ct'),

]).group_by(Route.category_id).alias('direct_count')

_tree_count_cte = db.select([_tree_cte.c.id, _tree_cte.c.path,

db.func.coalesce(direct_count.c.item_ct, 0

).label('item_ct'),

]).select_from(

db.join(_tree_cte, direct_count,

_tree_cte.c.id==direct_count.c.id, isouter=True)

).cte(name='tree_ct')

# print(db.select([_tree_count_cte]).compile(dialect=postgresql.dialect())) # DEBUG

_aliased_tree_count_cte = _tree_count_cte.alias('t1')

gc_mv_query = db.select([

_tree_count_cte.c.id.label('id'),

_tree_count_cte.c.item_ct.label('count_direct_child_items'),

db.func.sum(_aliased_tree_count_cte.c.item_ct

).label('count_recursive_child_items'),

]).select_from(

db.join(_tree_count_cte, _aliased_tree_count_cte,

_tree_count_cte.c.path==

_aliased_tree_count_cte.c.path[1:

db.func.array_upper(_tree_count_cte.c.path, 1)],

isouter=True)

).group_by(_tree_count_cte.c.id, _tree_count_cte.c.item_ct)

# print(gc_mv_query.compile(dialect=postgresql.dialect())) # DEBUG

The final print debug statement prints a SQL query that exactly matches the definition of the materialized view (excluding the CREATE MATERIALIZED VIEW portion. Now we just need to pass this to the create_mat_view() function:

class GearCategoryMV(MaterializedView):
    __table__ = create_mat_view("gear_category_mv", _gc_mv_query)

db.Index('gear_category_mv_id_idx', GearCategoryMV.id, unique=True)

class GearCategoryMV(MaterializedView):

__table__ = create_mat_view("gear_category_mv", _gc_mv_query)

db.Index('gear_category_mv_id_idx', GearCategoryMV.id, unique=True)

Querying:

You can read about how to query a materialized view using SQLAlchemy in my previous blog post.

Wrapup:

The more I use SQLAlchemy, the more impressed I am with it (and Mike Bayer, the primary developer behind it). Not many ORMs support creating materialized views using a Recursive CTE, the PostgreSQL-specific ARRAY datatype, and more.

Using SQLAlchemy to create and manage PostgreSQL Materialized Views

This is part of a series on combining PostgreSQL Materialized Views, Recursive CTEs, and SQLAlchemy:

Overview of materialized views in SQL:

If you’re not familiar with SQL views, they’re basically a temporary table whose contents are the output of another SQL query. Normally the table generated by the view only lasts for the duration of a single containing query, but a materialized view stores (materializes) the results so they can be used in later queries.

Materialized views offer a number of benefits:

Fast and simple retrieval. Retrieving data from a materialized view is just like writing queries against a normal table:

SELECT calculated_result FROM my_mat_view WHERE id = 4

1

SELECT calculated_result FROM my_mat_view WHERE id = 4
Unlike normal views, materialized views (at least in PostgreSQL) support traditional database indexes for even faster retrieval:

CREATE UNIQUE INDEX ON my_mat_view (id)

1

CREATE UNIQUE INDEX ON my_mat_view (id)
Simple data maintenance. Recalculating all the data is a simple:

REFRESH MATERIALIZED VIEW my_mat_view

1

REFRESH MATERIALIZED VIEW my_mat_view
Zero-downtime refresh. PostgreSQL 9.4 onward supports refreshing a materialized view concurrently in the background without impacting other queries:

REFRESH MATERIALIZED VIEW CONCURRENTLY my_mat_view

1

REFRESH MATERIALIZED VIEW CONCURRENTLY my_mat_view
Doesn’t add complexity to your infrastructure because it’s built into your database and requires no additional services.

Drawbacks of materialized views:

Any data you want to store in a materialized view has be the result of another SQL query–you can’t generate the data in your app and then store it in the materialized view.
Like any cache, data can go stale. You need to remember to manually issue a REFRESH command to your database. Either your app can update the materialized view after it inserts/updates the source data, or if that causes too many updates, just run a cron job that refreshes the materialized view on a regular schedule. You can also use a database trigger, although you’ll need to create a custom plpgsql function.
You can’t update a single row–only the entire view.

Overall, materialized views in PostgreSQL are a very convenient way to add caching to many applications.

Ways to teach SQLAlchemy about a materialized view:

In order to write ORM queries against my materialized views, I needed to tell SQLAlchemy about the materialized view. Querying a materialized view is very similar to a normal SQL table, so the easiest way to do this is manually create the materialized view in your database. Then, within your app, create a normal SQLAlchemy Table mapping that corresponds to the columns in your materialized view.

While this makes it possible to query the materialized view, it also means that you’ll hit problems when creating/dropping all your tables because SQLAlchemy sees the materialized view as a normal SQL table. While uncommon in production, it’s very common thing to do in a test suite or your local dev instance. Additionally, using a traditional SQLAlchemy table mapping makes it difficult to refresh the materialized view from within your app. I already manage all my database tables and metadata with SQLAlchemy, and wanted the convenience of fully managing my materialized views as well.

Adding custom DDL commands to SQLAlchemy for materialized views:

SQLAlchemy doesn’t have a built-in command for creating views, let alone materialized views. But it is possible to subclass DDLElement and create your own custom DDL commands. After several iterations (and help from Mike Beyer), here’s what I ended up with:

# materialized_view_factory.py
from sqlalchemy.ext import compiler
from sqlalchemy.schema import DDLElement

from app import db # I'm using Flask-SQLAlchemy


class CreateMaterializedView(DDLElement):
 def __init__(self, name, selectable):
 self.name = name
 self.selectable = selectable

@compiler.compiles(CreateMaterializedView)
def compile(element, compiler, **kw):
 # Could use "CREATE OR REPLACE MATERIALIZED VIEW..."
 # but I'd rather have noisy errors
 return 'CREATE MATERIALIZED VIEW %s AS %s' % (
     element.name,
     compiler.sql_compiler.process(element.selectable, literal_binds=True))


def create_mat_view(name, selectable, metadata=db.metadata):
 _mt = db.MetaData() # temp metadata just for initial Table object creation
 t = db.Table(name, _mt) # the actual mat view class is bound to db.metadata
 for c in selectable.c:
 t.append_column(db.Column(c.name, c.type, primary_key=c.primary_key))

 db.event.listen(
 metadata, 'after_create',
 CreateMaterializedView(name, selectable)
 )

 db.event.listen(
 metadata, 'before_drop',
 db.DDL('DROP MATERIALIZED VIEW IF EXISTS ' + name)
 )
 return t

# materialized_view_factory.py

from sqlalchemy.ext import compiler

from sqlalchemy.schema import DDLElement

from app import db # I'm using Flask-SQLAlchemy

class CreateMaterializedView(DDLElement):

def __init__(self, name, selectable):

self.name = name

self.selectable = selectable

@compiler.compiles(CreateMaterializedView)

def compile(element, compiler, **kw):

# Could use "CREATE OR REPLACE MATERIALIZED VIEW..."

# but I'd rather have noisy errors

return 'CREATE MATERIALIZED VIEW %s AS %s' % (

element.name,

compiler.sql_compiler.process(element.selectable, literal_binds=True))

def create_mat_view(name, selectable, metadata=db.metadata):

_mt = db.MetaData() # temp metadata just for initial Table object creation

t = db.Table(name, _mt) # the actual mat view class is bound to db.metadata

for c in selectable.c:

t.append_column(db.Column(c.name, c.type, primary_key=c.primary_key))

db.event.listen(

metadata, 'after_create',

CreateMaterializedView(name, selectable)

)

db.event.listen(

metadata, 'before_drop',

db.DDL('DROP MATERIALIZED VIEW IF EXISTS ' + name)

)

return t

Security Note: Using string interpolation to assemble SQL commands always makes me nervous about accidentally creating a SQL injection vulnerability. None of these commands include user-input, so we’re fine. However, be careful not to accidentally re-use this code somewhere else where you are accepting user input.

Creating a materialized view using SQLAlchemy:

This lets us create a materialized view by passing a SELECT statement generated with SQLAlchemy core to our custom create_mat_view() function.

Here’s an example of how you’d call it–this calculates the total number of reviews and average review rating for each GearItem:

class GearItemMV(db.Model):
    __table__ = create_mat_view("gear_item_mv",
                    db.select(
                        [GearItem.id.label('id'),
                        db.func.count(GearReview.id).label('review_count'),
                        db.func.avg(GearReview.rating).label('review_rating'),]
                    ).select_from(db.join(GearItem, GearReview, isouter=True)
                    ).group_by(GearItem.id))

class GearItemMV(db.Model):

__table__ = create_mat_view("gear_item_mv",

db.select(

[GearItem.id.label('id'),

db.func.count(GearReview.id).label('review_count'),

db.func.avg(GearReview.rating).label('review_rating'),]

).select_from(db.join(GearItem, GearReview, isouter=True)

).group_by(GearItem.id))

The above code will emit the following DDL statement to the database:

CREATE MATERIALIZED VEW AS 
SELECT 
  gear_item.id AS id, 
  count(gear_review.id) AS review_count, 
  avg(gear_review.rating) AS review_rating
FROM 
  gear_item 
LEFT OUTER JOIN 
  gear_review 
    ON gear_item.id = gear_review.gear_item_id 
GROUP BY
  gear_item.id

CREATE MATERIALIZED VEW AS

SELECT

gear_item.id AS id,

count(gear_review.id) AS review_count,

avg(gear_review.rating) AS review_rating

FROM

gear_item

LEFT OUTER JOIN

gear_review

ON gear_item.id = gear_review.gear_item_id

GROUP BY

gear_item.id

Sidenote: I’m using the factory pattern to create my Flask app, so I can only pass a db.select() query to create_mat_view(). Trying to use the normal db.session.query() from the SQLAlchemy ORM won’t work because it creates a circular dependency. The Flask-SQLAlchemy sessionmaker() requires an app context, but the db object (and table mappings) are processed by Python before the app is instantiated. So when SQLAlchemy tries to setup the table mappings, calling db.session() calls the Flask-SQLAlchemy sessionmaker() which throws an exception because there’s no app context. Using db.select() from SQLAlchemy Core avoids this issue because it doesn’t create a session. Within the selectable itself, it’s fine to use ORM constructs such as object references like GearReview. This works because SQLAlchemy doesn’t have to create a session in order to resolve what a GearReview is.

Querying a materialized view using SQLAlchemy:

Since our recipe handles both creating the materialized view and mapping it to an ORM object, querying is very simple:

data = db.session.query(GearItemMV).get(row_id)

1	data = db.session.query(GearItemMV).get(row_id)

Alternatively, if you’re using Flask-SQLAlchemy: GearItemMV.query.get(row_id)

Additionally, this example has a 1:1 mapping between the GearItem objects and the pre-calculated data about them stored in GearItemMV, which makes for convenient queries like this:

gear_item_review_count = db.session.query(GearItemMV).get(gear_item_id).review_count

1	gear_item_review_count = db.session.query(GearItemMV).get(gear_item_id).review_count

While convenient, this results in brittle code because we have direct queries against the materialized view littered throughout our codebase. A better way is to map the materialized view data as attributes on the original objects, decoupling the attribute call from the underlying implementation of how the attribute is calculated/cached. This avoids having to change a bunch of controller/view code if we switch caching mechanisms.

Mapping the cached result as an attribute on the original object:

SQLAlchemy supports several ways to map the materialized view output onto the original object.

Initially I tried using a column_property():

GearItem.review_count = db.column_property(db.select([GearItemMV.review_count]
                                              ).where(GearItem.id==GearItemMV.id))

1 2	GearItem.review_count = db.column_property(db.select([GearItemMV.review_count] ).where(GearItem.id==GearItemMV.id))

Because column_properties are evaluated at import time, this has to be defined after the GearItemMV definition, it cannot be declared inline with the GearItem model definition.

There were three main problems with this approach:

Under the covers, column_property() uses correlated subqueries. Correlated subqueries are slow because the query planner can’t optimize them–it effectively has to run a complete subquery for every row returned by the parent query. So unlike traditional set-based SQL where retrieving additional rows is O(log n), the work required to retrieve column_properties is O(n*m) where n is number of objects and m is the number of column_properties on the objects.
For example, lets say my GearItem class has two column properties review_count and review_rating, and I want to retrieve thirty objects from the database. Effectively the database is processing the outer query to return thirty items, plus an additional query for each correlated subquery for each item. You won’t notice this in the SQL query logs because only one query is sent to the database, but if you actually run EXPLAIN on the query, you’ll see that effectively the one query ballooned into doing the work of 61 queries!
SQLAlchemy can’t take advantage of the pseudo-caching ability of the session since this query doesn’t use session.query.get(). Normally, once an object has been loaded within a session, querying it again using db.session.query(object).get(pk_id) will retrieve the existing object from the SQLAlchemy session without re-querying the database. This only happens using the special get() function which first checks the session’s identity map for the primary key.
Column properties are calculated as soon as an object is loaded. So whenever I load a GearItem, even if I never access the review data, I’m still triggering the expensive correlated subquery.

After a little more research, I switched to mapping the attributes onto the parent object using a SQLAlchemy @hybrid_property. Like a normal python property, it’s only evaluated on access, plus it provides additional benefits.

Here was my initial attempt:

# within GearItem model definition
@hybrid_property
def review_count(self):
    # if GearItem object doesn't have ID, it was just created and hasn't been
    # flushed to the database yet, so it won't have corresponding entry in
    # materialized view
    if self.id:
        gi_mv = GearItemMV.query.get(self.id)
        # GearItemMV definition uses a LEFT JOIN so if self.id exists, yet
        # GearItemMV(self.id) doesn't exist, than we know the materialized view
        # needs refreshing
        if gi_mv is not None:
            return gi_mv.review_count

# within GearItem model definition

@hybrid_property

def review_count(self):

# if GearItem object doesn't have ID, it was just created and hasn't been

# flushed to the database yet, so it won't have corresponding entry in

# materialized view

if self.id:

gi_mv = GearItemMV.query.get(self.id)

# GearItemMV definition uses a LEFT JOIN so if self.id exists, yet

# GearItemMV(self.id) doesn't exist, than we know the materialized view

# needs refreshing

if gi_mv is not None:

return gi_mv.review_count

It works fine for querying individual items. It also caches the GearItemMV object in the session, so if I retrieve review_count and then later retrieve review_rating it will only emit one database call.

However, it doesn’t work for querying at the class level (eg, db.session.query(GearItem).filter(GearItem.review_count < 4)). The SQLAlchemy docs suggest handling this using hybrid_property.expression(), but my attempts felt needlessly overcomplicated and inelegant.

After some more thought, I realized that due to the 1:1 relationship between GearItem rows and GearItemMV rows, I could let the ORM handle the mapping as if it were a traditional foreign-key based relationship. This matched how I was mentally thinking about the original_table: materialized_view relationship, so it should make it easier to write queries.

The only problem is that materialized views can’t have actual foreign keys in the database.

It turns out that SQLAlchemy lets us define table-to-table relationships that effectively function as foreign-key relationships but only within in SQLAlchemy. They are never actually emitted to the database as foreign key constraints:

# within GearItem model definition
mat_view = db.relationship('GearItemMV', backref='original',
                           uselist=False, # makes it a one-to-one relationship
                           primaryjoin='GearItem.id==GearItemMV.id',
                           foreign_keys='GearItemMV.id')

# within GearItem model definition

mat_view = db.relationship('GearItemMV', backref='original',

uselist=False, # makes it a one-to-one relationship

primaryjoin='GearItem.id==GearItemMV.id',

foreign_keys='GearItemMV.id')

This results in a much cleaner @hybrid_property:

# within GearItem model definition
@hybrid_property
def count_gear_items(self):
    if self.mat_view is not None: # if None, mat_view needs refreshing
        return self.mat_view.count_gear_items

# within GearItem model definition

@hybrid_property

def count_gear_items(self):

if self.mat_view is not None: # if None, mat_view needs refreshing

return self.mat_view.count_gear_items

Refreshing a materialized view using SQLAlchemy:

Refreshing a materialized view is straightforward:

REFRESH MATERIALIZED VIEW mat_view_name

1	REFRESH MATERIALIZED VIEW mat_view_name

Unfortunately, this locks the materialized view until the refresh completes. PostgreSQL 9.4 added CONCURRENTLY to let reads continue at the expense of a longer refresh time:

REFRESH MATERIALIZED VIEW CONCURRENTLY mat_view_name

1	REFRESH MATERIALIZED VIEW CONCURRENTLY mat_view_name

My refresh function:

def refresh_mat_view(name, concurrently):
    # since session.execute() bypasses autoflush, must manually flush in order
    # to include newly-created/modified objects in the refresh
    db.session.flush()
    _con = 'CONCURRENTLY ' if concurrently else ''
    db.session.execute('REFRESH MATERIALIZED VIEW ' + _con + name)

def refresh_mat_view(name, concurrently):

# since session.execute() bypasses autoflush, must manually flush in order

# to include newly-created/modified objects in the refresh

db.session.flush()

_con = 'CONCURRENTLY ' if concurrently else ''

db.session.execute('REFRESH MATERIALIZED VIEW ' + _con + name)

For convenience, I turned this into a class method on my materialized views:

class MaterializedView(db.Model):
    __abstract__ = True

    @classmethod
    def refresh(cls, concurrently=True):
        '''Refreshes the current materialized view'''
        refresh_mat_view(cls.__table__.fullname, concurrently)

class MaterializedView(db.Model):

__abstract__ = True

@classmethod

def refresh(cls, concurrently=True):

'''Refreshes the current materialized view'''

refresh_mat_view(cls.__table__.fullname, concurrently)

All my materialized views inherit from this MaterializedView instead of the normal declarative base class. So now I can call GearItemMV.refresh() to refresh that specific view.

Similarly, I also added a function for refreshing all views:

def refresh_all_mat_views(concurrently=True):
    '''Refreshes all materialized views. Currently, views are refreshed in
    non-deterministic order, so view definitions can't depend on each other.'''
    # TODO use `include='materialized'` once https://bitbucket.org/zzzeek/sqlalchemy/issues/3588/ lands
    mat_views = db.inspect(db.engine).get_view_names()
    for v in mat_views:
        refresh_mat_view(v, concurrently)

def refresh_all_mat_views(concurrently=True):

'''Refreshes all materialized views. Currently, views are refreshed in

non-deterministic order, so view definitions can't depend on each other.'''

# TODO use `include='materialized'` once https://bitbucket.org/zzzeek/sqlalchemy/issues/3588/ lands

mat_views = db.inspect(db.engine).get_view_names()

for v in mat_views:

refresh_mat_view(v, concurrently)

You’ll need to tweak this if your materialized views depend on one another, or if your database includes normal (non-materialized) views. Alternatively, you can refresh all your materialized views using a custom plpgsql function.

If you’re following this blog post step-by-step and try these refresh functions, PostgreSQL will complain that refreshing concurrently requires a unique index.

Indexing a materialized view using SQLAlchemy:

Initially I tried creating an index using the normal SQLAlchemy command:

Index(GearItemMV.id).create()

1	Index(GearItemMV.id).create()

However, that threw an exception because SQLAlchemy tries to create the index as part of the db.metadata creation process, but our materialized view recipe doesn’t actually create the materialized view until after the db.metadata creation finishes. So we need to tell SQLAlchemy to delay creating indexes on materialized views until after the materialized view is created:

# this is located within the create_mat_view() function
# see full example code at end of blog post for details

@db.event.listens_for(metadata, 'after_create')
def create_indexes(target, connection, **kw):
    for idx in t.indexes:
        idx.create(connection)

# this is located within the create_mat_view() function

# see full example code at end of blog post for details

@db.event.listens_for(metadata, 'after_create')

def create_indexes(target, connection, **kw):

for idx in t.indexes:

idx.create(connection)

Note that rather than creating a custom event to trigger when the materialized view is created, I instead hooked into the after_create event for db.metadata. This is the same event that my create materialized view command hooks onto, so theoretically the create index could happen before the materialized view is created. However, I asked on the SQLAlchemy mailing list, and Mike Bayer said the events are executed in deterministic order based on when they were registered as event listeners. Because the create index call is registered after the create materialized view call, everything should work fine.

A few other notes regarding indexes for materialized views:

IF NOT EXISTS can’t be included in the CREATE INDEX command for materialized views.
Creating an index uses a SHARE lock, so it doesn’t lockout reads, only writes. Since materialized views are by definition read-only, it’s pointless to create an index using CONCURRENTLY (which uses a SHARE UPDATE EXCLUSIVE lock that allows writes).
Using ALTER TABLE ADD INDEX syntax isn’t supported for materialized views since they aren’t tables.

Putting it all together:

I created a simple GitHub repo with working examples of the complete code. There’s both a standalone SQLAlchemy example, and a Flask-based example.

Adding a Recursive CTE:

If you enjoyed this, you might like my next blog post where I walk through how to incorporate a Recursive CTE into your materialized view.

Best practices for sharing Ansible Roles across multiple projects

Ansible is a great tool with a lot of flexibility. It’s generally the easiest configuration management tool for new users to start with due to the batteries-included philosophy, straightforward DSL, and daemonless push model.

However, as your infrastructure goals become more complex, the flexibility means it’s less obvious how things should be structured.

For example, it’s very common to reuse the same role across multiple projects, and I’ve talked with many people who handle this by copy/pasting the role in each project. Anytime they make a change to the role, they have to remember to manually update all the projects that have copies of that role, which is tedious and error-prone.

There is a better way.™ A couple of little-known Ansible features can be combined to easily share a single role across multiple projects without duplicating code.

To do this, I stick all my shared roles in a single master folder that gets shared across all my projects. This avoids the tediousness of manual copy/pasting and updating multiple copies of the same role. If you want more granularity, this technique also supports organizing groups of roles into dedicated folders–perhaps one for roles used in work projects and one for person projects.

Than I modify each project’s ansible.cfg to tell Ansible to look for roles in that master folder in addition to the local project folder.

Sample ansible.cfg:

[defaults]
roles_path = ~/Code/ansible_roles

1 2	[defaults] roles_path = ~/Code/ansible_roles

Ansible first searches the local project for a role, then searches the roles_path. You can specify multiple paths by separating them with colons.

By default, ansible-galaxy install username.rolename will install the role to the roles_path configured in ansible.cfg, so that’s pretty much all you need to do.

Occasionally I want to install the role into the specific project and not the master folder. For example, to avoid version conflicts when two roles have role dependencies that require different versions of the same role. In that case, you can use the -p ROLES_PATH or --roles-path=ROLES_PATH option:

ansible-galaxy install username.rolename -p ~/Code/project_deploy/ansible/roles/

1	ansible-galaxy install username.rolename -p ~/Code/project_deploy/ansible/roles/

Alternatively, in your project’s requirements.yml, you can manually specify where you want a role to be installed:

# from galaxy
- src: jeffwidman.elasticsearch

# from private github repo, installing to a relative path
- src: https://github.com/jeffwidman/private_ansible_role
path: vagrant/roles/

# from galaxy

- src: jeffwidman.elasticsearch

# from private github repo, installing to a relative path

- src: https://github.com/jeffwidman/private_ansible_role

path: vagrant/roles/

If you want to customize things further, there’s currently some discussion about Ansible 2.0 adding support for multiple ansible.cfg files which would let you easily set roles_path at varying levels of specificity. Ansible will read ANSIBLE_CONFIG, ansible.cfg in the current working directory, .ansible.cfg in the home directory or /etc/ansible/ansible.cfg, whichever it finds first.

If you want to see more examples of how I use Ansible, check out my roles on Ansible Galaxy.

How to display current virtualenv in your ZSH Prezto theme

It’s been a few years since I switched from Bash to ZSH on my personal laptop, and I’ve never regretted it. Out of the box, ZSH isn’t that great, but after customizing it using Prezto it’s an amazing timesaver in a lot of little ways.

Oh-My-ZSH is the most well-known ZSH customization framework, but it’s gotten pretty bloated over the years with 200+ plugins, 140+ themes, etc. With that many plugins, some of them are bound to conflict from time to time, and keeping everything straight can be a bit of a mess. It’s also not uncommon for OMZ users who enable a lot of plugins to report shell startup times > 3 seconds, which is annoying.

Prezto started as a fork of Oh-My-ZSH focused on improving performance, and now is an entirely separate project. The Prezto maintainer is a bit gruff and much more inclined to say ‘no’ than ‘yes’ to pull requests, but I see this as a benefit because it minimizes bloat. 95% of the modules that I wanted were already built-in, and it was easy enough to fork the project to add a custom module for the Atom text editor.

Prezto is well-documented, as each module/plugin has its own Readme. Plus there are plenty of blog posts around discussing how to get started with Prezto. Personally, I found it best to just read the Readme for each module, decide which ones I wanted, and enable them. I use Gnu Stow to symlink the .zshrc, .zlogin, and .zpreztorc to my ~/.dotfiles/zsh/ folder and manage them as part of my dotfiles repo.

You can see screenshots of all the default Prezto themes here: http://mikebuss.com/2014/04/07/customizing-prezto/

I prefer the Sorin theme, as it strikes a good balance of displaying enough information to be useful without showing so much that its distracting. For example, rather than showing the full path, the prompt only displays the first character of the parent folders. This saves space, focuses the eye on the path name of the current folder, but also doesn’t leave you wondering whether you’re in first_app/requirements/ or second_app/requirements/. Behind the scenes, there are some nice touches such as retrieving Git information asynchronously in the background so it doesn’t slow down the display of the command prompt.

However, the one thing I perpetually missed in the Sorin theme was the active virtualenv – I was always typing which python. So this afternoon I finally sat down and figured out how to customize the theme to display it.

Now, whenever I activate a virtualenv, my prompt changes to:

(virtualenv_name) default/sorin/prompt/&nbsp;>>>

1	(virtualenv_name) default/sorin/prompt/ >>>

and it just disappears if there’s no activated virtulaenv.

It’s quite easy if you want to customize your own Prezto theme–just copy the code changes in this commit: https://github.com/jeffwidman/prezto/commit/9d83811fc7359141a6da232355ce8f066b8a3e82

The only tricky part is if you want the virtualenv name on the right hand side, add $python_info[virtualenv] to RPROMPT in line 83, not line 143. I’m not an expert on ZSH scripting, but as best I can tell, this is because Sorin’s theme retrieves the git info asynchronously in the background, and then overwrites the default RPROMPT in 143 with the output from line 83 when the git info returns.

If you want to see the rest of my Prezto customizations, check out my Prezto fork (main change was adding a custom module for Atom.io) as well as my dotfiles repo where I manage my ZSH and Prezto config file customizations.

Configuring a single Nginx + uWSGI server to serve multiple Flask apps

Someone on the Flask mailing list asked how to serve multiple Flask apps via uWSGI + Nginx. Anytime you’re working with uWSGI there are multiple ways to do things, but here’s how I do things for RockClimbing.com. I spent several days reading the uWSGI docs + various blog posts around the net, so this should be reasonably correct.

This example shows how to serve multiple Flask apps where each app has its own domain name. If you want to mount multiple Flask apps under a single domain name, see this example in the Flask docs (the pull request hasn’t been merged yet as of time of this writing).

In general, use the uWSGI Emperor, even if you’re only running a single Flask app. The Emperor is a master process that watches over the app(s) to make sure everything is running correctly. Each child app is called a ‘vassal’. If for some reason an app/vassal crashes, then the Emperor will reload it. A number of older blog posts recommend managing your uWSGI apps with supervisord… use the Emperor instead because it offers extra benefits like automatically reloading the app/vassal if you make changes to the vassal’s uWSGI config file.

Create a very basic uWSGI Emperor config called /etc/uwsgi/emperor.ini:

[uwsgi]
emperor = /etc/uwsgi/vassals

1 2	[uwsgi] emperor = /etc/uwsgi/vassals

If you want to pass a particular option to all vassals, you can specify the option in the emperor.ini file using the vassal-set parameter.

Create a simple vassal config file in /etc/uwsgi/vassals/app_name.ini. By default, the config file name will be used for the vassal process name. I manage my vassals using Ansible, so this config has several Jinja2 variables that look like {{ variable }}. Just manually replace those with what you need.

You can bind the app to either a TCP socket or a Unix socket. Just make sure Nginx is passing requests to the same socket that the vassal is listening on.

For security, specify a non-root Linux user/group for the vassal to run under. Typically you’ll run each vassal as a separate user/group, and then run the Emperor as root so it can start each app process and then drop privileges before actually serving requests.

Since we’re walking through how to run multiple Flask apps, you’ll want to run each under a separate virtualenv to avoid package conflicts. For example, if one Flask app requires sqlalchemy 0.8 and another requires sqlalchemy 0.9, they’ll need to be in separate virtualenvs. uWSGI makes it easy to specify which virtualenv to run the vassal under by passing the virtualenv parameter.

For Flask, typically the callable is app, and the module is the filename where app is defined. You’ll also need to tell uWSGI to cd to the path of the Flask app before trying to import module.

Lastly, if you do any googling about how to scale Flask + uWSGI, you’ll hit a blog post by David Cramer where he found it better to running multiple uWSGI instances and have Nginx handle the load balancing. The thundering herd problem that David experienced is better solved by setting thunder-lock = true in your vassal config (or set it globally for all vassals in your emperor.ini config). It’s better to let uWSGI handle the load balancing rather than Nginx because Nginx doesn’t know which uWSGI processes are busy and just round-robins through them when it sends requests. If instead you let uWSGI handle the load balancing, it will intelligently pass requests to the processes that are free. If you’re wondering why thunder-lock = true isn’t enabled by default, see this note in the uWSGI help docs.

There’s also a number of options commented out–those are simply reminders to myself that those options exist, but I don’t currently use them.

[uwsgi]
socket = 127.0.0.1:3031
# for unix sockets, see `chmod-socket` and `chown-socket`

stats = 127.0.0.1:9191
# use with uwsgitop:
# http://uwsgi-docs.readthedocs.org/en/latest/StatsServer.html

logto = /var/log/uwsgi/%n.log

uid = {{ api_rc_flask_linux_user }}
gid = {{ api_rc_flask_linux_user }}

binary-path = /usr/bin/uwsgi
virtualenv = {{ api_rc_flask_virtualenv }}
chdir = {{ api_rc_flask_path }}
module = manage
callable = app

processes = 2
threads = 4
# to dynamically scale workers see:
# http://uwsgi-docs.readthedocs.org/en/latest/Cheaper.html
enable-threads = true
thunder-lock = true
# http://uwsgi-docs.readthedocs.org/en/latest/OffloadSubsystem.html
offload-threads = 1 # one per CPU is a basic start

# harakiri = 30
# respawn processes after serving 5000 requests (avoid memory leaks)
max-requests    = 5000

# clear environment on exit
# vacuum          = true

# checks python modules every N seconds for changes and autoreloads
# useful in development, avoid in production
# {% if py_autoreload is defined %}
# py-autoreload = 2
# {% endif %}


# limit-as = 512 # ?? for monitoring memory?

# From the uWSGI mailing list:
# 30-40 MB per worker is pretty normal.
#
# To gain memory you can move to multithreading:
#
# master = true
# processes = 2
# threads = 2
# thread-stacksize = 512
#
# should be good enough. Monitoring memory is a good thing, 
# use --reload-on-rss 80 to avoid your app growing up

[uwsgi]

socket = 127.0.0.1:3031

# for unix sockets, see `chmod-socket` and `chown-socket`

stats = 127.0.0.1:9191

# use with uwsgitop:

# http://uwsgi-docs.readthedocs.org/en/latest/StatsServer.html

logto = /var/log/uwsgi/%n.log

uid = {{ api_rc_flask_linux_user }}

gid = {{ api_rc_flask_linux_user }}

binary-path = /usr/bin/uwsgi

virtualenv = {{ api_rc_flask_virtualenv }}

chdir = {{ api_rc_flask_path }}

module = manage

callable = app

processes = 2

threads = 4

# to dynamically scale workers see:

# http://uwsgi-docs.readthedocs.org/en/latest/Cheaper.html

enable-threads = true

thunder-lock = true

# http://uwsgi-docs.readthedocs.org/en/latest/OffloadSubsystem.html

offload-threads = 1 # one per CPU is a basic start

# harakiri = 30

# respawn processes after serving 5000 requests (avoid memory leaks)

max-requests = 5000

# clear environment on exit

# vacuum = true

# checks python modules every N seconds for changes and autoreloads

# useful in development, avoid in production

# {% if py_autoreload is defined %}

# py-autoreload = 2

# {% endif %}

# limit-as = 512 # ?? for monitoring memory?

# From the uWSGI mailing list:

# 30-40 MB per worker is pretty normal.

# To gain memory you can move to multithreading:

# master = true

# processes = 2

# threads = 2

# thread-stacksize = 512

# should be good enough. Monitoring memory is a good thing,

# use --reload-on-rss 80 to avoid your app growing up

At this point, test that the Emperor starts and correctly loads the vassals by running uwsgi --emperor /etc/uwsgi/emperor.ini. In production, I use systemd to manage the Emperor–the uWSGI docs have an excellent example systemd config file.

Next you need to configure Nginx to pass requests to the proper socket. Here’s an extremely simple Nginx config showing the server and location blocks. I run something a bit more complex in production, but this is easier to understand:

# Very useful:
# https://www.digitalocean.com/community/tutorials/understanding-nginx-server-and-location-block-selection-algorithms
# http://blog.martinfjordvald.com/2012/08/understanding-the-nginx-configuration-inheritance-model/


server {
	listen				80;
	listen				[::]:80;
	#listen				443 ssl;
	server_name		api.example.com;

	# Static files should be served through nginx...
	# for simplicity, can handle authentication in Flask as long as X-Sendfile
	# header enabled so Flask generates url, then offloads data streaming to Nginx:
	# http://mattspitz.me/2013/11/20/serving-authenticated-media-with-nginx.html

	# root 					/srv/example.com/static/;

	access_log		/var/log/nginx/api_access.log combined buffer=32k; # TODO syslog and logrotate policy
	error_log			/var/log/nginx/api_error.log; # TODO syslog and logrotate policy

	location / {
			include uwsgi_params;
			uwsgi_pass 127.0.0.1:3031;
	}
}

# Very useful:

# https://www.digitalocean.com/community/tutorials/understanding-nginx-server-and-location-block-selection-algorithms

# http://blog.martinfjordvald.com/2012/08/understanding-the-nginx-configuration-inheritance-model/

server {

listen 80;

listen [::]:80;

#listen 443 ssl;

server_name api.example.com;

# Static files should be served through nginx...

# for simplicity, can handle authentication in Flask as long as X-Sendfile

# header enabled so Flask generates url, then offloads data streaming to Nginx:

# http://mattspitz.me/2013/11/20/serving-authenticated-media-with-nginx.html

# root /srv/example.com/static/;

access_log /var/log/nginx/api_access.log combined buffer=32k; # TODO syslog and logrotate policy

error_log /var/log/nginx/api_error.log; # TODO syslog and logrotate policy

location / {

include uwsgi_params;

uwsgi_pass 127.0.0.1:3031;

}

Feel free to ask questions or suggest improvements in the comments.

Tutorial for tuning MySQL / MariaDB for a Xenforo or WordPress VPS

Back when I first started managing my own servers, I had no idea how to tune MySQL. I would read various forum posts where people talked about the necessity of ‘proper tuning’, but when I looked at the official docs, I struggled to differentiate between what really needed tuning versus what were arcane options that only mattered in edge cases.

So here’s the blog post I wish someone else had written.

It’s a longish read, but that’s primarily because I explained why a particular change is suggested… the actual changes are very straightforward and quick. You won’t be a MySQL guru afterwards, but it’ll be good enough for 90% of website admins. And from here you’ll know enough that the official docs will start to make sense.

This is aimed at admins running websites with < 5M pageviews a month off a single server–generally blogs or forums. For example, I run a server for a friend that has a handful of forums getting ~ 3M pageviews a month on a 8GB RAM, 6 core Linode using these tuning recommendations, and the average pageview generally spends < 10ms in MySQL query time.

First, most of these settings will be set in a global my.cnf file somewhere on your server. Typically /etc/my.cnf or /etc/mysql/my.cnf.

Tuning MySQL starts with understanding the table engine. Basically this is how MySQL manages the data underneath the covers, and is typically invisible to the app developer or SQL analyst.

In the early days, MySQL shipped with a table engine called MyISAM, which is blazingly fast for single-user workloads and allowed MySQL to win many database benchmarks that only tested single-user performance. Unfortunately, MyISAM has two huge flaws:

First, it implemented table-level locking for all queries, meaning that if one user had a query running against a particular MyISAM table, no other queries could access that table until the first one finished. This doesn’t matter if you just have a single user running queries, but it causes problems as soon as there are more than a handful of concurrent website visitors.

Second, and more important, MyISAM wasn’t fully ACID compliant. It could (and did) cause data corruption in the event of a random power failure. Most admins never hit this problem, but those that did were royally screwed.

A few years later, Percona developed an alternative table engine called InnoDB. InnoDB was fully ACID compliant, so generally the data corruption issue was fixed. That alone makes it my preferred table engine.

Additionally, InnoDB switched most queries to row-level locking, meaning that only the specific row that a query was looking at would be locked. This allowed multiple queries to run simultaneously against the same table, at the expense of slightly slower throughput on any single query. For websites, InnoDB is almost always a better choice than MyISAM.

There there are some built-in tables that MySQL uses to manage itself–these are MyISAM by default and should not be changed. Only change MyISAM tables in your application’s database. Additionally, there are some special MySQL table types like ‘memory’ that should generally not be swapped over to InnoDB. If an application creates a table using one of these alternative table engines, there’s probably a very good reason for it.

Unfortunately, InnoDB didn’t support full-text indices until MySQL 5.6 (Maria DB 10). Many linux distributions still ship with MySQL 5.5 as the default, so any tables that include full-text indices can’t be migrated to InnoDB until MySQL is upgraded.

Sidenote for Xenforo admins: This is why the xf_search_index table defaults to MyISAM–it uses a full-text index. If you swap it to InnoDB, be sure to also disable the ‘Enable Delayed Insert SQL Queries’ option under ACP > Performance. Theoretically the type of table engine shouldn’t matter to the client application, but I know of at least once person who started getting weird search results after switching xf_search_index to InnoDB. Personally I haven’t experienced any problems, and I actually experienced significantly faster updates to the Xenforo search index under InnoDB. That said, my personal recommendation is if your Xenforo forum is big enough that you’re worried about tuning MySQL, you should switch to the official Xenforo Elastic Search plugin. Then since the xf_search_index isn’t used anymore, TRUNCATE it to save space. For the xf_sessions and xf_sessions_admin tables, InnoDB is fine but even better is setting up session caching using memcache.

You can set the default table type for new MySQL tables using the default_storage_engine parameter. I strongly recommend setting it to InnoDB. Additionally, it’s easy to swap existing tables from MyISAM over to InnoDB using the SQL command ALTER TABLE. You will need to stop the web application from accessing the table while it’s being converted–typically I just stop the webserver, as even the largest tables for most blogs/forums generally don’t take more than 20 minutes to swap if you’ve got fast SSDs.

When you alter to InnoDB, make sure to specify the row_format. Options are compressed, dynamic, or the default compact. dynamic is preferred, compact is the default only due to legacy reasons. If you are bottlenecked on RAM/disk, but not CPU, then use compressed for tables where compression saves a measurable amount of space (ie, lots of text columns)… For Xenforo admins, the xf_posts table is a good candidate.

In order use the dynamic row format, you need to change the innodb_file_format from Barracuda to Antelope.

If possible, try to completely migrate all MyISAM tables in your application databases over to InnoDB. That way, you only need to tune InnoDB cache settings and don’t have to worry about MyISAM cache settings.

However, if you still have tables in your application database that for one reason or another are required to stay on MyISAM, then try to set key_buffer_size to slightly larger than your largest MyISAM table. If all your application tables are InnoDB, then drop key_buffer_size down to 64M. Don’t drop it all the way to 0 because the internal MySQL tables will use it since they’re still MyISAM (and shouldn’t be changed).

For InnoDB, the most important variable to change is innodb_buffer_pool_size. Basically, this is how much RAM MySQL can allocate to the InnoDB cache. For safety, the default is insanely low to make sure that a default MySQL install won’t kill your dev server. However, in production you want this number as high as possible without running out of RAM. Ideally, it’s big enough to fit all your InnoDB tables in RAM. Often that isn’t possible, especially on a VPS where typically the webserver, the db, and Linux’s filesystem cache are all competing for limited RAM. In that case, try to set the InnoDB buffer pool to at least several hundred megabytes.

Storytime: A few months after I bought RockClimbing.com, we migrated to a new host and the site promptly fell over every 10 minutes. I’d rented a pretty beefy dedicated box, and the stats showed we had plenty of unused RAM/CPU capacity. I’d never run my own server before, so it took some digging to realize that the underlying issue was the InnoDB buffer pool was still at 64M. As soon as I increased it to a couple of gigs, the problems disappeared.

For the typical forum or blog, a handful of posts get the majority of the traffic. As long as those posts can fit in the InnoDB buffer pool, you’ll typically be fine. For example, I know of a forum with ~30M posts, with total InnoDB table size of ~20gb, but a buffer pool of only ~1GB. Site works just fine because vast majority of traffic hits the most recent 100K posts.

If you’ve got a InnoDB buffer pool larger than 1 GB, then you’ll also need to adjust the innodb_buffer_pool_instances. Adding more buffer pools adds a little more concurrency, although it’s a tradeoff between having many small pools versus a few larger pools. The rule of thumb is a minimum of 1 GB per buffer pool instance. However, MariaDB (explained shortly) defaults to an insane 4 or 8 buffer pool instances once you bump over 1 GB innodb_buffer_pool_size, so be sure to manually set it lower. The benefits of more buffer pools drop off pretty quickly–even on rockclimbing.com (my largest site) where the buffer pool is 12GB, I only have 4 buffer pools.

Lastly, if your version of MySQL/MariaDB supports it, enable innodb_save_on_shutdown and innodb_load_on_startup. Basically it pre-warms your cache by saving the cache to disk whenever you shut down MySQL and reading it from disk whenever you start MySQL. That way when you reboot your server MySQL isn’t super slow for the first few pageviews.

Once you swap everything over to InnoDB and then tune the InnoDB Buffer pool, you’re pretty much good for 98% of websites.

What is MariaDB? Another history lesson: MySQL started as open source, then was acquired by Oracle. While Oracle can’t legally make the database closed source, they’ve been trying to close source as much as possible. The guy who started MySQL started a MySQL replacement project called MariaDB. It adds a number of nice features, including a small speed increase.

It’s binary compatible with MySQL, so it shouldn’t cause any issues with your app. It’s also easy to install, just use the pre-packaged RPMs they provide. I’ve been using it on CentOS 7 as a drop-in replacement for MySQL with zero issues.

You’ll want to switch to MariaDB 10, as that’s the equivalent of MySQL 5.6 and adds the full text indices to InnoDB. MariaDB 10 also adds some nifty features like a regex-based search-and-replace. Very handy for updating urls after migrating an old vBulletin forum over to Xenforo.

One MySQL tuning tip that you’ll find in many blog posts is to increase the size of the MySQL Query Cache. However, the benefits are hotly debated, many people say it actually slows things down because in many situations because you’re losing more time checking the cache and finding it doesn’t have what you want than if you just bypassed the cache altogether. I just disable it most of the time.

Keep yourself out of MySQL encoding hell! I’ve been there, it’s not fun. Set the default character set in the mysql server. UTF-8 is probably what you want. MySQL’s UTF-8 is actually pseudo-UTF8, the real UTF-8 is UTF-8MB4. WordPress switched to MB4 in WP 4.2, Xenforo still on UTF-8. You’ll also want to set the default collation, likely to utf8mb4_unicode_ci .

Sidenote: I hate how MySQL has sloppy engineering. For example, MyISAM not actually being ACID compliant and losing data. UTF-8 not actually being UTF-8. Drives me nuts. If I have the choice, I much prefer PostgreSQL. They may be slower on feature releases, but at least they get it right when they do release something.

Lastly, a security tip. If you’re running MySQL on the same box as your webserver, then make sure MySQL only listens on localhost and rejects all other connections. For remote access, just SSH in, then access MySQL. Most GUIs, including MySQL Workbench (my favorite) support this type of connection.

The emotional side of being an entrepreneur and the values that drive me day-to-day…

I’d done a fair number of interviews lately, mostly about Facebook Analytics and EdgeRank.

So it was a refreshing change when Dale from TrekDek asked for an interview about the emotional side of being an entrepreneur and the values that drive me day-to-day.

Dale’s website is now offline, so I copied his post from archive.org:

I recently interviewed Jeff Widman, the founder of PageLever. PageLever is a funded, Y-Combinator start-up that specializes in Facebook Analytics and Facebook Insights.

I wanted to interview him because he is 2009 college graduate like myself, seems to be more successful than most of our peers, and because I thought he’d have excellent insights and advice for Generation-Y.

Our conversation was fascinating and I spent quite a bit of time thinking about the best way to put it into a blog post. I decided to outline the various themes of our interview and post relevant “sound bytes” by Jeff under those themes. I also highlighted sound bytes that are pieces of actionable advice.

On Being Lost and Comparing Yourself to you Peers

Much of my conversation with Jeff was spent talking about how he figured out what he “wanted to do” and how he compared himself to his peers. I know from personal experience that many recent college graduates feel they’re not doing as well as they should be relative to some of their more successful friends. Facebook encourages constant comparison between you and your network and gives an inaccurate representation of how everyone is doing. I wanted to find out if Jeff ever worried about this type of thing and if so, how he dealt with it.

“I think I worried about that [feeling behind] more in college. Most of my peers would take different internships in various industries and I chose to spend my summer volunteering at a boys camp in the middle of North Dakota. I felt like I was missing out on all this stuff.”

“I remember being scared coming out of college. I knew people who graduated and spent 6 to 9 months living with their parents and I thought “I’m not going to be that guy, I’m going to have a job waiting for me when I graduate.” I felt like I was way behind.

“People only post the most interesting stuff on FB and you get this world of fake lives and you think everyone else has it better than me. It’s a lie!”

“I’ve been warped since childhood as far as comparing myself to peers. I was home schooled and most of the people who I know that were home schooled are a little weird, including myself. I grew up going to play sports at the public schools and it would always be like “I was the different kid.” Certainly I could get along with them and they were my friends but I was still different and I was very used to it and comfortable with it.”

On Hustle, Networking, and Living out of your Car

Learning to hustle and network is not something they teach in college. At best, you are encouraged to join a few cubs and attend alumni networking events. After graduation, many of my peers sent out resumes to no avail. Jeff started with some internships in college, but then did something that many people are afraid to do: move somewhere new and live out of your car.

“I had an internship in manufacturing and what I noticed was that people worked very hard in manufacturing, they had a high work ethic, but a low change ethic.

I just decided that I wanted to work on a team of smart, intense people that worked hard, and had a high change ethic. I wanted a small team internally so I could be good friends with them. Externally, I wanted to meet a lot of new and interesting people.

I didn’t know anything beyond that and I was on a search to find it.”

“I finished up at TechCrunch and strung together a string of three to four, 1-3 month internships. I highly recommend it if you can afford it. I think I had negative cash flow and pretty much lived out of my car and lived very, very cheaply, but I saw not just different products, but different cultures (company cultures), different locations, and I learned a lot.

“Find the people who are the absolute best at whatever you want to do, and convince them by hook or by crook to let you spend time with them.”

“I made a list of people in Silicon Valley and contacted them through various means and none of them would hire me.”

“I was at dinner at Mint.com and said ‘I need a job so I can pay rent.’ They said, ‘Hey why don’t you do our Facebook or YouTube marketing.’

Looking back that was a great decision but it was totally accidental. I did well at it and I got more and more clients and before I knew, I had a small marketing agency and a few employees. It wasn’t premeditated in the least. I just kept moving forward and trying to perform very well and it worked out.”

“I see a lot of people exiting college and they’re not quite sure what they want to do. They go back and live with their parents and while it’s not a bad thing, if you’re not sure what you want to do, sitting around won’t help that problem. Go get a job and if you’re frustrated by it, at least you’ll know what you don’t want to do. You’ll probably say ‘here’s a few things I liked and here’s a few things I didn’t like. How can I get a job that moves more towards what I like and away from what I don’t like.’”

“Just bouncing around and getting some knocks and bruises is extremely helpful.”

“It’s also just being humble enough to do jobs that may not appeal to you at first.”

Work-Life Balance

Start-ups have a reputation of working crazy hours. Frankly, I think that’s unsustainable and think it’s strange that many start-up founders revel in it. I was curious to hear how Jeff allocated his time as a start-up founder, husband, and friend. Turns out Jeff works long hours but still makes time for himself and others.

“I spent a tremendous amount of time working, I don’t spend as much as my co-founder.

Now is the time that I want to work very hard. later on in life I want to be able to retire a bit more. If I work 150% harder now, I will maybe get 300-500% in returns.”

“I spent time volunteering at a nursing home and I’d ask the people there ‘what were you really glad you spent time on.’ They all said ‘family and friends’ and that’s something I think a lot about.

I try to make sure I have a date night every week with my wife and have dinner with her. I also make time to spend with friends. I have a list of 6-10 guy friends who I try to call once every 1-2 months and spend time with them.

I also spend one night a week volunteering with a youth group. Every time I do that it reminds me I’m just so privileged. Serving benefits you more than the people you are serving.”

“Many people have this idea that a well-balanced life is like a bento box, with a little time spent doing this, a little time spent doing that.

I don’t look at it as work is here, family is here, and friends are here, and my Christian beliefs are over there. I see them as connected. I see the work as I do as a direct result of trying to glorify God. I see it as a direct result of attempting to make a difference in the lives of people I work with.”

Dealing with Highs and Lows

Entrepreneurship is an emotional roller coaster. I think you can mitigate some of the more extreme dips if you become more process focused and not outcome focused, but it’s not completely avoidable. Jeff knows firsthand what that’s like and shared some of his thoughts on subject.

“The highs are wonderful! I love adventure, and I get a lot of adventure from my work. Business can be very, very exciting.

People have a romantic view of entrepreneurship, but sometimes it sucks! It’s such an emotional roller coaster. Even if you like what you do it can be boring and rough and it’s not glamorous.

I have two co-founders. I have a co-founder in life, my wife. She has been very stable when things have been tough. We’ll go for a walk and talk it out. She is super stabilizing. It’s great to have someone to talk things through with.

With my business co-founder, David and I we’ll encourage each other when things suck. From David, I need more of the emotional cheerleader. Sometimes he needs me to say ‘take some time off and get a little distance’.

You can’t do it alone.

Identity and Faith

For a long time I wanted to be a Navy SEAL. There was no ambiguity until I realized that was something I didn’t want to do. I can’t say that I have a coherent “identity” quite yet but I’m working on it. I asked Jeff if he felt as pulled together as he looked. I was surprised to learn that his primary identity is as a Christian. I thought of tech start-up founders as spiritual but religiously agnostic (this was not based on any facts or data). Jeff is definitely not in the midst of a quarter-life crisis.

“My identity comes from my faith as a Christian. That’s the big picture for me. It’s the lens I use to look at the rest of my life.”

“Identity is partially a choice, and partially intrinsic personality traits that you become aware of as time goes on.”

“My wife told me once, ‘You know Jeff when you started the company you were so confident, and now you’re not less confident but you know so much more and you say you know so much less.’”

Definition of Success

I wrote before that the definition of success is simply “achieving your goals.” Not everyone’s goals are the same. I wanted to find out what Jeff’s goals were and how he defined success.

“For business, it’d be great to make some money. We took some outside investment and I made a commitment to deliver a return on their investment.

At the end of the day, money is validating as a form of “Hey you picked something that makes a difference in people’s lives (generally).”

I want my investors to say “Hey I got a great return.” I want my employees to say “Hey I’d work for Jeff again because he’s a leader that cares about me even if it costs him something.” My family would say “Jeff didn’t put the business before us” and my customers to say Jeff didn’t sell us snake oil and he solved some of our problems.

Am I going to look at the end of my life and say I served my God and that I lived humbly and I loved my wife and friends and family? And not just love where they feel cared for but actively loved. Not just to make memories but to challenge them a bit, especially my friends. True friends challenge each other.

Book Recommendation

I read. A lot. I figure it’s good practice to read what successful people read. I asked Jeff if he had any recommendations.

The best book I ever read is called “The Goal.” It’s about the theory of restraints. There are a lot of things you can work on at any given point in time. What Goldratts’ book shows is that it’s so easy to work on the wrong things. The first thing you should do is fix the bottleneck. The first step to fixing the bottleneck is to identity the bottleneck.

That mentality is incredibly useful when applied to the rest of life.

Final Dale Thoughts

I really enjoyed talking with Jeff. More importantly, he is relatable. I think many people assume that tech start-up founders are mythical entities who have an innate X-Factor that they don’t have.

This is nonsense.

Jeff offered concrete pieces of advice that can be used by anyone. They boiled down to

Learn to hustle and network
Live cheaply and stay humble.
Be deliberate in learning about yourself
Develop strong relationships and put in the time to maintain them
Volunteer
Have faith (if not religion than something greater than yourself)
Don’t compare yourself to others

There you have it. Solid pieces of advice that anyone can use.

Speaking at TEDx Zagreb in a few hours…

Here’s the rather random backstory about how I found myself in Croatia preparing to speak on “business networking.”

tedxzagreb

Eighteen months ago, Ryan Stephens graciously asked me to do a guest post on his blog about my approach to networking. I don’t pretend that I’m an amazing networker who can solve all your problems. Generally I’m upfront with people that, “I can’t solve your problems, but I might know someone who can help you.” 😉

Click here for the original guest post.

A number of people tweeted the post, and I received some attention for it. However, I was completely shocked when few weeks later, I received an e-mail from Nenad Maljkovic, saying that he saw a tweet linking to my blog post, and liked it… and would I be interested in speaking in Zagreb at TEDx?

Not every day you get invited half way around the world to speak based on a random guest post on a friend’s blog.

Thx Ryan, Nenad, Zjelko, and my other hosts here–I have truly appreciated your hospitality.

It is a bit of a dream come true to give a TED talk, as two years ago I titled an intern application “Next Life Goal: Speak at TED”.

How to Hire And Work With a Virtual Assistant

I used to rely heavily on a virtual assistant, and got a lot of questions about the process. This blog post tries to answer them. These days I spend most of my time writing code and very little time managing emails/meetings, so when my last VA graduated college and moved onto a real job, I didn’t look for a new one.

My explicit instructions for hiring a virtual assistant, including copies of my templates are below.

It all works–when my virtual assistant Katie went on vacation for the entire summer, it took thirty minutes of my time to train someone else–fifteen minutes to explain things the new VA didn’t understand from my templates, and fifteen minutes to give her an e-mail on my Google Apps account. And she’d never worked as an assistant before!

To be clear, a virtual assistant isn’t the same as an outsourced Executive Admin. A friend of mine is the EA for the CTO of a public company, and when I asked about her job, she said a good EA does far more than simply schedule meetings:

It is part of my job to sit in on the CTO’s staff meetings, and to know the direction and goals of our company. I am the CTO’s partner and it is my job to make sure that the meetings I allow onto his calendar fit into the bigger picture of what were doing.

What I’ve learned from having a virtual assistant:

How to be very explicit with my instructions
I know better what should be outsourced. It’s almost always faster for me if it’s a one-time thing. But if it’s a repetitive task, it’s probably worth teaching her.
I get a heckuva lot more done–she not only removes time, she removes annoyance–that mental friction that comes from having to do tasks that I downright hate (like scheduling meetings.)
She not only takes care of things for me, she does them better and faster than I ever could. Face it–just as you’re uniquely talented at some things, you’re uniquely flawed in others.
How to teach my employees to teach themselves–it’s rewarding when my VA says she’s learned a ton from working with me!

Computer keyboard with female hands

How I setup my virtual assistant system:

First–what tasks are you going to have them do? How are you going to communicate with them?

I recommend before you start communicating with an actual VA that you take some time and roughly draft out your guidelines for how to manage your calendar, meetings, travel details, and contact and account information. For examples, see the end of this blog post.

Second–who to hire?

For a personal assistant, trust is PARAMOUNT. Katie has full access to my Paypal account, credit cards, calendar, e-mail, etc. She could REALLY mess up my life…
Reliable college students are often a great fit because you only need to pay them more than the minimum wages they’d be paid for washing dishes… $8-$15 an hour depending on whether they’re an independent contractor or part-time employee.
(Despite the naysayers in the comments below, the two college students I’ve hired have said they learned more by working for me than any other boss they’d ever had… plus they valued the flexibility and part-time nature of the job.)

Here are the 10 documents I use. (I just stick them on my personal wiki under a single folder titled “Reference: Assistants”.)

General Information About Being My Assistant–Start Here [Self-explanatory]
How to Add Events to My Calendar [Covers my five calendar categories and reminders]
How to Schedule My Meetings [Addresses the people side of scheduling meetings]
How to Add Someone to Jeff’s Contacts [Self-explanatory]
How to Process My E-mail–Ignore For Now [For now, I find it simplest to handle my own e-mail]
How to Use my Wiki [Explains my folder architecture]
Press Kit [My assistant is responsible for getting press passes to events I want to attend–these links prove that I’m eligible]
Travel Information [Self-explanatory]
Jeff’s Contact Information [Lists contact information for me, my family, and my housemates]
My Accounts [Lists my low-level usernames and passwords for different accounts across the web]

General Information About Being My Assistant–Start Here [Self-explanatory]

(as my VA, feel free to add stuff here as you think of it–ultimately, I want to have a VA document that is clear/self-explanatory/concise…)

Very First Things E-mail: [I copy and paste this into the first e-mail I send to FirstName@jeffwidman.com]

I created a mail account for you – FIRSTNAME@jeffwidman.com.
That account also provides access to your own Google Calendar and Google Docs on the jeffwidman.com domain
I added that e-mail address as an administrator on my wiki.
Go to wiki.jeffwidman.com and login to the wiki using your login instructions.
Find the folder called “r: Assistants” and read the document “General Information About Being My Assistant–Start Here.”
Read everything else.
Go add yourself as a contact in my address book–be sure to include your mobile phone number.
Setup a meeting with me in the next few days–schedule it on my calendar using the wiki instructions.
(Come prepared with an agenda of things you don’t understand from reading the wiki.)
Over the phone, I will give you the secure version of my password–please be VERY careful with this.
We’re off and rolling!

———————–

Start by reading these articles: http://delicious.com/jeffwidman/virtual-assistant

My vision of a successful VA relationship:

– You not only handle my administrative tasks, you do so better than I ever could have.
– I tell you what I want, and you figure out how to make it happen.
– You handle my calendar, schedule meetings, and occasionally other tasks.
– You do not prevent people from contacting me, but instead force them to clarify rather than “chit-chat”.
– We both teach each other to collaborate better–suggest technology, interpersonal skills, marketing advice, etc.
– Never make it look like I’m soooo important that I need an assistant. (You’re freeing me to focus on what I do best.)

Task information:

– Deadlines are important. ‘Nuf said. (Let me know if a deadline is unrealistic.)
– Unless specified, assume time zone is Pacific Standard Zone.
– When you first start working, I NEED confirmations that you received the task (”On it–will be done at Xpm” is enough).
– When you complete tasks don’t require sending me anything, all I need is an e-mail that says “XXtask done”

Communicating with me:

– Skype: to clarify questions, quick status update, or confirm you received a task
– Phone: Don’t worry about calling me at a bad time. If I don’t want to answer the phone, I won’t.
– E-mail: I generally assign tasks via e-mail–easier to track over time.
– Urgent questions–use: Skype chat, phone, SMS text message. NOT e-mail.
– Non-urgent questions: just e-mail/SMS/skype chat if simple. Call if complex.
(How to leave voice mails: http://delicious.com/jeffwidman/voicemail)

Communicating with others:

– Never masquerade as Jeff.
– If someone wants to talk with me, that’s fine–see the page “How-to Schedule My Meetings”
– Be honest, be tactful, and be yourself.

Decisions:

– You are empowered to make decisions under $50. (Please notify me what you did.)

Billing:

Paypal is preferred. Let me know if you want something different.

E-mail Signature:

(Note: I’m always open to suggestions. I currently list my Google Voice #, up to you whether to change to your personal #. You are welcome to include your personal website and twitter handle as a way of advertising your services.)

—
Full Name

Remote Assistant for Jeff Widman

(XXX) XXX-XXXX | YourE-mail@jeffwidman.com
YourWebsite.com | twitter.com/YourUserName

How to Add Events to My Calendar [Covers my five calendar categories and reminders]

Be very clear about time zones.

Generally I will forward you e-mails for events to add to my calendar.

Unless I make additional notes, assume it’s for my day-to-day calendar.
E-mail me a simple “done” so I know you took care of it.
Do not create all-day appointments on my day-to-day calendar–either block out the specific time I’m busy (eg, 8am-6:30pm if evening free), or put in followup calendar for reminders. (Otherwise it screws up free/busy information that I share with others).
If duration unspecified–use your best judgment…
Reminders–I never use popups. E-mail preferred. Text message reminders to important meetings. Look over my calendar defaults to understand my preferences. Use your best judgment…
If I’m meeting someone, include the location and their phone number in the title. eg, “Meet Tim (123-123-1234) @ Location”
Do not use “Meet X” for community events–“meet” is a hot button for me meaning I need to be there.
If it’s a physical location that I don’t regularly visit, please put the physical address in the location so I can quickly Google map it. (If e-mail says “Jason’s house” just put that in the location–I’ll know where to go.)
If I e-mail you a link to an event, please put the link in the notes–often these events go under community, and I’ll attend if my schedule’s free (and want to check out the link ahead of time).

Currently, these are my calendars:

Day-to-Day

For day-to-day stuff that I need to attend.

This is the only calendar I share with my family/key friends, so if I’m busy, it needs to be on here. If I may/may not attend, it’ll probably go under community events.

Default alerts: e-mail day before and sms before any scheduled calls and meetings.

(If it’s important, change to e-mail 3 days before, and text message 10 hours before face-to-face meetings or 20 minutes before telephone chats.)

Community Events

Things I may or may not attend. Generally not important, but I want to know about. Basic settings: E-mail alerts 7 days before (I’ll make a decision then)

Followup

This is my tickler calendar to remind me of stuff I need to follow up on at a later date, or decisions I’m postponing. Never put actual events I’m attending here. Always phrase appointments with verbs. (ie, “decide on…, call X about Y, schedule…)

Default reminders: e-mail 3 days in advance and a text message the day of. (Goal: get my e-mail inbox under control, so I don’t have to get texted about stuff)

Birthday-Anniversary

Where I track family/key friend’s birthdays & anniversaries. These should be all-day events in Google Calendar. Default reminders: 7 day e-mail (I’ll decide whether to get a gift) and e-mail that day (so I can call/e-mail them).

Mentors

This is where I stick recurring monthly appointments to call specific mentors in my life. Reminders: e-mail the day of. Try to schedule these appointments at least two days apart.

How to Schedule My Meetings [Addresses the people side of scheduling meetings]

Rule #1–use your best judgment, even in spite of these rules…

(I trust you, and I’ll let you know if I disagree.)

Rhythm:

I start my working day between 8 and 9am
I generally quit work between 6pm-1am
I work 6-10 hours per day
I go to bed around 10:30pm–although not uncommon to pull an allnighter.
I work best in three to five hour chunks–thus I prefer to batch a bunch of meetings/calls together.
As a strong extrovert, I’m normally quite energetic after a face-to-face breakfast/lunch meeting, and like to have the next few hours free to work off that energy
I prefer to keep my evenings (after 5:30pm) unscheduled
My life is full, but my schedule is currently very flexible (I like to attend lunch-time frisbee twice a week)
Optimum schedule: Lunch T/R, breakfast any weekday
You have full access to my calendar, so schedule wherever works best–we’ll refine over time
(I work hard to keep my calendar updated all the time. If something needs to be rescheduled, I’ll let you know.)

Responding to Meeting Requests:

Currently, most meeting requests from others are people reaching out and looking to connect via phone…
I love talking with them, but want to make sure they’re serious about the call.
Ask which topics they would like to discuss, and put in the calendar description notes.
Here’s a sample e-mail I’ve used before:

Name,

This is YourFirstName, Jeff’s virtual assistant–I manage his calendar and schedule meetings.

Jeff said he appreciated your reaching out to him–he would love to chat with you.
He normally finds a 20-30 minute phone call most efficient.

However, Jeff doesn’t want to waste your time, so he asked that you e-mail me 3 potential discussion topics/questions.

Please send me your phone number, and two good days/times to call you this week (include time zone). I’ll set an appointment on Jeff’s calendar and get back to you.

By the way, I checked out your blog–very nice! [totally optional–only if true]

Thanks!

YourFirstName

Scheduling meeting requests I’ve confirmed:

If they respond, email Jeff the confirmed meeting time and location–BCC’ing often works well.
If they ignore you. Send a 2nd e-mail three days later, and ask if they saw it–often, they just forgot to respond. Ask me if you don’t hear back after two e-mails.
When you communicate with anyone, be clear you’re merely facilitating a meeting–you’re not trying to make Jeff seem busy/impressive. Never be pushy.
Here’s a great sample e-mail:

Options:

1. How about <meeting place> at <time>? (You initiate the meeting place.)

2. When would you like to get together? (Let him set the date.)

Face-to-face:

See the “How to add events to my calendar” wiki page
***Important*** Make the title “Meet w/NAME (PHONE #)”
Give them my Google Voice number
If we’ve never met, they can see a picture of me here: http://www.jeffwidman.com/blog/about/
While most lunch meetings last 1 hour, I prefer to build 1.5 hours into my calendar just in case things are going really well.
I’m not rich, but don’t want to meet in McDonald’s either–and I try to always pickup the check. (I find having them suggest a place solves the problem.)
I prefer lunch meetings within fifteen minutes drive from where I work (shorter is better–currently I work from home). This way I’m gone from work for a maximum of two hours.

Restaurant Preferences:

I enjoy all types of food–ask if they have a favorite.
(Sushi, and other semi-adventurous foods are always fun.)
I prefer to have them suggest a place–unless they’re from out of town, then ask me.
In my Google Docs, under Reference, there’s a spreadsheet of Bay Area Restaurants–please add to it when I visit a new place.
If they don’t know, call Jason XXXX or Andrea XXXXX (in my address book), tell ’em you’re my Assistant, and ask for ideas–they’re both foodies and know me/my style.

How to Add Someone to Jeff’s Contacts [Self-explanatory]

How to Process My E-mail–Ignore For Now [For now, I find it simplest to handle my own e-mail]

How to Use my Wiki [Explains my folder architecture]

Press Kit [I (used to) write irregularly for VentureBeat & Mashable. My assistant is responsible for getting press passes to events I want to attend–these links prove that I’m eligible]

http://delicious.com/jeffwidman/articles Examples of articles Jeff has written, useful when applying for press passes.

http://delicious.com/jeffwidman/press-kit for random stuff. (Website with articles/content by or that mention Jeff.)

http://delicious.com/jeffwidman/interviews for when people have questions about interviewing me.

Travel Information [Self-explanatory]

(Right now, I travel infrequently, so we’ll handle plane tickets, hotel rooms, etc on a case-by-case basis.)

All travel related emails–like itineraries, hotel rooms, conference registration numbers, etc–should be labeled with the “Travel Details” label–that label maps to a folder on Jeff’s iPod and phone so that Jeff can easily pull them up on his phone when he needs them.

Whenever you buy plane tickets, please forward the itinerary information to “plans@tripit.com” (Unless I specify the trip is to surprise someone–my Tripit account makes my travel plans publicly available.)

Tripit( www.tripit.com ) Account information can be found in Lastpass.

Jeff’s Contact Information [Lists contact information for me, my family, and my housemates]

Public Contact information-Can give to anyone:

E-mail: jeff@jeffwidman.com

Google Voice Number: (XXX) XXX-XXXX

Private – Do not give to anyone:

Skype:
GChat:

Jeff’s personal cell phone:
Dad’s cell phone:
Parent’s home phone:
Sister cell phone:

My home address:

Housemate #1 Cell Phone:
Housemate #2 Cell Phone

My parent’s home address:

Time where I live:
[I used an embeddable clock widget that shows the time and date where I live. Handy if your VA lives in a different time zone.]

My Accounts

I used to list my low-level usernames and passwords for different accounts across the web, now I just stick it all in a shared Lastpass folder.

I’m officially a college graduate!

Hey all–I officially graduated college in the middle of September. Thx for all the support, encouragement, and friendly ribbing… 🙂

PS: Out of all the time I’ve spent learning how to negotiate, this was by far the most useful 90 minutes: Stanford Podcast with Joel Peterson.