<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Kynx</title>
 <link href="http://kynx.github.io/atom.xml" rel="self"/>
 <link href="http://kynx.github.io/"/>
 <updated>2017-12-07T17:53:14+00:00</updated>
 <id>http://kynx.github.io</id>
 <author>
   <name>Matt Kynaston</name>
   <email>matt@kynx.org</email>
 </author>

 
 <entry>
   <title>Multi-tenant ETL with MySQL Proxy</title>
   <link href="http://kynx.github.io/2015/04/27/multitenant-etl-with-mysql-proxy/"/>
   <updated>2015-04-27T00:00:00+01:00</updated>
   <id>http://kynx.github.io/2015/04/27/multitenant-etl-with-mysql-proxy</id>
   <content type="html">&lt;p&gt;People can mean a number of different things when they say “multi-tenancy”. Pentaho have a (rather low quality) &lt;a href=&quot;https://www.youtube.com/watch?v=sDaDDXcV79E&quot;&gt;presentation&lt;/a&gt; of four different
use cases. Unfortunately they don’t tell you how to get any of them working.&lt;/p&gt;

&lt;p&gt;This article looks at the basic database layout and ETL design for a multi-tenanted data warehouse.&lt;!--more--&gt; It may be of use if some of these ring true:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Your tenant source databases are identically structured&lt;/li&gt;
  &lt;li&gt;You need a “master” data warehouse for high-level analysis across all tenants&lt;/li&gt;
  &lt;li&gt;Tenants will be accessing their data warehouses using 3rd party tools via ODBC / JDBC&lt;/li&gt;
  &lt;li&gt;You are using MySQL for your data warehouse&lt;/li&gt;
  &lt;li&gt;You have a single-tenant setup already that you want to roll out to other tenants&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The solution I’ve come up with evolved over time. I’ll first discuss the challenges thrown up by multi-tenant ETL, then look at how we met them. I’m not entirely sure it
isn’t a horrible hack, and there certainly are other ways of doing it. But no matter, it works.&lt;/p&gt;

&lt;h2 id=&quot;separating-tenant-data&quot;&gt;Separating tenant data&lt;/h2&gt;

&lt;p&gt;The key requirement in any multi-tenant reporting platform is that tenant data
is partitioned in such a way that there is no chance of one tenant accidentally (or maliciously) accessing another’s data.&lt;/p&gt;

&lt;p&gt;In practice this will mean either:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Having a single data warehouse and implementing some form of row-level access control&lt;/li&gt;
  &lt;li&gt;Having separate data warehouses for each tenant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MySQL has no concept of row-level access controls, but that doesn’t rule out putting a &lt;code class=&quot;highlighter-rouge&quot;&gt;TenantID&lt;/code&gt; column on each and every row. This could then be used to filter the data for reports.&lt;/p&gt;

&lt;p&gt;However, that approach has two problems: each and every report will need to include the &lt;code class=&quot;highlighter-rouge&quot;&gt;TenantID&lt;/code&gt; in its &lt;code class=&quot;highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause; and there is no way to filter the rows when the tenant is coming in over ODBC.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;TenantID&lt;/code&gt; column has its uses, which I’ll come back to further down. But for access control, unless your database natively supports row-level access, it is easier and safer to have separate data warehouses for each tenant:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;With a little bit of wiring Pentaho can switch the datasource it uses at user login: you can share reports between tenants without re-writing them&lt;/li&gt;
  &lt;li&gt;Bog-standard &lt;code class=&quot;highlighter-rouge&quot;&gt;GRANT&lt;/code&gt;s can control access for anyone coming in over ODBC&lt;/li&gt;
  &lt;li&gt;Many smaller data warehouses = better query performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;combining-tenant-data&quot;&gt;Combining tenant data&lt;/h2&gt;

&lt;p&gt;The next challenge comes when you need to report across all tenant data.&lt;/p&gt;

&lt;p&gt;If you are only combining some of the data, or are further transforming it for the specific uses it will be put to, this shouldn’t be a big deal: you simply create another ETL job that lifts and loads each tenant’s data to the new structure, taking care to create new surrogate keys on the dimensions and re-link the facts to these.&lt;/p&gt;

&lt;p&gt;We wanted all the data from all tenants. I’d already started on a single-tenant ETL for our guinea-pig client and didn’t fancy writing a whole new ETL process to populate the master data warehouse. I also wanted to be able to take a report written for one tenant and run it against all tenants with no modifications.&lt;/p&gt;

&lt;p&gt;If the existing ETL process could &lt;em&gt;simultaneously&lt;/em&gt; populate both the tenant and master data warehouses I thought development would be faster and maintenance simpler.&lt;/p&gt;

&lt;h2 id=&quot;the-problem-of-surrogacy&quot;&gt;The problem of surrogacy&lt;/h2&gt;

&lt;p&gt;Initially I thought it would be simple: my transform would first do a &lt;a href=&quot;http://infocenter.pentaho.com/help/index.jsp?topic=%2Fpdi_user_guide%2Fconcept_pdi_usr_dimension_lookup_update.html&quot;&gt;Dimension Lookup / Update&lt;/a&gt; on the master database, then an Update on the tenant using the surrogate key obtained from the previous step. But it’s not that simple: to keep things in sync you also nee&lt;/p&gt;

&lt;p&gt;I looked at a number of ways to achieve this simultaneous population of two data warehouses on the DB side. These included:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Tenant data warehouses composed only of views on the master that filtered by &lt;code class=&quot;highlighter-rouge&quot;&gt;TenantID&lt;/code&gt;. The ETL would then simply run against the master&lt;/li&gt;
  &lt;li&gt;Using MySQL’s &lt;a href=&quot;https://dev.mysql.com/doc/refman/5.5/en/merge-storage-engine.html&quot;&gt;MERGE&lt;/a&gt; storage engine to combine all identical tenant tables into a single master table&lt;/li&gt;
  &lt;li&gt;Hacking the PDI steps so I could specify a primary and secondary connection they would write to&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first works, but the views require maintenance and lose the performance benefits of multiple small(ish) data warehouses. A look at the amount of code I’d be modifying ruled out the third.&lt;/p&gt;

&lt;p&gt;The second had merit, but unfortunately you cannot specify which of the underlying tables in the &lt;code class=&quot;highlighter-rouge&quot;&gt;MERGE&lt;/code&gt; gets the update. However, it does illustrate the key problem I kept encountering.&lt;/p&gt;

&lt;p&gt;Let me explain. For my scheme to work, surrogate keys would have to be unique across all tenant databases.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/public/images/2015-04-27/technical_key.png&quot; alt=&quot;Technical key&quot; class=&quot;right-image keyline&quot; /&gt;
PDI’s &lt;a href=&quot;http://infocenter.pentaho.com/help/index.jsp?topic=%2Fpdi_user_guide%2Fconcept_pdi_usr_dimension_lookup_update.html&quot;&gt;Dimension Lookup / Update&lt;/a&gt; step gets the next surrogate key by doing a &lt;code class=&quot;highlighter-rouge&quot;&gt;SELECT MAX(id)&lt;/code&gt; on the same table you’re inserting into. There isn’t an option to get it from somewhere else, like a central repository of tables and latest values (although this was suggested in &lt;a href=&quot;http://jira.pentaho.com/browse/PDI-7017&quot;&gt;a JIRA&lt;/a&gt; dating back to 2011).&lt;/p&gt;
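&lt;p&gt;To see why that matters, here is a minimal sketch of the key allocation in Python. It is not PDI code: plain lists stand in for the dimension tables, and the function names are made up for illustration.&lt;/p&gt;

```python
def next_key(keys):
    """Mimic SELECT MAX(id) + 1 on one table; an empty table starts at 1."""
    return max(keys, default=0) + 1


def insert_row(tenant_keys, master_keys, ask_master):
    """Allocate a key from the master or the tenant table, then write to both."""
    source = master_keys if ask_master else tenant_keys
    key = next_key(source)
    tenant_keys.append(key)
    master_keys.append(key)
    return key


def load(ask_master):
    """Insert two rows for tenant A, then two for tenant B; return the master keys."""
    tenant_a, tenant_b, master = [], [], []
    for tenant in (tenant_a, tenant_a, tenant_b, tenant_b):
        insert_row(tenant, master, ask_master)
    return master


print(load(ask_master=False))  # keys taken from each tenant table: [1, 2, 1, 2]
print(load(ask_master=True))   # keys taken from the master table: [1, 2, 3, 4]
```

&lt;p&gt;When each tenant table hands out its own keys the master ends up with duplicates. Asking the master for the maximum keeps the keys unique across every tenant.&lt;/p&gt;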

&lt;p&gt;I briefly contemplated introducing a step that inserted a dummy row in the dimension with the next surrogate key before the transform started, then another step that deleted it afterwards, but that seemed ugly and brittle.&lt;/p&gt;

&lt;p&gt;What I needed was a way to intercept the queries on the wire. Then I could:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Redirect all &lt;code class=&quot;highlighter-rouge&quot;&gt;SELECT MAX(SurrogateKey) FROM dimension&lt;/code&gt; queries to the master&lt;/li&gt;
  &lt;li&gt;Run all &lt;code class=&quot;highlighter-rouge&quot;&gt;INSERT&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;DELETE&lt;/code&gt; queries on both tenant and master&lt;/li&gt;
  &lt;li&gt;Leave any other &lt;code class=&quot;highlighter-rouge&quot;&gt;SELECT&lt;/code&gt;s to run on just the tenant&lt;/li&gt;
&lt;/ul&gt;
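&lt;p&gt;Those three rules could be expressed as a small classifier. This is a rough sketch in Python rather than the Lua that MySQL Proxy runs, and the route names and regular expressions are my assumptions for illustration, not lifted from the real script.&lt;/p&gt;

```python
import re

# Hypothetical route labels; the real proxy script may structure this differently.
MASTER_ONLY = "master"
BOTH = "tenant+master"
TENANT_ONLY = "tenant"

# The key lookup PDI issues, e.g. SELECT MAX(CustomerKey) FROM dim_customer
MAX_KEY_QUERY = re.compile(r"^\s*SELECT\s+MAX\s*\(", re.IGNORECASE)
# Any statement that changes data
WRITE_QUERY = re.compile(r"^\s*(INSERT|UPDATE|DELETE)\b", re.IGNORECASE)


def route_query(sql):
    """Return where a query should run under the three rules above."""
    if MAX_KEY_QUERY.match(sql):
        return MASTER_ONLY
    if WRITE_QUERY.match(sql):
        return BOTH
    return TENANT_ONLY


print(route_query("SELECT MAX(CustomerKey) FROM dim_customer"))  # master
print(route_query("INSERT INTO dim_customer VALUES (1, 'Acme')"))  # tenant+master
print(route_query("SELECT * FROM dim_customer"))  # tenant
```

&lt;p&gt;Matching on the start of the statement is crude, but PDI generates its queries in a predictable shape, so a simple pattern goes a long way.&lt;/p&gt;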

&lt;h2 id=&quot;enter-mysql-proxy&quot;&gt;Enter MySQL Proxy&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/public/images/2015-04-27/mysql_proxy_switchboard.jpg&quot; alt=&quot;Dolphin at the switchboard&quot; class=&quot;left-image&quot; /&gt;
It turns out MySQL has a tool that is designed for precisely this kind of foolishness: &lt;a href=&quot;http://dev.mysql.com/doc/mysql-proxy/en/&quot;&gt;MySQL Proxy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It listens on port 4040, sitting between the client application (PDI) and the MySQL server. It intercepts the queries and feeds them to a custom &lt;a href=&quot;http://www.lua.org&quot;&gt;Lua&lt;/a&gt; script, which can re-write queries, send them to different servers and manipulate the results. Power indeed.&lt;/p&gt;

&lt;p&gt;Now I’d never run into Lua before, but it’s used in stuff from digital TVs to World of Warcraft and Angry Birds. And it turns out Lua is pretty easy to learn. The &lt;a href=&quot;https://dev.mysql.com/doc/internals/en/client-server-protocol.html&quot;&gt;MySQL wire protocol&lt;/a&gt; isn’t (what wire protocol is?), but the documentation is straightforward.&lt;/p&gt;

&lt;p&gt;After a day scratching my head, here’s what I came up with:&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/kynx/37429efa154531c67cc5.js?file=mt-proxy.lua&quot;&gt; &lt;/script&gt;

&lt;p&gt;OK, that’s quite a chunk of code. I’m not going to go through it line-by-line, but hopefully the comments give some idea what’s going on. If not, feel free to pester me in the feedback section below. The key parts are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;get_mt_db()&lt;/strong&gt; This returns the name of the master database based on the name of the tenant. Adapt to your needs.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;read_query()&lt;/strong&gt; This is where we intercept the incoming query and decide what to do with it, depending on the type of command. Whenever we append a query we pass an ID as the first parameter. This is used when MySQL passes the results back to…&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;read_query_result()&lt;/strong&gt; This uses the ID to determine where the query was run and takes the appropriate action with the results and errors it finds.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;/h2&gt;

&lt;h3 id=&quot;database-connections&quot;&gt;Database connections&lt;/h3&gt;

&lt;p&gt;The first thing to do (if you haven’t already) is &lt;a href=&quot;http://wiki.pentaho.com/display/EAI/Named+Parameters&quot;&gt;parameterize your database connections&lt;/a&gt; in PDI so that the port number can be changed on the fly. This will enable you to switch from sending queries through the proxy to querying the database directly.&lt;/p&gt;

&lt;p&gt;In practice you’ll probably want two separate connections: one for your normal ETL work - which uses the proxy - and another for one-off jobs that are designed to work on one database at a time, such as DDL queries.&lt;/p&gt;

&lt;h3 id=&quot;tenantid&quot;&gt;TenantID&lt;/h3&gt;

&lt;p&gt;I mentioned adding a &lt;code class=&quot;highlighter-rouge&quot;&gt;TenantID&lt;/code&gt; column to all tables in your data warehouses while discussing row-level access controls. You will want to do this anyway so you can identify where data came from.&lt;/p&gt;

&lt;p&gt;There are two approaches. The first is to use a surrogate key and place it on every single table. However I’m a little more semantic, and put a &lt;code class=&quot;highlighter-rouge&quot;&gt;SourceDB&lt;/code&gt; varchar column on just the dimensions so I can see where the data came from just by browsing the table. The downside of that approach is that to identify the provenance of a particular fact it needs to be joined on a dimension.&lt;/p&gt;

&lt;p&gt;In either case you will need a centralised database of tenants. In our setup this holds details such as which repository they use (ie “production” or “staging”) and their timezones (for scheduling ETL runs out of working hours), and it also drives the ETL process itself.&lt;/p&gt;

&lt;h3 id=&quot;creating-new-tenants&quot;&gt;Creating new tenants&lt;/h3&gt;

&lt;p&gt;Hopefully you’re going to be adding new tenants all the time! If so, you’ll want to automate the process. To tackle this I created a new Kettle job that:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Creates an empty tenant database&lt;/li&gt;
  &lt;li&gt;Copies the structure of the tables / views / etc from the master&lt;/li&gt;
  &lt;li&gt;Populates tenant dimensions with “special” rows (ie “Unknown”, “Not found”)&lt;/li&gt;
  &lt;li&gt;Populates any common tables, like Date dimensions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is likely to be pretty specific to your data warehouse structure, but if there’s any interest I’ll post some examples.&lt;/p&gt;

&lt;h3 id=&quot;maintenance&quot;&gt;Maintenance&lt;/h3&gt;

&lt;p&gt;Our setup requires that all tenant databases and the master database share exactly the same structure. Any changes to the schema need to be performed across all databases. Again, a centralised database of tenants will help automate this process.&lt;/p&gt;
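&lt;p&gt;As a rough illustration, a schema change can be rendered once per database from the list of tenants. The database names, table and column below are hypothetical, and in practice the list would come from the centralised tenant database.&lt;/p&gt;

```python
def schema_change_statements(databases, ddl_template):
    """Render one DDL statement per database, in the order given."""
    return [ddl_template.format(db=db) for db in databases]


# Hypothetical names; the master goes first so a failure shows up early.
databases = ["master_dw", "tenant_a_dw", "tenant_b_dw"]
statements = schema_change_statements(
    databases,
    "ALTER TABLE {db}.dim_customer ADD COLUMN Region VARCHAR(64)",
)

for statement in statements:
    print(statement)
```

&lt;p&gt;Statements like these should go over the direct connection rather than through the proxy, since each one targets a single database.&lt;/p&gt;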

&lt;p&gt;Any DBA grey beard will tell you that when you store the same piece of information in two places, sooner or later one of them will be wrong. The data-warehouse theocracy merrily cock a snoot at such normalised orthodoxy, but the grey beards have got a point.&lt;/p&gt;

&lt;p&gt;At some point something &lt;em&gt;will&lt;/em&gt; go wrong, and your master data warehouse will get out of sync with the tenants. This will happen whatever method you use to populate the master. You will need some tools to identify any differences and, in the worst case, rebuild the master from all the tenants. One advantage of having unique surrogate keys is that the latter can be as simple as &lt;code class=&quot;highlighter-rouge&quot;&gt;INSERT INTO table SELECT * FROM tenantdw.table&lt;/code&gt;.&lt;/p&gt;
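&lt;p&gt;Here is a small sketch of that rebuild. It uses Python with SQLite standing in for MySQL, so attached in-memory databases play the part of the tenant data warehouses. The table, column and tenant names are invented for the example.&lt;/p&gt;

```python
import sqlite3

# Autocommit mode keeps ATTACH out of any implicit transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
ddl = "CREATE TABLE {db}.dim_customer (CustomerKey INTEGER PRIMARY KEY, Name TEXT)"
conn.execute(ddl.format(db="main"))  # "main" plays the master data warehouse

# Hypothetical tenants whose surrogate keys are already unique across databases.
tenants = {
    "tenant_a": [(1, "Acme"), (2, "Bolt")],
    "tenant_b": [(3, "Crow"), (4, "Dove")],
}
for db, rows in tenants.items():
    conn.execute("ATTACH DATABASE ':memory:' AS " + db)
    conn.execute(ddl.format(db=db))
    conn.executemany("INSERT INTO " + db + ".dim_customer VALUES (?, ?)", rows)

# Rebuild the master: empty it, then copy every tenant's rows across unchanged.
conn.execute("DELETE FROM main.dim_customer")
for db in tenants:
    conn.execute("INSERT INTO main.dim_customer SELECT * FROM " + db + ".dim_customer")

master_keys = [row[0] for row in conn.execute(
    "SELECT CustomerKey FROM main.dim_customer ORDER BY CustomerKey")]
print(master_keys)  # [1, 2, 3, 4]
```

&lt;p&gt;If two tenants shared a key, the copy would fail on the primary key, which is exactly the situation the unique surrogate keys are there to avoid.&lt;/p&gt;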

&lt;h2 id=&quot;pros-and-cons&quot;&gt;Pros and Cons&lt;/h2&gt;

&lt;p&gt;If I haven’t scared you off by now, here’s what I’ve learned from pushing this puppy out.&lt;/p&gt;

&lt;h3 id=&quot;good-bits&quot;&gt;Good bits&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Separates multi-tenancy from the ETL process entirely: simply change the port number for your connection and you’re back to a single-tenant ETL process&lt;/li&gt;
  &lt;li&gt;No need for additional ETL jobs to populate master data warehouse&lt;/li&gt;
  &lt;li&gt;Lua script could be easily adapted so master data warehouse resides on another server from the tenant, again with no change to the ETL&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;not-so-good-bits&quot;&gt;Not so good bits&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;MySQL Proxy is officially “alpha” software. In practice this has not been an issue for us, and it is in use on some fairly large sites for jobs such as read/write splitting&lt;/li&gt;
  &lt;li&gt;Introduces some latency (~400 microseconds) to every request&lt;/li&gt;
  &lt;li&gt;Introduces another layer in the process: one more thing to go wrong, extra knowledge required before you can hack&lt;/li&gt;
  &lt;li&gt;The ETL runs for each tenant cannot run in parallel or you will get duplicate IDs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For us, the last point is probably the biggest drawback. It’s not a killer yet, but as we scale I anticipate it becoming more of an issue. Over the last year I’ve learned a few ETL tricks that would make populating the master in a more conventional, PDI-only way less of a maintenance burden. I’ll try and write those up in a future post.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Hello Crow</title>
   <link href="http://kynx.github.io/2015/04/26/hello-crow/"/>
   <updated>2015-04-26T00:00:00+01:00</updated>
   <id>http://kynx.github.io/2015/04/26/hello-crow</id>
   <content type="html">&lt;p&gt;So here it is, a blog by me. Over the years I’ve been coding (um… many, many too many years) I’ve relied heavily on the gems wonderful people have found the time to post. All along I’ve meant to start one myself. Hell, the world needs my genius. But maybe I’m just shy.
&lt;!--more--&gt;&lt;/p&gt;

&lt;p&gt;So today I decided to get to grips with this &lt;a href=&quot;http://jekyllrb.com/&quot;&gt;Jekyll&lt;/a&gt; thing that github.io uses for blogging and grabbed the &lt;a href=&quot;https://github.com/poole/lanyon&quot;&gt;simplest theme&lt;/a&gt; I could find. And here we are, up and running in minutes.&lt;/p&gt;

&lt;p&gt;OK, so the colours are a bit stark. But then again, suitably crow-like. And you don’t want to see what happens when I try and get artistic.&lt;/p&gt;

&lt;h2 id=&quot;whats-it-all-about&quot;&gt;What’s it all about?&lt;/h2&gt;

&lt;p&gt;Over the last year I’ve been getting the BI side of &lt;a href=&quot;http://www.claritum.com&quot;&gt;Claritum&lt;/a&gt; up-and-running. Previously we had a fairly expensive &lt;a href=&quot;http://www.microstrategy.com/&quot;&gt;MicroTragedy&lt;/a&gt; around reporting. This time round we had a very limited budget, but at least I’d read up on data warehousing and thought I knew the direction we should take.&lt;/p&gt;

&lt;p&gt;We settled on &lt;a href=&quot;http://www.pentaho.com/&quot;&gt;Pentaho&lt;/a&gt;. It’s open source and promised all the tools we would need in one place.&lt;/p&gt;

&lt;p&gt;One key requirement that Pentaho doesn’t offer out of the box is multi-tenancy - the ability to host multiple clients’ data in a single instance: this lets a client report on their own data, but not see stuff from other tenants.&lt;/p&gt;

&lt;p&gt;There isn’t a whole lot out there about how to do that, but we got there eventually.&lt;/p&gt;

&lt;p&gt;I’m a back-end guy. I couldn’t tell you how to make a pretty dashboard if I tried. But maybe if I can brighten some of the codey nerdy bits, someone will give me a biscuit.&lt;/p&gt;

&lt;p&gt;So that’s what I’ll be writing about to start with. If it goes well I’ll let rip with some other obsessions later.&lt;/p&gt;

&lt;h2 id=&quot;you-need-help&quot;&gt;You need help…&lt;/h2&gt;

&lt;p&gt;Indeedy. Get that thing out your pocket and point it over here. If the layout looks funny on your device, let me know in the comments below.&lt;/p&gt;
</content>
 </entry>
 

</feed>
