Despite the amazing advancements of computing technologies in recent years, processing and displaying large amounts of data dynamically is still a daunting, complex task. However, a smart approach with a good algorithmic foundation can enable things that were considered impossible before.
Let’s see if Mapbox GL JS can handle loading a 106 MB GeoJSON dataset of US ZIP code areas with 33,000+ features shaped by 5.4+ million points directly in the browser (without server support):
Wait, what?! A few seconds loading the data, and you can browse the whole data set smoothly and seamlessly. But how exactly does that work? Let’s find out.
Slicing vector tiles
It isn’t possible to render such a crazy amount of data in its entirety at 60 frames per second, but luckily, we don’t have to:
- at lower zoom levels, shapes don’t need to be as detailed
- at higher zoom levels, a lot of data is off-screen
The best way to optimize the data for all zoom levels and screens is to cut it into vector tiles. Traditionally, this is done on the server, using tools like Mapnik and PostGIS.
Could we create vector tiles on the fly, in the browser? Specifically for this purpose, I wrote a new JavaScript library — geojson-vt.
It turned out to be crazy fast, with its usefulness going way beyond the browser:
- It powers GeoJSON rendering in Mapbox GL JS.
- It was ported to C++14 to power on-the-fly point and polyline rendering on mobile devices with Mapbox Mobile.
- It works on a server with Node.js, so we’re experimenting with using it in our server-side data-processing pipeline.
How GeoJSON-VT works
Preprocessing
Generating tiles involves many computationally expensive steps:
- projecting geographic coordinates into screen coordinates
- optimizing polyline and polygon detail for each zoom level
- filtering out features that are too small for a particular zoom
- clipping data to tile extents
GeoJSON-VT performs these steps in order: it projects the entire dataset and calculates simplification before creating tiles. This way, the creation of many tiles over many zoom levels can be run quickly and without duplication of work.
The first step of preprocessing is projecting all coordinates into
[0..1, 0..1]
range with the Web Mercator projection. Later, we can get the tile screen coordinates with a simple multiplication plus rounding:screen_x
= [tile_size
× (2zoom x
− tile_x
)]screen_y
= [tile_size
× (2zoom y
− tile_y
)]
The next step is calculating simplification data that is later used to reduce detail in each tile. It uses a modified Ramer-Douglas-Peucker simplification algorithm that recursively finds the most “significant” points in a polyline and discards unnecessary detail.
This is the same technique used in Leaflet and simplify-js, but instead of immediately simplifying data, GeoJSON-VT marks each vertex with an importance value, so it can be quickly included or excluded at a zoom level.
Because pixel distances scale linearly with 2zoom, we can run simplification once for the data and use the same values to optimize detail on every zoom level, filtering out unnecessary vertices almost instantly.
Then we calculate the area of each polygon ring and the length of each polyline. Areas and lengths are scaled with zoom level in the same way as distances, which allows us to filter out geometries that are too tiny to display on any particular zoom level using these pre-calculated values in linear time.
Finally, we save the bounding box of each feature, which will allow us to do clipping much faster.
Slicing the data into tiles
Slicing tiles involves clipping data to a square extent, but we can’t afford running a clipping algorithm on the whole dataset for every single tile — that would be too slow.
To make it fast, GeoJSON-VT uses a top-down, divide & conquer approach to clipping:
- Slice the top
z0
tile into fourz1
tiles. - Slice each
z1
tile into fourz2
tiles. - Repeat recursively.
This makes slicing extremely fast, because each tile only needs to clip the features in its parent tile, rather than the entire dataset. But we can do even better than that.
Here’s how the most common rectangle clipping algorithm, Sutherland-Hodgeman, usually works:
To cut a tile out of a parent zoom tile, you need to cut the features by 4 half-planes formed by each of the axis-parallel red lines. For each line, we need to loop through all the points from the previous cut.
Note that the tile cut is not exactly a quarter of the parent tile, because each tile needs to have a small buffer that goes beyond what we display on a map to solve rendering issues (e.g. Mapbox Studio Classic buffers each 512px tile by 8 pixels in each direction by default).
Let’s roughly calculate how long it would take to clip out 4 tiles relative to the amount of data in the parent tile:
4 × (n + ½ n + ¼ n + ¼ n) = 8n
We can cut 4 tiles out of a parent tile in a smarter way. To do that, I made a simple modification of the half-plane clipping algorithm, making it clip shapes between two parallel-axis lines in one go: cutting out a “stripe”. This way, we can cut two halves with two vertical stripes first, then cut 2 tiles out of each half with two horizontal stripes:
Let’s see how this compares to the previous approach:
n + n + 4 × ½ n = 4n
So we now have roughly 2 times faster clipping of each tile.
Finally, we can use the feature bounding boxes to our advantage. Imagine cutting the sample ZIP code data above into tiles on low zoom levels — for each tile, most zip code shapes will be either trivial accepts — included completely without clipping — or trivial rejects — throw away. If we have bounding boxes pre-calculated, we can handle trivial cases like this without rebuilding all geometries point-by-point, speeding everything up considerably.
Slicing on the fly
To serve vector tiles on the fly without a perceptible delay, ideally, we would generate all the vector tiles up to max zoom into cache, and then just serve the tiles instantly when requested. This may be fine for small datasets, but for anything bigger, the tile slicing routine will eventually run out of memory. On zoom level 15 alone, there are potentially 415 = ~1 billion tiles — that’s just too much to handle, especially for the browser.
So we can’t slice tiles to the max zoom on initial data load, but we can slice them up to a point where subsequent on-demand slicing takes so little time that the loading delay is imperceptible. GeoJSON-VT tackles this in several ways:
- Initially, the tiles are generated to certain zoom level, leaving everything else for on-demand slicing.
- When slicing tiles, we keep track of the number of points combined in every tile. Initial slicing stops when a tile goes below a certain threshold (100,000 points by default). If a tile contains few geometries, then its children can be generated quickly as needed.
- When drilling down to a particular tile, GeoJSON-VT only slices one tile into 4 children tiles per zoom through the zoom range, caching all sliced tiles. This way drilling down to neighboring tiles is very fast because a much smaller parent tile will be cached from the previous tile drill-down.
I found the initial slicing defaults (z5 and 100k points) to be the sweet spot for most sample datasets I tested the algorithm on — initial processing is relatively fast, while subsequent drill-down is pretty much instant.
We also need to minimize the amount of memory used in each step of the algorithm. This is achieved by several things:
- The original geometry is only kept in the highest-zoom tiles and is thrown away once a tile has been cut into 4 tiles of the next zoom level, keeping only the simplified data. This way the memory needed to store all the tiling data doesn’t grow exponentially when slicing deeper.
- All recursion is eliminated with iterative algorithms. Both simplification and tile slicing (recursive in nature) are implemented using a simple array queue processed in a loop. This avoids the situation where geometry that you no longer need gets piled up in call stacks: it’s garbage-collected properly.
- Transforming projected
(0..1, 0..1)
coordinates into screen coordinates always happens on demand, because building new arrays takes a lot of memory if done for all tiles as soon as they are sliced. - Solid squares filled to the tile extent (tiles fully inside a polygon) are never sliced further, because all of its children are identical to the parent regardless of zoom level.
Debugging and performance profiling
To debug and optimize the library, I wrote a simple debug page where you can drag and drop a GeoJSON file onto a barebones map where it is processed by GeoJSON-VT:
The library API also has a
debug
option with 2 logging levels, doing various levels of performance timings and logging during the tiling when specified. The GIF above demonstrates level 1, with level 2 outputting debug info for each individual tile. This helped tackling all the bottlenecks and tracking issues tremendously.Jank-free rendering in Mapbox GL JS
Despite the library being extremely fast, loading and processing big amounts of data still takes a significant amount of CPU time.
Browsers normally run JavaScript code in the same thread that handle all UI updates and user interaction. If JavaScript is busy doing heavy computation, the page will freeze and won’t accept any user input.
To avoid lags and freezes when loading and processing data, Mapbox GL JS utilizes Web Workers, a modern browser feature that permits running computationally-intensive code in the background without interrupting the main thread.
There’s one more problem though — transferring big amounts of data between a web worker and the main thread usually involves copying the data in memory, also causing the main thread to freeze.
To avoid that bottleneck, we take advantage of another HTML5 feature — Transferable Objects. It allows transferring data between threads instantly with zero copy if you pass the data in the form of typed arrays. This allows us to provide a smooth, jank-free experience despite plenty of heavy CPU work happening in the background.
Using the library
If you’re using Mapbox GL-based tools (either GL JS or Mapbox Mobile), you’re already using GeoJSON-VT under the hood.
You can use it manually for other applications too — all the complexity of the processing is hidden behind a very simple API:
var tileIndex = geojsonvt(data);
...
var tile = tileIndex.getTile(z, x, y);
The resulting tile is a simple JSON equivalent of Vector Tile, and can be rendered with ease.
Try it! We’re curious to see what other creative uses of the library you can come up with. I also encourage you to check out the geojson-vt repository and read through the code — it’s was carefully written to be as simple and clean as possible.
Thanks for bearing with me in this long technical article, and don’t hesitate to hit me up on Twitterif you have any questions or comments.
No comments:
Post a Comment