SlideShare today announced the biggest change since we started: we are now rendering presentations and documents using HTML5 instead of Flash. This is a milestone. Five years ago, it was impossible to build something like SlideShare or YouTube without Flash. But the web has finally caught up.
This was the biggest engineering project in SlideShare’s history. Much of the SlideShare engineering team has been working on it around the clock for the last six months. As we have learnt over the past five years, people are picky about how their presentations look. Getting the fonts and the text placement to look exactly right across all supported browsers was a real engineering challenge. So we’re happy to finally be able to see this on SlideShare.net.
Ditching Flash for HTML5 feels like the right choice for us for a number of engineering reasons.
Font handling was the biggest challenge. We had to build support for rendering arbitrary fonts in your browser, including fonts that are not installed on the client. If you invent a new font and upload a PDF that uses it, it should still render perfectly on SlideShare. Whoa!
Placing the text is very tricky because of differences between browsers, differences between fonts (ligature handling, for example), and several other complexities. To illustrate: the PDF coordinate system starts in the bottom left, while HTML starts in the top left. PDFs use points; in HTML you get your choice of unit, but no two browsers agree on how precise any particular unit is! The largest problem we face with placement is normalization. We spent a lot of time finding the magic combination of ems, percentages and zoom that gives us correct placement across the web.
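To make the coordinate flip concrete, here is a minimal sketch of mapping a PDF text box (origin bottom left, units in points) to CSS offsets (origin top left). It is not our production code: the names PdfBox and pdf_box_to_css_percent are made up for illustration, and it uses percentages only, which is a simplification of the em, percentage and zoom mix described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfBox:
    """A rectangle in PDF space (hypothetical type, for illustration only)."""
    x: float       # left edge, in points from the left of the page
    y: float       # bottom edge, in points from the BOTTOM of the page
    width: float   # in points
    height: float  # in points


def pdf_box_to_css_percent(box: PdfBox, page_width: float, page_height: float) -> dict:
    """Convert a PDF box to CSS-style left/top/width/height percentages.

    Percentages are relative to the page container, so the result does not
    depend on how a given browser rounds px, pt or em values.
    """
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")

    # PDF measures y upward from the bottom; CSS measures top downward.
    # The top edge of the box in PDF space is box.y + box.height.
    top_in_points = page_height - (box.y + box.height)

    return {
        "left": box.x / page_width * 100,
        "top": top_in_points / page_height * 100,
        "width": box.width / page_width * 100,
        "height": box.height / page_height * 100,
    }
```

For example, on a 612 x 792 point page, a box whose top edge sits 39.6 points below the top of the page comes out with a top offset of about 5%, no matter which unit the browser prefers.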
We also built a system to detect variance between an image of the HTML output and an image generated directly from the document. If there’s more than a certain amount of variance, we consider that an error and we won’t serve that page as HTML5. Instead we’ll serve a PNG image of the page when that page is requested. There was some hard-core computer vision involved in the error-handling system. The way we look at it, we want to serve HTML5, but not at the expense of a document that looks bad and disappoints the author.
Our conversion stack runs on Amazon EC2 and is configured and managed by Puppet. We’ve been using EC2 for our conversion stack for years, so we’re old hands at that stuff. For this new system, we started out with a number of different types of servers (a font extractor, a font generator, etc.). What we found out is that the coordination time between different machines (using Amazon SQS) and the IO time (using S3) were a huge bottleneck. So our architecture for this new system is more reminiscent of the Netflix “Rambo” architecture. Each box is a self-contained system that can do the entire job of conversion, with no help from anyone.
As we speak, an army of hundreds of Amazon EC2 instances is crunching away at converting the *millions* and *millions* of presentations and documents that have been uploaded to SlideShare over the last five years to HTML5. New documents will automatically be converted to HTML5 from now on. We hope to have the transition complete by the end of the year (maybe sooner, but no promises!). At that point all SlideShare content will be served as HTML5.
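The fallback decision can be sketched roughly like this. It is a deliberately naive stand-in for the real comparison, which is considerably more involved: images here are plain lists of grayscale rows, and the function names, the per-pixel tolerance and the 2% threshold are all assumptions chosen for the example.

```python
def mismatch_fraction(rendered, reference, pixel_tolerance=16):
    """Fraction of pixels whose grayscale values differ by more than pixel_tolerance.

    Both images are lists of rows, each row a list of ints in 0..255.
    """
    same_shape = len(rendered) == len(reference) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(rendered, reference)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")

    total = sum(len(row) for row in reference)
    if total == 0:
        raise ValueError("images must not be empty")

    differing = sum(
        1
        for row_a, row_b in zip(rendered, reference)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > pixel_tolerance
    )
    return differing / total


def choose_format(rendered, reference, max_mismatch=0.02):
    """Serve HTML5 only if the HTML render is close enough to the reference image."""
    if mismatch_fraction(rendered, reference) <= max_mismatch:
        return "html5"
    return "png"  # fall back to the image generated directly from the document
```

The point of the design is that the check runs per page: one badly placed slide falls back to an image without dragging the rest of the document with it.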