Microsoft recently made some big announcements around the SQL Server platform which contain some really interesting thinking on data structure, cloud, and visibility. ReadWrite Web explores the implications.
Of all the software technologies suited to getting a makeover that’s “about the experience,” you’d think databases would rank pretty far down the list. The database experience, if there is one, is typically about accuracy, reliability, and speed. Certainly Oracle’s frequent measurements (“5x,” “10x,” “20x” and so on) are all about those metrics.
But Microsoft has found an angle with respect to SQL Server 2012, the second round of announcements for which came this morning in Seattle. The new angle starts with multitouch, but then it runs deeper, touching on the larger problem of data getting fragmented and redundantly duplicated as it gets used and visualized.
On the surface, it would be nice for businesses to have a more flexible, adaptive system for visualizing large amounts of data and making correlative analyses. At a deeper level, the capability to share and collaborate by way of this visualization system could reduce, if not eliminate altogether, the unnecessary duplication and versioning that comes from embedding data in e-mails. That’s the two-stage goal described today for Microsoft Data Explorer, a touch-sensitive visualization tool that will orbit around the new SQL Server ecosystem.
Perhaps the most commonly used tool for visualizing data in an organization today is Excel. And as Quentin Clark, Microsoft’s own corporate vice president for database systems, tells RWW, that’s not really a good thing. When enterprise users see data from a large database in Excel, they show it to the workforce by cutting and pasting it into an e-mail message. Object Linking and Embedding, one of Microsoft’s oldest Windows technologies, gets cut out of the loop here. Inevitably, recipients save the spreadsheet fragment as an attachment.
Then employees make suggested changes and revisions. Suddenly versioning gets called into play. Fragments of the database abound, none of them in the database. Projections are made based on one or the other fragment. Those projections are loaded back into Excel. Charts are made from those projections, and get pasted into PowerPoint. Further revisions are suggested by way of PowerPoint and OneNote. And none of this involves the use of the original database or its own tools, beyond that initial cut-and-paste job.
“It’s not so much that we don’t want Excel in the loop,” says Clark about a product his team doesn’t manage. “We want to give IT tools to make data sharable, so [people] can stop using Excel as the way data and information gets flowed, over e-mail… [and] ensure that SharePoint can be used as a sharing and collaboration mechanism, as opposed to e-mail.”
Microsoft’s other hope is that Data Explorer will introduce more users to a concept that debuted with SQL Server 2008 R2. It currently goes by the name Managed Self-Service Business Intelligence, which already doesn’t sound like something you’d want to gift-wrap as an anniversary present. But it merges the company’s business intelligence (BI) tools with some of the visualization techniques that emerged from project “Crescent” and then became Data Explorer, with hooks that enable sharing views of the data through SharePoint without duplicating the data.
Does the cloud change the data?
Last week at Oracle OpenWorld, CEO Larry Ellison introduced his audience to several elements of “magic” that he explained could change the texture and behavior of databases as we know them (if they haven’t done so already). One of them involves the use of columnar storage, which twists the order in which the contents of tables are written by 90 degrees.
Picture a data table. The fields in tables are always written to a storage device in some sequence. Usually that sequence is record by record (“1,” “David,” “Strom,” “2,” “Joe,” “Brockmeier,” “3,” “Scott,” “Fulton…”). Columnar data changes this order (“1,” “2,” “3,” “David,” “Joe,” “Scott,” “Strom,” “Brockmeier,” “Fulton…”). This makes it a bit more difficult for query engines to draw analytical conclusions on the data. But in cases where the objective of a query is to read large quantities of fields in a column, if not the entire column, it expedites the process tremendously.
Depending upon the scale of the database and the types of analysis that need to be performed, the speed gain from changing to columnar could more than offset the speed loss in transaction processing and in write processes. Now, SQL Server 2012 enables a shift to columnar storage for selected tables.
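The layout difference described above can be sketched in a few lines of code. This is a generic illustration of row-oriented versus columnar serialization, reusing the article’s own example records; it is not a depiction of SQL Server’s actual on-disk format.

```python
# Illustrative only: row-oriented vs. columnar layout of the same table.
records = [
    (1, "David", "Strom"),
    (2, "Joe", "Brockmeier"),
    (3, "Scott", "Fulton"),
]

# Row-oriented storage: fields are written record by record.
row_layout = [field for record in records for field in record]
# [1, 'David', 'Strom', 2, 'Joe', 'Brockmeier', 3, 'Scott', 'Fulton']

# Columnar storage: the table is "rotated" so each column is contiguous.
column_layout = [field for column in zip(*records) for field in column]
# [1, 2, 3, 'David', 'Joe', 'Scott', 'Strom', 'Brockmeier', 'Fulton']

# A full-column scan (say, aggregating one field across every record) now
# reads one contiguous run instead of skipping through every record.
ids = column_layout[0:len(records)]
```

In the row layout, reading just the ID column means stepping over two unwanted fields per record; in the column layout, the IDs sit side by side, which is exactly the access pattern analytical queries favor.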
“Instead of compressing things by rows and pages, this [technique] looks at a large set of data,” explains Clark. “Because these data warehousing tables have a lot of repeating data, when you look down columns, you can take advantage of that.” The technology was first introduced for PowerPivot for Excel during SQL Server 2008 R2’s reign, using an alternative storage engine that flipped the contents of very large worksheet tables on their side. Once flipped, it became easier to compress columns with multiple, repeated fields. Imagine a personnel directory, for instance, where there are multiple “Smiths.” It’s much more efficient to compress “Smith,” “Smith,” “Smith…” than a table full of probably unique records adjacent to one another.
“Unlike a lot of our competitors’ columnar accelerators, which are separate products that have to be managed separately, we’re building an index type into the database. So there’s no separate thing to manage… You configure it to be indexed this way, and off you go. It’s a configuration operation, not a development or architectural operation.”
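Clark’s “Smith, Smith, Smith” point can be demonstrated with run-length encoding, the simplest compression scheme that exploits repeated adjacent values. This is a sketch of the principle only; SQL Server’s columnstore indexes use their own, more sophisticated encodings.

```python
from itertools import groupby

# A column of surnames, as in the article's personnel-directory example.
surname_column = ["Smith", "Smith", "Smith", "Jones", "Smith", "Smith"]

def run_length_encode(values):
    """Collapse consecutive duplicates into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

encoded = run_length_encode(surname_column)
print(encoded)  # [('Smith', 3), ('Jones', 1), ('Smith', 2)]
```

Six stored strings shrink to three pairs. Stored row by row, those same “Smith” values would be separated by the rest of each record’s fields, and no run of duplicates would ever form, which is why the column-wise flip matters for compression.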
The good kind of fragmentation
Databases are getting larger, in large part (as Microsoft’s Doug Leland explained yesterday) due to the shifting ratio of machine-generated data to human-generated data. The NoSQL movement is grounded on the principle that data structures that make ordinary databases transactional and query-ready serve to stifle scalability.
But Microsoft’s Quentin Clark disagrees with the idea that removing structure from data enables performance increases in all scenarios. As an alternative, Microsoft announced today a sharding technique, where partitions of complete records are divided horizontally (as opposed to by column) and stored in separate regions. The programming pattern for “shard-ability” is being called “Federation” (for now), and will apply to SQL Azure-hosted databases.
As Clark explains to us, when databases are designed for the appropriate constructs, they will automatically benefit from sharding in ways that are transparent to the customer. “You’re inherently programming into a partition form, which is what a lot of these new SQL databases do, so you get that scalability and performance built in to help you manage the shards, know where the hotspots are, move them around, split them, and so on.
“It has to do with the programming pattern; it’s not really a performance issue,” Clark continues. “It’s really about how fast you get moving, and how sure you are of scaling once you’ve programmed to a certain pattern.”
Just over the past few years of Microsoft customer briefings in which Clark has taken part, he tells us, the polarity of the cloud discussion has shifted. Whereas he used to have to bring the topic up himself (“Have you thought about the cloud?”), today attendees won’t let him talk for ten minutes before forcing the topic on him. “The conversation has shifted with customers,” he remarks. “One of the unique things we bring to the table is our collection of products — whether on-premise, or appliances, or from the cloud — that allows you to have choice… You develop one thing, and then you get deployment freedom from that.”