You’ve probably heard the term big data in the past several years. As the name implies, it’s about analyzing a lot of data at once: think an entire laptop full of data, and often much more. We’ve all had trouble with a single file that refuses to open, is laggy, or, the ultimate sin, crashes your computer before saving. So how in the world is anyone able to process data so much larger than that? The answer is software built for this specific task that leverages affordable hardware.
As the cost of both processing power and storage has dropped, big data applications (or abuses, in some cases) have become more feasible and potentially more profitable. To take advantage of this, technologies like the open source Hadoop ecosystem and Spark let you connect a bunch of computers together (known as horizontal scaling) and put them to work on a single large task. A good indicator of a technology’s popularity is how many companies include it in their tech stack; for Hadoop, it’s a lot.
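To make that concrete, here is a minimal sketch of what "a bunch of computers working on one task" looks like with PySpark. It assumes the pyspark package is installed; the cluster address, file path, and application name are hypothetical stand-ins, and you can swap the master setting for "local[*]" to try it on a single machine.

```python
# Minimal PySpark sketch: count words in a file too big for one machine.
# The cluster URL and file path below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("word-count-sketch")
    .master("spark://cluster-master:7077")  # hypothetical cluster address
    .getOrCreate()
)

# Spark splits the file into chunks, ships the chunks to worker machines,
# and each worker counts its share in parallel.
lines = spark.read.text("hdfs:///data/huge_log.txt")  # hypothetical path
words = lines.selectExpr("explode(split(value, ' ')) AS word")
counts = words.groupBy("word").count()

counts.show(10)  # only the small summary comes back to your machine
spark.stop()
```

The point isn’t the word count itself; it’s that you write one short program and the software decides how to split the work across however many cheap machines you’ve wired together.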
So instead of upgrading a single computer to something approaching a supercomputer (called vertical scaling, this time), companies can use software to connect cheap computers together. For some companies, it’s more cost-effective to rent computing time and storage from Amazon Web Services (AWS) or other cloud providers. The cloud is a separate explainer, though.