You can run MapReduce or Spark jobs directly against S3. Paths prefixed “s3://my-bucket/whatever”, assuming the S3 connector is on the classpath, work just like HDFS operations (albeit with weaker consistency guarantees).
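Since the only thing that changes from the job's point of view is the path, moving an existing job over is mostly a matter of rewriting URIs. Here is a minimal sketch of that rewrite. The helper name `to_object_store_uri` is made up for illustration, and the default `s3` scheme is an assumption: depending on your connector you may need `s3a` instead.

```python
from urllib.parse import urlsplit


def to_object_store_uri(path: str, bucket: str, scheme: str = "s3") -> str:
    """Rewrite an HDFS (or scheme-less) path so it points at a bucket.

    Hypothetical helper: the scheme to use depends on which S3
    connector is on your classpath.
    """
    parts = urlsplit(path)
    if parts.scheme not in ("", "hdfs"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    # Drop the NameNode host:port (the netloc) and keep only the path.
    key = parts.path.lstrip("/")
    return f"{scheme}://{bucket}/{key}"


if __name__ == "__main__":
    print(to_object_store_uri("hdfs://namenode:8020/data/events", "my-bucket"))
```

The job itself stays the same; only the input and output locations it is handed change.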
HDFS on-premise is great — because everything else sucks at scale — but there are almost no situations where HDFS is the right place to store persistent data in the Cloud:
- S3 (or any other object storage) is way, way cheaper than HDFS. HDFS is only performant with double or triple replication; this is way more expensive than object storage, especially if you transition cold objects to Glacier.
- Scaling down HDFS is arduous, painful, and buggy. DataNode decommissioning uh…. well, I’ve seen it finish a couple times. More often, I got bored, powered off the machine, and hoped that triple replication would clean up after me.
Do you really want to have to scale your HDFS DataNode count up and down with data size? Hint: no, you don’t.
- If you run HDFS, you’re going to be paying for EC2 instances even when you aren’t reading the data. If you power a significant HDFS deployment off and on again, your NameNode is going to take ages (well, up to half an hour) to reload its image from the last checkpoint. Ain’t nobody got time for that.
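To put a rough number on the replication point above, here is a back-of-the-envelope sketch of how much raw disk HDFS needs for a given amount of data. The function name and the `headroom` parameter (free space you keep so DataNodes don't fill up) are my own assumptions for illustration, not HDFS settings. No prices are included, so plug in your own.

```python
def raw_hdfs_capacity_tb(logical_tb: float, replication: int = 3,
                         headroom: float = 0.25) -> float:
    """Raw disk (TB) needed to hold `logical_tb` of data on HDFS.

    replication: copies of every block (3 is the HDFS default).
    headroom: fraction of raw disk kept free; an illustrative
        assumption, not something HDFS configures for you.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return logical_tb * replication / (1 - headroom)


if __name__ == "__main__":
    # 100 TB of data, triple replication, 25% of disk kept free.
    print(raw_hdfs_capacity_tb(100))
```

With object storage you pay for the logical size and the provider handles durability, so the multiplier above is cost you carry only on the HDFS side.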
Given all this — your usage of Hadoop might vary, but my default recommendation (and the configuration we used when we migrated our 2,500 node Hadoop deployment to GCP) would be to use each of those storage types, but for different layers (we used the GCP equivalents, but I’ll phrase things in AWS-speak):
- Instance Storage (on GCP, “local SSDs”) works great for MapReduce or Spark temporary shuffle space. During a large MapReduce or Spark sort, it’s highly unlikely that your data will all fit in memory. So make sure that MapReduce’s temp directory — where the data will spill — is a fast, SSD, instance store:
- Spills aren’t big data; they’re a small multiple of the output size of your largest job. Instance compute cost is going to dwarf this spending. Any time you spend blocked on IO during sorting (and yes, merging spills is going to absolutely ravage your disks) is probably going to cost you money overall. So just use SSDs.
- You don’t care about replication here, so EBS (volume storage) is overkill. If your cluster dies mid-job, just start the job from scratch. There’s no point in replicating data off-rack if you don’t care about losing it.
- Instance Storage is cheaper than volume storage. Cheaper, faster, better. Win/win/win.
- EBS (volume storage) is good for two things:
- System root volumes. Probably, you won’t have a choice.
- HDFS storage (“What? You just said not to use HDFS!”).
Ok, so technically, you can’t easily run Hadoop without a tiny bit of HDFS for log storage, JobConf uploading, Distributed Cache storage, and a few other fiddly metadata operations. And EBS is nifty here, because it means you can power down the entire cluster and not lose HDFS “data”.
For staging environments, which you only need to use intermittently, this is a nice cost-saving trick, since you’ll only have a couple tiny DataNodes — the EBS cost is almost irrelevant.
- S3 (object storage) for everything else. Aka, the actual data. Probably 99% of your total storage space.
Object storage is cheap. It is reliable. It’s massively configurable when it comes to backups, regional replication, access controls, etc etc etc. Just use it, and never worry about an HDFS NameNode failover again. Sleep easy, and let some poor AWS engineer get paged at 2am instead.
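As a rough illustration of the layering above, here is a sketch that builds the relevant Hadoop properties: shuffle and spill directories on instance SSDs, and the small HDFS on EBS. `yarn.nodemanager.local-dirs` and `dfs.datanode.data.dir` are real Hadoop property names. The mount points, directory layout, and function names are hypothetical, and this is not the exact configuration we ran.

```python
import xml.etree.ElementTree as ET

# Hypothetical mount points; substitute whatever your instances expose.
INSTANCE_SSD_MOUNTS = ["/mnt/ssd0", "/mnt/ssd1"]
EBS_MOUNT = "/mnt/ebs0"


def layered_storage_props(ssd_mounts, ebs_mount):
    """Return Hadoop properties per config file for the layered layout."""
    return {
        # Shuffle/spill space: fast, unreplicated instance storage.
        "yarn-site.xml": {
            "yarn.nodemanager.local-dirs": ",".join(
                f"{m}/yarn/local" for m in ssd_mounts
            ),
        },
        # The small HDFS for logs and job metadata: lives on EBS so it
        # survives powering the cluster down.
        "hdfs-site.xml": {
            "dfs.datanode.data.dir": f"{ebs_mount}/hdfs/data",
        },
    }


def render_site_xml(props):
    """Render a dict of properties as a Hadoop-style <configuration> file."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    layout = layered_storage_props(INSTANCE_SSD_MOUNTS, EBS_MOUNT)
    for filename, props in layout.items():
        print(filename)
        print(render_site_xml(props))
```

The third layer needs no cluster-side storage config in this sketch: the actual data lives in the bucket, and jobs are simply pointed at paths there for input and output.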
And one more point, which I truly cannot emphasize enough — the freedom to destroy and recreate your Hadoop cluster on a whim, without losing data, is incredibly liberating. If you put all your important data on S3, you can fix literally every Hadoop problem by nuking your cluster from orbit and starting from scratch.
No more missing blocks. No more corrupt NameNode checkpoints. No more JournalNodes out of sync. No more decommissioning-but-already-dead DataNodes.
Use S3, and use the pure, blissful freedom to be an engineer again, instead of an HDFS babysitter.