Amazon Web Services Inc. (AWS) announced that its Big Data processing service now supports additional tools from the open-source Apache ecosystem.
According to a recent AWS blog post, the Amazon Elastic MapReduce (EMR) Web service now supports Apache Tez and Apache Phoenix. AWS describes Tez as dataflow-driven orchestration of data processing tasks, and says Phoenix provides “fast SQL for OLTP, operational analytics.”
Amazon EMR offers managed Hadoop for Big Data processing on Amazon EC2 compute instances. It also supports many open source projects in the Hadoop ecosystem, such as Apache Spark. The addition of Phoenix and Tez expands this portfolio.
“Tez runs on Apache Hadoop YARN,” AWS said. “Tez provides a set of APIs for dataflow definition that allows you to define a DAG (Directed Acyclic Graph) of data processing tasks.” According to AWS, Tez can be faster than Hadoop MapReduce and can be used with both Hive and Pig.
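To make the DAG idea concrete, here is a minimal Python sketch of running data processing tasks in dependency order. It does not use the Tez API; the task names, the `run_dag` helper, and the sample data are all hypothetical, and the scheduling relies on the standard library's `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

def run_dag(tasks, deps):
    """Run each task after all of its upstream tasks; return results by name."""
    results = {}
    # static_order() yields a node only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        # Pass upstream outputs in the order they are listed in deps.
        inputs = [results[d] for d in deps.get(name, ())]
        results[name] = tasks[name](*inputs)
    return results

# Hypothetical pipeline: two independent reads feed a single join step.
tasks = {
    "read_orders": lambda: [("alice", 30), ("bob", 20), ("alice", 5)],
    "read_users": lambda: {"alice": "US", "bob": "DE"},
    "join": lambda orders, users: [(users[n], amt) for n, amt in orders],
}
# Each key depends on the tasks in its list.
deps = {"join": ["read_orders", "read_users"]}

results = run_dag(tasks, deps)
```

Because the graph is acyclic, every task has a well-defined point at which all of its inputs are ready. A real engine can use the same information to run independent tasks, such as the two reads here, at the same time.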
AWS stated that Phoenix uses HBase (another member of the Hadoop ecosystem) as its datastore. You can connect to Phoenix using the JDBC driver included on the cluster or from other applications running on the cluster. Either way, you get fast, low-latency SQL with full ACID transaction capabilities. SQL queries are compiled into a series of HBase scans; the scans run in parallel, and their results are aggregated to produce the result set.
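The scan-then-aggregate pattern described above can be illustrated with a short Python sketch. This is not Phoenix or HBase code: the in-memory `REGIONS` list, `scan_region`, and `parallel_aggregate` are hypothetical stand-ins that only show how partial results from parallel scans might be merged into one result.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "regions", each holding a slice of (key, value) rows.
REGIONS = [
    [("a1", 10), ("a2", 25)],
    [("m1", 40)],
    [("x1", 5), ("x2", 20)],
]

def scan_region(rows, min_value):
    """Scan one region and return a partial (count, total) for matching rows."""
    matched = [v for _, v in rows if v >= min_value]
    return len(matched), sum(matched)

def parallel_aggregate(regions, min_value):
    """Run one scan per region in parallel, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda r: scan_region(r, min_value), regions))
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total

result = parallel_aggregate(REGIONS, 20)
```

Each scan touches only its own slice of the data, so the scans do not need to coordinate with each other. Only the small partial results are combined at the end, roughly what a query such as `SELECT COUNT(*), SUM(value) ... WHERE value >= 20` would need.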
AWS also updated several Big Data applications, including HBase (1.2.1), Mahout (0.12.0), and Presto (0.147). A Redshift JDBC driver is also now available for working with data stored on clusters in the Redshift data warehouse service.
