Developing for HDInsight

Windows Azure HDInsight provides the capability to dynamically provision clusters running Apache Hadoop to process Big Data.  You can find more information here in the initial blog post for this series, and you can click here to get started using it in the Windows Azure portal.  This post enumerates the different ways a developer can interact with HDInsight, first discussing the scenarios and then diving into the variety of capabilities in HDInsight.  Because HDInsight is built on top of Apache Hadoop, there is a broad and rich ecosystem of tools and capabilities that one can leverage.

In terms of scenarios, as we've worked with customers, two distinct ones have emerged: authoring jobs, where one uses the tools to process big data, and integrating HDInsight with your application, where the input and output of jobs are incorporated into a larger application architecture.  One key design aspect of HDInsight is the integration with Windows Azure Blob Storage as the default file system.  This means that in order to interact with data, you can use existing tools and APIs for accessing data in blob storage.  This blog post goes into more detail on our utilization of Blob Storage.
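
As a concrete illustration, here is a minimal sketch that lists files using the standard Hadoop FileSystem API; it assumes a cluster where Blob Storage is the default file system, and the path /example/data is a placeholder.  Nothing HDInsight-specific is required:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDefaultFileSystem {
        public static void main(String[] args) throws IOException {
            // On an HDInsight cluster, the default file system (fs.default.name)
            // points at Blob Storage, so this path resolves inside the
            // configured container rather than a local HDFS volume.
            FileSystem fs = FileSystem.get(new Configuration());
            for (FileStatus status : fs.listStatus(new Path("/example/data"))) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }

The same data remains reachable from outside the cluster through the regular blob storage tools and APIs.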

Within the context of authoring jobs, there is a wide array of tools available.  From a high level, these fall into three groups: tools that are part of the existing Hadoop ecosystem, projects we've built to get .NET developers started with Hadoop, and work we've begun on leveraging JavaScript to interact with Hadoop.

Job Authoring

Existing Hadoop Tools

As HDInsight leverages Apache Hadoop via the Hortonworks Data Platform, there is a high degree of fidelity with the Hadoop ecosystem.  As such, many capabilities work “as-is,” and existing investments and knowledge in any of the following tools carry over directly to HDInsight.  Clusters are created with the following Apache projects for distributed processing:

  • Map/Reduce
    • Map/Reduce is the foundation of distributed processing in Hadoop.  One can write jobs in Java or leverage other languages and runtimes through the use of Hadoop Streaming (a word-count sketch in Java follows this list).
    • A simple guide to writing Map/Reduce jobs on HDInsight is available here.
  • Hive
    • Hive uses a syntax similar to SQL to express queries that compile to a set of Map/Reduce programs.  Hive has support for many of the constructs that one would expect in SQL (aggregation, groupings, filtering, etc.), and easily parallelizes across the nodes in your cluster.
    • A guide to using Hive is here.
  • Pig
    • Pig is a dataflow platform whose scripts, written in a language called Pig Latin, compile to a series of Map/Reduce programs.
    • A guide to getting started with Pig on HDInsight is here.
  • Oozie
    • Oozie is a workflow scheduler for managing a directed acyclic graph of actions, where actions can be Map/Reduce, Pig, Hive or other jobs.  You can find more details in the quick start guide here.
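
To make the Map/Reduce option above concrete, here is a minimal word-count sketch against the standard Hadoop Java API; the input and output paths passed on the command line are placeholders:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Emits (word, 1) for every whitespace-separated token in each input line.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Sums the counts emitted for each word; also used as the combiner.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because Blob Storage is the default file system, those input and output paths resolve into the cluster's storage container without any special handling.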

You can find an updated list of Hadoop components here.  The table below lists the component versions in the current preview:

Component         Version
Apache Hadoop     1.0.3
Apache Hive       0.9.0
Apache Pig        0.9.3
Apache Sqoop      1.4.2
Apache Oozie      3.2.0
Apache HCatalog   0.4.1
Apache Templeton  0.1.4


Additionally, other projects in the Hadoop space, such as Mahout (see this sample) or Cascading, can easily be used on top of HDInsight.  We will be publishing additional blog posts on these topics in the future.

.NET Tooling

We're working to build out a portfolio of tools that let developers apply their skills and investments in .NET to Hadoop.  These projects are hosted on CodePlex, with NuGet packages available for authoring jobs that run on HDInsight.  For instructions, please see the getting started pages on the CodePlex site.

Running Jobs

In order to run any of these jobs, there are a few options:

  • Run them directly from the head node.  To do this, RDP to your cluster, open the Hadoop command prompt, and use the command-line tools directly.
  • Submit them remotely using the REST APIs on the cluster (see the following section on integrating HDInsight with your applications for more details; a minimal Java submission sketch follows this list).
  • Leverage tools on the HDInsight dashboard.  After you create your cluster, there are a few capabilities in the cluster dashboard for submitting jobs:
    • Create Job
    • Interactive Console
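
As a rough sketch of the second option (remote submission), the following posts a Hive query to the cluster's Templeton (WebHCat) endpoint over HTTPS with basic authentication.  The cluster name, credentials, and query below are placeholders, and the exact endpoint and parameters should be verified against the Templeton documentation for your cluster's version:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;
    import javax.xml.bind.DatatypeConverter;

    public class SubmitHiveJob {
        public static void main(String[] args) throws Exception {
            // Placeholder cluster name and credentials.
            URL url = new URL("https://mycluster.azurehdinsight.net/templeton/v1/hive");
            String auth = DatatypeConverter.printBase64Binary("admin:password".getBytes("UTF-8"));

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Authorization", "Basic " + auth);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            conn.setDoOutput(true);

            // Templeton expects form-encoded parameters; "execute" carries the
            // HiveQL to run. The query and table are placeholders.
            String body = "user.name=admin&execute="
                    + URLEncoder.encode("SELECT COUNT(*) FROM hivesampletable;", "UTF-8");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes("UTF-8"));
            }

            System.out.println("HTTP " + conn.getResponseCode());
        }
    }

A successful submission returns a JSON body containing a job id, which can then be polled through Templeton's queue resource to track the job's progress.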
