Core | |
---|---|
org.apache.hadoop | |
org.apache.hadoop.classification | |
org.apache.hadoop.conf | Configuration of system parameters. |
org.apache.hadoop.fs | An abstract file system API. |
org.apache.hadoop.fs.ftp | |
org.apache.hadoop.fs.kfs | A client for the Kosmos filesystem (KFS). |
org.apache.hadoop.fs.permission | |
org.apache.hadoop.fs.s3 | A distributed, block-based implementation of FileSystem that uses Amazon S3 as a backing store. |
org.apache.hadoop.fs.s3native | A distributed implementation of FileSystem for reading and writing files on Amazon S3. |
org.apache.hadoop.io | Generic i/o code for use when reading and writing data to the network, to databases, and to files. |
org.apache.hadoop.io.compress | |
org.apache.hadoop.io.file.tfile | |
org.apache.hadoop.io.nativeio | |
org.apache.hadoop.io.serializer | This package provides a mechanism for using different serialization frameworks in Hadoop. |
org.apache.hadoop.io.serializer.avro | This package provides Avro serialization in Hadoop. |
org.apache.hadoop.log | |
org.apache.hadoop.metrics | This package defines an API for reporting performance metric information. |
org.apache.hadoop.metrics.file | Implementation of the metrics package that writes the metrics to a file. |
org.apache.hadoop.metrics.ganglia | Implementation of the metrics package that sends metric data to Ganglia. |
org.apache.hadoop.metrics.spi | The Service Provider Interface for the Metrics API. |
org.apache.hadoop.net | Network-related classes. |
org.apache.hadoop.record | (DEPRECATED) Hadoop record I/O contains classes and a record description language translator for simplifying serialization and deserialization of records in a language-neutral manner. |
org.apache.hadoop.record.compiler | (DEPRECATED) This package contains classes needed for code generation from the Hadoop record compiler. |
org.apache.hadoop.record.compiler.ant | |
org.apache.hadoop.record.compiler.generated | (DEPRECATED) This package contains code generated by JavaCC from the Hadoop record syntax file rcc.jj. |
org.apache.hadoop.record.meta | |
org.apache.hadoop.security | |
org.apache.hadoop.security.authorize | |
org.apache.hadoop.security.token | |
org.apache.hadoop.security.token.delegation | |
org.apache.hadoop.util | Common utilities. |
org.apache.hadoop.util.bloom | |
org.apache.hadoop.util.hash | |
contrib: FailMon | |
---|---|
org.apache.hadoop.contrib.failmon | |
Hadoop is a distributed computing platform.
Hadoop primarily consists of the Hadoop Distributed FileSystem (HDFS) and an implementation of the Map-Reduce programming paradigm.
Hadoop is a software framework that lets one easily write and run applications that process vast amounts of data. Here's what makes Hadoop especially useful:
Scalable: Hadoop can reliably store and process petabytes of data.
Economical: It distributes the data and processing across clusters of commonly available computers, which can number into the thousands of nodes.
Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located, which makes processing extremely rapid.
Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks when nodes fail.
If your platform does not have the required software (Java 1.6.x, ssh with a running sshd, and rsync), you will have to install it.
For example on Ubuntu Linux:
$ sudo apt-get install ssh
$ sudo apt-get install rsync
On Windows, if you did not install the required software when you installed Cygwin, start the Cygwin installer and select the openssh package (in the Net category).
First, you need to get a copy of the Hadoop code.
Edit the file conf/hadoop-env.sh to define at least JAVA_HOME.
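For example, a line like the following, where the JDK path is illustrative and should match your own installation:
export JAVA_HOME=/usr/lib/jvm/java-6-sun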
Try the following command:
bin/hadoop
This will display the documentation for the Hadoop command script.
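For instance, version is one of the subcommands the script documents; it simply prints the version of the build:
bin/hadoop version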
By default, Hadoop is configured to run things in a non-distributed mode, as a single Java process. This is useful for debugging, and can be demonstrated as follows:
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
cat output/*
This will display counts for each match of the regular expression.
Note that input is specified as a directory containing input files and that output is also specified as a directory where parts are written.
Hadoop can also be run on a single node in a pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process. (We also set the HDFS replication level to 1 in order to reduce warnings when running on a single node.)
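Concretely, a minimal pseudo-distributed setup amounts to entries like the following in the conf/*.xml files (the localhost addresses and ports shown are the conventional quickstart values, not requirements):
conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>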
Now check that the command
ssh localhost
does not
require a password. If it does, execute the following commands:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
A new distributed filesystem must be formatted with the following command, run on the master node:
bin/hadoop namenode -format
The Hadoop daemons are started with the following command:
bin/start-all.sh
Daemon log output is written to the logs/ directory.
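One quick check that the daemons came up is the JDK's jps tool, which lists the running Java processes; a healthy single-node setup should show a NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker:
jps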
Input files are copied into the distributed filesystem as follows:
bin/hadoop fs -put input input
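The copy can be verified with the ls subcommand of the same fs shell:
bin/hadoop fs -ls input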
Things are run as before, but output must be copied out of the distributed filesystem to be examined locally:
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -get output output
cat output/*
When you're done, stop the daemons with:
bin/stop-all.sh
Fully distributed operation is just like the pseudo-distributed operation described above, except that the filesystem and job tracker addresses must point at your master node rather than localhost, with an HDFS replication level appropriate for the cluster.
Finally, list all slave hostnames or IP addresses in your conf/slaves file, one per line. Then format your filesystem and start your cluster on your master node, as above.
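For example, a three-node cluster's conf/slaves file might read (the hostnames are illustrative):
slave1.example.com
slave2.example.com
slave3.example.com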