
Working with HDFS - Hadoop Distributed File System

Architecture:

The core idea of HDFS is to split a file into blocks and store multiple copies of each block on different nodes. The advantage of this is that multiple operations can run on the blocks in parallel and the results can be aggregated later; keeping multiple copies also provides fault tolerance.

An HDFS cluster has the following node types:

1.) NameNode - Stores the metadata of each file, such as how many blocks the file is split into and which nodes hold copies of those blocks. It also decides how to split a file and where to store the blocks.

2.) Secondary NameNode - Despite the name, this is not a failover node for the NameNode. It periodically merges the NameNode's edit log into the filesystem image (checkpointing), which keeps the edit log small and speeds up NameNode restarts.

3.) DataNode - There can be any number of these, depending on requirements. These nodes store the actual data and send heartbeats to the NameNode periodically.

In a clustered Hadoop environment, each node runs on a separate physical machine. In a pseudo-distributed environment (like the Cloudera VM), all the nodes run on the same machine.
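To see these roles in action, you can ask the NameNode for a cluster report, which lists each DataNode along with its capacity and last heartbeat. A quick check (on some setups this must be run as the hdfs superuser, as shown here):

sudo -u hdfs hdfs dfsadmin -report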


How to start & stop the HDFS services:

On launching the Cloudera VM, all the Hadoop services are started by default. In case you are not able to access HDFS, use the commands below to start the HDFS services.

Start HDFS services:

On the NameNode: sudo service hadoop-hdfs-namenode start
On the Secondary NameNode (if used): sudo service hadoop-hdfs-secondarynamenode start
On each DataNode: sudo service hadoop-hdfs-datanode start
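Each of these service scripts also supports a status action, which is handy for checking a single daemon, e.g.:

sudo service hadoop-hdfs-namenode status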

In case you want to stop the HDFS services, use the commands below.

Stop HDFS services:

On the NameNode: sudo service hadoop-hdfs-namenode stop
On the Secondary NameNode (if used): sudo service hadoop-hdfs-secondarynamenode stop
On each DataNode: sudo service hadoop-hdfs-datanode stop
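To verify which HDFS daemons are actually running, list the Java processes with jps (bundled with the JDK on the Cloudera VM); when the services are up, the output should include NameNode, SecondaryNameNode and DataNode:

sudo jps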

Below are the commands to work with HDFS:

1.) To list the files in HDFS:

hadoop fs -ls /
hadoop fs -ls /user
hadoop fs -ls /user/cloudera
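To list a directory tree recursively, add the -R flag to -ls (supported in Hadoop 2 and later):

hadoop fs -ls -R /user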

2.) To put a file in HDFS:

Create a file testhdfs.txt in the local file system:

gedit testhdfs.txt
Enter some random data in this file and save it.

hadoop fs -put testhdfs.txt /user/cloudera
This will copy the file from the local file system to HDFS.

hadoop fs -ls /user/cloudera 
This will list the newly added file.
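Equivalently, -copyFromLocal behaves like -put but accepts only local sources. You can also create a target directory in HDFS first (testdir below is just an example name):

hadoop fs -mkdir /user/cloudera/testdir
hadoop fs -copyFromLocal testhdfs.txt /user/cloudera/testdir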

3.) To view the file in HDFS:

hadoop fs -cat /user/cloudera/testhdfs.txt

hadoop fs -tail /user/cloudera/testhdfs.txt
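-cat streams the entire file, while -tail shows only the last kilobyte. To copy the file back from HDFS to the local file system (the local name below is arbitrary):

hadoop fs -get /user/cloudera/testhdfs.txt testhdfs_copy.txt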


4.) To change the file permissions:

hadoop fs -chmod 777 /user/cloudera/testhdfs.txt
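You can confirm the new permissions with -ls. Ownership is changed in the same style with -chown, which typically requires the HDFS superuser:

hadoop fs -ls /user/cloudera/testhdfs.txt
sudo -u hdfs hadoop fs -chown cloudera:cloudera /user/cloudera/testhdfs.txt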

5.) To remove the file from HDFS:

hadoop fs -rm /user/cloudera/testhdfs.txt

Note: The file will still be available in your local file system.
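To remove a directory and its contents, add the -r flag (older Hadoop releases used -rmr instead):

hadoop fs -rm -r /user/cloudera/testdir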

File Explorer:

Hadoop provides a browser-based file explorer, which lets us browse the Hadoop file system without typing commands. It is part of the NameNode web UI, available at the URL below.

http://localhost:50070/explorer.html

Details such as block sizes, storage locations and data replication can be explored from the file explorer.
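The same block-level details are also available from the command line via fsck, which prints each block of a file together with the DataNodes holding its replicas:

hdfs fsck /user/cloudera/testhdfs.txt -files -blocks -locations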

Change Replication and Block Size:

The following file contains the Hadoop configuration:

/usr/lib/hadoop/etc/hadoop/hdfs-site.xml

or

/etc/hadoop/conf/hdfs-site.xml

The following properties can be altered:

- Block size (for 256 MB blocks, i.e. 256 × 1024 × 1024 = 268435456 bytes):

<property>
  <name>dfs.block.size</name>
  <value>268435456</value>
</property>

- Replication:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

If the above properties do not exist in the file, you can add them. Changes to hdfs-site.xml take effect only after the HDFS services are restarted, and the block size applies only to files written after the change. (On newer Hadoop releases the block size property is named dfs.blocksize; dfs.block.size is still accepted as a deprecated alias.)
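Replication can also be changed per file without editing the configuration: -setrep adjusts an existing file, and a -D generic option overrides the value for a single upload. A small sketch, reusing the testhdfs.txt file from above (testhdfs2.txt is just an example target name):

hadoop fs -setrep 2 /user/cloudera/testhdfs.txt
hadoop fs -D dfs.replication=2 -put testhdfs.txt /user/cloudera/testhdfs2.txt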
