Hadoop学习总结之二：HDFS读写过程解析2(zz)_xieys19860101_新浪博客

新浪博客

加载中…

http://blog.sina.com.cn/u/1899295813

首页博文目录关于我

个人资料

微博

加好友发纸条

写留言加关注

博客等级：
博客积分：

博客访问：
关注人气：
获赠金笔：0支
赠出金笔：0支
荣誉徽章：

正文字体大小：大中小

Hadoop学习总结之二：HDFS读写过程解析2(zz)

(2011-03-16 20:31:21)

标签：

杂谈

from: http://www.cnblogs.com/forfuture1978/archive/2010/11/10/1874222.html

三、文件的写入

下面解析向hdfs上传一个文件的过程。

3.1、客户端

上传一个文件到hdfs，一般会调用DistributedFileSystem.create，其实现如下：

public FSDataOutputStream create(Path f, FsPermission permission,

boolean overwrite,

int bufferSize, short replication, long blockSize,

Progressable progress) throws IOException {

return new FSDataOutputStream

(dfs.create(getPathName(f), permission,

overwrite, replication, blockSize, progress, bufferSize),

statistics);

}

其最终生成一个FSDataOutputStream用于向新生成的文件中写入数据。其成员变量dfs的类型为DFSClient，DFSClient的create函数如下：

public OutputStream create(String src,

FsPermission permission,

boolean overwrite,

short replication,

long blockSize,

Progressable progress,

int buffersize

) throws IOException {

checkOpen();

if (permission == null) {

permission = FsPermission.getDefault();

}

FsPermission masked = permission.applyUMask(FsPermission.getUMask(conf));

OutputStream result = new DFSOutputStream(src, masked,

overwrite, replication, blockSize, progress, buffersize,

conf.getInt("io.bytes.per.checksum", 512));

leasechecker.put(src, result);

return result;

}

其中构造了一个DFSOutputStream，在其构造函数中，同过RPC调用NameNode的create来创建一个文件。
当然，构造函数中还做了一件重要的事情，就是streamer.start()，也即启动了一个pipeline，用于写数据，在写入数据的过程中，我们会仔细分析。

DFSOutputStream(String src, FsPermission masked, boolean overwrite,

short replication, long blockSize, Progressable progress,

int buffersize, int bytesPerChecksum) throws IOException {

this(src, blockSize, progress, bytesPerChecksum);

computePacketChunkSize(writePacketSize, bytesPerChecksum);

try {

namenode.create(

src, masked, clientName, overwrite, replication, blockSize);

} catch(RemoteException re) {

throw re.unwrapRemoteException(AccessControlException.class,

QuotaExceededException.class);

}

streamer.start();

}

3.2、NameNode

NameNode的create函数调用namesystem.startFile函数，其又调用startFileInternal函数，实现如下：

private synchronized void startFileInternal(String src,

PermissionStatus permissions,

String holder,

String clientMachine,

boolean overwrite,

boolean append,

short replication,

long blockSize

) throws IOException {

......

//创建一个新的文件，状态为under construction，没有任何data block与之对应

long genstamp = nextGenerationStamp();

INodeFileUnderConstruction newNode = dir.addFile(src, permissions,

replication, blockSize, holder, clientMachine, clientNode, genstamp);

......

}

3.3、客户端

下面轮到客户端向新创建的文件中写入数据了，一般会使用FSDataOutputStream的write函数，最终会调用DFSOutputStream的writeChunk函数：

按照hdfs的设计，对block的数据写入使用的是pipeline的方式，也即将数据分成一个个的package，如果需要复制三分，分别写入DataNode 1, 2, 3，则会进行如下的过程：

首先将package 1写入DataNode 1 然后由DataNode 1负责将package 1写入DataNode 2，同时客户端可以将pacage 2写入DataNode 1 然后DataNode 2负责将package 1写入DataNode 3, 同时客户端可以讲package 3写入DataNode 1，DataNode 1将package 2写入DataNode 2 就这样将一个个package排着队的传递下去，直到所有的数据全部写入并复制完毕

protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum)

throws IOException {

//创建一个package，并写入数据

currentPacket = new Packet(packetSize, chunksPerPacket,

bytesCurBlock);

currentPacket.writeChecksum(checksum, 0, cklen);

currentPacket.writeData(b, offset, len);

currentPacket.numChunks++;

bytesCurBlock += len;

//如果此package已满，则放入队列中准备发送

if (currentPacket.numChunks == currentPacket.maxChunks ||

bytesCurBlock == blockSize) {

......

dataQueue.addLast(currentPacket);

//唤醒等待dataqueue的传输线程，也即DataStreamer

dataQueue.notifyAll();

currentPacket = null;

......

}

}

DataStreamer的run函数如下：

public void run() {

while (!closed && clientRunning) {

Packet one = null;

synchronized (dataQueue) {

//如果队列中没有package，则等待

while ((!closed && !hasError && clientRunning

&& dataQueue.size() == 0) || doSleep) {

try {

dataQueue.wait(1000);

} catch (InterruptedException e) {

}

doSleep = false;

}

try {

//得到队列中的第一个package

one = dataQueue.getFirst();

long offsetInBlock = one.offsetInBlock;

//由NameNode分配block，并生成一个写入流指向此block

if (blockStream == null) {

nodes = nextBlockOutputStream(src);

response = new ResponseProcessor(nodes);

response.start();

}

ByteBuffer buf = one.getBuffer();

//将package从dataQueue移至ackQueue,等待确认

dataQueue.removeFirst();

dataQueue.notifyAll();

synchronized (ackQueue) {

ackQueue.addLast(one);

ackQueue.notifyAll();

}

//利用生成的写入流将数据写入DataNode中的block

blockStream.write(buf.array(), buf.position(), buf.remaining());

if (one.lastPacketInBlock) {

blockStream.writeInt(0); //表示此block写入完毕

}

blockStream.flush();

} catch (Throwable e) {

}

}

......

}

其中重要的一个函数是nextBlockOutputStream，实现如下：

private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {

LocatedBlock lb = null;

boolean retry = false;

DatanodeInfo[] nodes;

int count = conf.getInt("dfs.client.block.write.retries", 3);

boolean success;

do {

......

//由NameNode为文件分配DataNode和block

lb = locateFollowingBlock(startTime);

block = lb.getBlock();

nodes = lb.getLocations();

//创建向DataNode的写入流

success = createBlockOutputStream(nodes, clientName, false);

......

} while (retry && --count >= 0);

return nodes;

}

locateFollowingBlock中通过RPC调用namenode.addBlock(src, clientName)函数

3.4、NameNode

NameNode的addBlock函数实现如下：

public LocatedBlock addBlock(String src,

String clientName) throws IOException {

LocatedBlock locatedBlock = namesystem.getAdditionalBlock(src, clientName);

return locatedBlock;

}

FSNamesystem的getAdditionalBlock实现如下：

public LocatedBlock getAdditionalBlock(String src,

String clientName

) throws IOException {

long fileLength, blockSize;

int replication;

DatanodeDescriptor clientNode = null;

Block newBlock = null;

......

//为新的block选择DataNode

DatanodeDescriptor targets[] = replicator.chooseTarget(replication,

clientNode,

null,

blockSize);

......

//得到文件路径中所有path的INode，其中最后一个是新添加的文件对的INode，状态为under construction

INode[] pathINodes = dir.getExistingPathINodes(src);

int inodesLen = pathINodes.length;

INodeFileUnderConstruction pendingFile = (INodeFileUnderConstruction)

pathINodes[inodesLen - 1];

//为文件分配block, 并设置在那写DataNode上

newBlock = allocateBlock(src, pathINodes);

pendingFile.setTargets(targets);

......

return new LocatedBlock(newBlock, targets, fileLength);

}

3.5、客户端

在分配了DataNode和block以后，createBlockOutputStream开始写入数据。

private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client,

boolean recoveryFlag) {

//创建一个socket，链接DataNode

InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());

s = socketFactory.createSocket();

int timeoutValue = 3000 * nodes.length + socketTimeout;

s.connect(target, timeoutValue);

s.setSoTimeout(timeoutValue);

s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);

long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +

datanodeWriteTimeout;

DataOutputStream out = new DataOutputStream(

new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout),

DataNode.SMALL_BUFFER_SIZE));

blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));

//写入指令

out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );

out.write( DataTransferProtocol.OP_WRITE_BLOCK );

out.writeLong( block.getBlockId() );

out.writeLong( block.getGenerationStamp() );

out.writeInt( nodes.length );

out.writeBoolean( recoveryFlag );

Text.writeString( out, client );

out.writeBoolean(false);

out.writeInt( nodes.length - 1 );

//注意，次循环从1开始，而非从0开始。将除了第一个DataNode以外的另外两个DataNode的信息发送给第一个DataNode, 第一个DataNode可以根据此信息将数据写给另两个DataNode

for (int i = 1; i < nodes.length; i++) {

nodes[i].write(out);

}

checksum.writeHeader( out );

out.flush();

firstBadLink = Text.readString(blockReplyStream);

if (firstBadLink.length() != 0) {

throw new IOException("Bad connect ack with firstBadLink " + firstBadLink);

}

blockStream = out;

}

客户端在DataStreamer的run函数中创建了写入流后，调用blockStream.write将数据写入DataNode

3.6、DataNode

DataNode的DataXceiver中，收到指令DataTransferProtocol.OP_WRITE_BLOCK则调用writeBlock函数：

private void writeBlock(DataInputStream in) throws IOException {

DatanodeInfo srcDataNode = null;

//读入头信息

Block block = new Block(in.readLong(),

dataXceiverServer.estimateBlockSize, in.readLong());

int pipelineSize = in.readInt(); // num of datanodes in entire pipeline

boolean isRecovery = in.readBoolean(); // is this part of recovery?

String client = Text.readString(in); // working on behalf of this client

boolean hasSrcDataNode = in.readBoolean(); // is src node info present

if (hasSrcDataNode) {

srcDataNode = new DatanodeInfo();

srcDataNode.readFields(in);

}

int numTargets = in.readInt();

if (numTargets < 0) {

throw new IOException("Mislabelled incoming datastream.");

}

//读入剩下的DataNode列表，如果当前是第一个DataNode，则此列表中收到的是第二个，第三个DataNode的信息，如果当前是第二个DataNode，则受到的是第三个DataNode的信息

DatanodeInfo targets[] = new DatanodeInfo[numTargets];

for (int i = 0; i < targets.length; i++) {

DatanodeInfo tmp = new DatanodeInfo();

tmp.readFields(in);

targets[i] = tmp;

}

DataOutputStream mirrorOut = null; // stream to next target

DataInputStream mirrorIn = null; // reply from next target

DataOutputStream replyOut = null; // stream to prev target

Socket mirrorSock = null; // socket to next target

BlockReceiver blockReceiver = null; // responsible for data handling

String mirrorNode = null; // the name:port of next target

String firstBadLink = ""; // first datanode that failed in connection setup

try {

//生成一个BlockReceiver, 其有成员变量DataInputStream in为从客户端或者上一个DataNode读取数据，还有成员变量DataOutputStream mirrorOut，用于向下一个DataNode写入数据，还有成员变量OutputStream out用于将数据写入本地。

blockReceiver = new BlockReceiver(block, in,

s.getRemoteSocketAddress().toString(),

s.getLocalSocketAddress().toString(),

isRecovery, client, srcDataNode, datanode);

// get a connection back to the previous target

replyOut = new DataOutputStream(

NetUtils.getOutputStream(s, datanode.socketWriteTimeout));

//如果当前不是最后一个DataNode，则同下一个DataNode建立socket连接

if (targets.length > 0) {

InetSocketAddress mirrorTarget = null;

// Connect to backup machine

mirrorNode = targets[0].getName();

mirrorTarget = NetUtils.createSocketAddr(mirrorNode);

mirrorSock = datanode.newSocket();

int timeoutValue = numTargets * datanode.socketTimeout;

int writeTimeout = datanode.socketWriteTimeout +

(HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);

mirrorSock.connect(mirrorTarget, timeoutValue);

mirrorSock.setSoTimeout(timeoutValue);

mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);

//创建向下一个DataNode写入数据的流

mirrorOut = new DataOutputStream(

new BufferedOutputStream(

NetUtils.getOutputStream(mirrorSock, writeTimeout),

SMALL_BUFFER_SIZE));

mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));

mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );

mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );

mirrorOut.writeLong( block.getBlockId() );

mirrorOut.writeLong( block.getGenerationStamp() );

mirrorOut.writeInt( pipelineSize );

mirrorOut.writeBoolean( isRecovery );

Text.writeString( mirrorOut, client );

mirrorOut.writeBoolean(hasSrcDataNode);

if (hasSrcDataNode) { // pass src node information

srcDataNode.write(mirrorOut);

}

mirrorOut.writeInt( targets.length - 1 );

//此出也是从1开始，将除了下一个DataNode的其他DataNode信息发送给下一个DataNode

for ( int i = 1; i < targets.length; i++ ) {

targets[i].write( mirrorOut );

}

blockReceiver.writeChecksumHeader(mirrorOut);

mirrorOut.flush();

}

//使用BlockReceiver接受block

String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;

blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,

mirrorAddr, null, targets.length);

......

} finally {

// close all opened streams

IOUtils.closeStream(mirrorOut);

IOUtils.closeStream(mirrorIn);

IOUtils.closeStream(replyOut);

IOUtils.closeSocket(mirrorSock);

IOUtils.closeStream(blockReceiver);

}

}

BlockReceiver的receiveBlock函数中，一段重要的逻辑如下：

void receiveBlock(

DataOutputStream mirrOut, // output to next datanode

DataInputStream mirrIn, // input from next datanode

DataOutputStream replyOut, // output to previous datanode

String mirrAddr, BlockTransferThrottler throttlerArg,

int numTargets) throws IOException {

......

//不断的接受package，直到结束

while (receivePacket() > 0) {}

if (mirrorOut != null) {

try {

mirrorOut.writeInt(0); // mark the end of the block

mirrorOut.flush();

} catch (IOException e) {

handleMirrorOutError(e);

}

}

......

}

BlockReceiver的receivePacket函数如下：

private int receivePacket() throws IOException {

//从客户端或者上一个节点接收一个package

int payloadLen = readNextPacket();

buf.mark();

//read the header

buf.getInt(); // packet length

offsetInBlock = buf.getLong(); // get offset of packet in block

long seqno = buf.getLong(); // get seqno

boolean lastPacketInBlock = (buf.get() != 0);

int endOfHeader = buf.position();

buf.reset();

setBlockPosition(offsetInBlock);

//将package写入下一个DataNode

if (mirrorOut != null) {

try {

mirrorOut.write(buf.array(), buf.position(), buf.remaining());

mirrorOut.flush();

} catch (IOException e) {

handleMirrorOutError(e);

}

}

buf.position(endOfHeader);

int len = buf.getInt();

offsetInBlock += len;

int checksumLen = ((len + bytesPerChecksum - 1)/bytesPerChecksum)*

checksumSize;

int checksumOff = buf.position();

int dataOff = checksumOff + checksumLen;

byte pktBuf[] = buf.array();

buf.position(buf.limit()); // move to the end of the data.

......

//将数据写入本地的block

out.write(pktBuf, dataOff, len);

/// flush entire packet before sending ack

flush();

// put in queue for pending acks

if (responder != null) {

((PacketResponder)responder.getRunnable()).enqueue(seqno,

lastPacketInBlock);

}

return payloadLen;

}

评论这张

转发至微博

阅读┊ 收藏 ┊ 喜欢 ▼ ┊打印┊举报/Report

前一篇：Hadoop学习总结之二：HDFS读写过程解析1(zz)

后一篇：Hadoop学习总结之三：Map-Reduce入门(zz)

新浪BLOG意见反馈留言板　欢迎批评指正

新浪简介 | About Sina | 广告服务 | 联系我们 | 招聘信息 | 网站律师 | SINA English | 产品答疑

Copyright © 1996 - 2022 SINA Corporation, All Rights Reserved

新浪公司版权所有