
Data Miners Blog

(2011-10-19 22:06:44)
Tags: Miscellaneous

Category: mapreduce

A place to read about topics of interest to data miners, ask questions of the data mining experts at Data Miners, Inc., and discuss the books of Gordon Linoff and Michael Berry.


Saturday, January 9, 2010

Hadoop and Parallel Dataflow Programming

Over the past three months, I have been teaching myself enough Hadoop to get comfortable with using the environment for analytic purposes.

There has been a lot of commentary about Hadoop/MapReduce versus relational databases (such as the articles referenced in my previous post on the subject). I actually think this discussion is misplaced because comparing open-source software with commercial software aligns people on "religious" grounds. Some people will like anything that is open-source. Some people will attack anything that is open-source (especially people who work for commercial software vendors). And, the merits of real differences get lost. Both Hadoop and relational databases are powerful systems for analyzing data, and each has its own distinct set of advantages and disadvantages.

Instead, I think that Hadoop should be compared to a parallel dataflow style of programming. What is a dataflow style of programming? It is a style where we watch the data flow through different operations, forking and combining along the way, to achieve the desired goal. Not only is a dataflow a good way to understand relational databases (which is why I introduce it in Chapter 1 of Data Analysis Using SQL and Excel), but the underlying engines that run SQL queries are dataflow engines.

Parallel dataflows extend dataflow processing to grid computing. To my knowledge, the first commercial tool that implements parallel dataflows was developed by Ab Initio. This company was a spin-off from a bleeding edge parallel supercomputer vendor called Thinking Machines that went bankrupt in 1994. As a matter of full disclosure: Ab Initio was actually formed from the group that I worked for at Thinking Machines. Although they are very, very, very resistant to sharing information about their technology, I am rather familiar with it. I believe that the only publicly available information about them (including screen shots) is published in our book Mastering Data Mining: The Art and Science of Customer Relationship Management.

I am confident that Apache has at least one dataflow project, since when I google "dataflow apache" I get a pointer to the Dapper project. My wish, however, is that Hadoop were the parallel dataflow project.

Much of what Hadoop does goes unheralded by the typical MapReduce user. On a massively parallel system, Hadoop keeps track of the different parts of an HDFS file and, when the file is being used for processing, Hadoop does its darndest to keep the processing local to each file part being processed. This is great, since data locality is key to achieving good performance.

Hadoop also keeps track of which processors and disk systems are working. When there is a failure, Hadoop tries again, insulating the user from sporadic hardware faults.

Hadoop also does a pretty good job of shuffling data around, between the map and reduce operations. The shuffling method -- sort, send, and sort again -- may not be the most efficient but it is quite general.

Alas, there are several things that Hadoop does not do, at least when accessed through the MapReduce interface. Supporting these features would allow it to move beyond the MapReduce paradigm, giving it the power to support more general parallel dataflow constructs.

The first thing that bothers me about Hadoop is that I cannot easily take a text file and just copy it with the Map/Reduce primitives. Copying a file seems like something that should be easy. The problem is that a key gets generated during the map processing. The original data gets output with a key prepended, unless I do a lot of work to parse out the first field and use it as a key.

Could the context.write() function be overloaded with a version that does not output a key? Perhaps this would only be possible in the reduce phase, since I understand the importance of the key for going from map to reduce.
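
To make this concrete, here is a minimal sketch (my own illustration, not code from any of these posts) of a pass-through mapper in the 0.20 mapreduce API. The framework requires some key on every record the map emits, and with the default TextOutputFormat that key ends up prepended to each output line, so even a plain copy is not byte-for-byte.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A pass-through mapper: every record still needs a key, so with
// TextOutputFormat the "copy" comes out as "offset<tab>line" rather
// than as an exact copy of the original text file.
public class IdentityCopyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);   // key is the byte offset of the line
    }
}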

A performance issue with Hadoop is the shuffle phase between the map and the reduce. As I mentioned earlier, the sort-send-sort process is quite general. Alas, though, it requires a lot of work. An alternative that often works well is simply hashing. To maintain the semantics of map-reduce, I think this would be hash-send-combine or hash-send-sort. The beauty of using hashing is that the data can be sent to its destination while the map is still processing it. This allows concurrent use of the processing and network during this operation.

And, speaking of performance, why does the key have to go before the data? Why can't I just point to a sequence of bytes and use that for the key? This would enable a programming style that doesn't spend so much time parsing keys and duplicating information between values and keys.

Perhaps the most frustrating aspect of Hadoop is the MapReduce framework itself. The current version allows processing like (M+)(R)(M*). What this notation means is that the processing starts with one or more map jobs, goes to a reduce, and continues with zero or more map jobs.

THIS IS NOT GENERAL ENOUGH! I would like to have an arbitrary number of maps and reduces connected however I like. So, one map could feed two different reduces, each having different keys. At the same time, one of the reduces could feed another reduce without having to go through an intermediate map phase.

This would be a big step toward parallel dataflow programming, since Map and Reduce are two very powerful primitives for this purpose.

There are some other primitives that might be useful. One would be broadcast. This would take the output from one processing node during one phase and send it to all the other nodes (in the next phase). Let's just say that using broadcast, it would be much easier to send variables around for processing. No more defining weird variables using "set" in the main program, and then parsing them in setup() functions. No more setting up temporary storage space, shared by all the processors. No more using HDFS to store small serial files, local to only one node. Just send data through a broadcast, and it goes everywhere. (If the broadcast is running on more than one node, then the results would be concatenated together, everywhere.)
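
For contrast, here is a hedged sketch of the workaround this paragraph alludes to: the driver pushes a value into the Configuration with "set", and each task reads it back in setup(). The parameter name example.country.filter is made up purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class BroadcastWorkaround {

    // Each map task reads the "broadcast" value back once, in setup().
    public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String countryFilter;

        @Override
        protected void setup(Context context) {
            countryFilter = context.getConfiguration().get("example.country.filter");
        }
        // map() would then use countryFilter; omitted here.
    }

    public static void main(String[] args) throws Exception {
        // Driver side: stash the value in the configuration before submitting the job.
        Configuration conf = new Configuration();
        conf.set("example.country.filter", "US");   // hypothetical parameter name
        Job job = new Job(conf, "broadcast workaround");
        job.setMapperClass(FilterMapper.class);
        // input/output paths, reducer, and so on omitted
    }
}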

And, if I had a broadcast, then my two-pass row number code (here) would only require one pass.

I think Hadoop already supports having multiple different input files into one reduce operator. This is quite powerful, and a much superior way of handling join processing.

It would also be nice to have a final sort operator. In the real world, people often do want sorted results.

In conclusion, parallel dataflows are a very powerful, expressive, and efficient way of implementing complex data processing tasks. Relational databases use dataflow engines for their processing. Using non-procedural languages such as SQL, the power of dataflows is hidden from the user -- and, some relatively simple dataflow constructs can be quite difficult to express in SQL.

Hadoop is a powerful system that emulates parallel dataflow programming. Any step in a dataflow can be implemented using a MapReduce pass -- but this requires reading, writing, sorting, and sending the data multiple times. With a few more features, Hadoop could efficiently implement parallel dataflows. I feel this would be a big boost to both performance and utility, and it would leverage the power already provided by the Hadoop framework.

Tuesday, January 5, 2010

MapReduce versus Relational Databases?

The current issue of Communications of the ACM has articles on MapReduce and relational databases. One, MapReduce: A Flexible Data Processing Tool, explains the utility of MapReduce by two Google fellows -- appropriate authors, since Google invented the parallel MapReduce paradigm.

The second article, MapReduce and Parallel DBMSs: Friend or Foe, is written by a team of authors, with Michael Stonebraker listed as the first author. I am uncomfortable with this article, because the article purports to show the superiority of a particular database system, Vertica, without mentioning -- anywhere -- that Michael Stonebraker is listed as the CTO and Co-Founder on Vertica's web site. For this reason, I believe that this article should be subject to much more scrutiny.

Before starting, let me state that I personally have no major relationships with any of the database vendors or with companies in the Hadoop/MapReduce space. I am an advocate of using relational databases for data analysis and have written a book called Data Analysis Using SQL and Excel. And, over the past three months, I have been learning Hadoop and MapReduce, as attested to by numerous blog postings on the subject. Perhaps because I am a graduate of MIT ('85), I am upset that Michael Stonebraker uses his MIT affiliation for this article, without mentioning his Vertica affiliation.

The first thing I notice about the article is the number of references to Vertica. In the main text, I count nine references to Vertica, as compared to thirteen mentions of other databases:
  • Aster (twice)
  • DataAllegro (once)
  • DB2 (twice)
  • Greenplum (twice)
  • Netezza (once)
  • ParAccel (once)
  • PostgreSQL (once)
  • SQL Server (once)
  • Teradata (once)
The paper describes a study which compares Vertica, another database, and Hadoop on various tasks. The paper never explains how these databases were chosen for this purpose. Configuration issues for the other database and Hadoop are mentioned. The configuration and installation of Vertica -- by the absence of problems -- one assumes is easy and smooth. I have not (yet) read the paper cited, which describes the work in more detail.

Also, the paper never describes costs for the different systems, which is a primary driver of MapReduce. The software is free and runs on cheap clusters of computers, rather than expensive servers and hardware. For a given amount of money, MapReduce may provide a much faster solution, since it can support much larger hardware environments.

The paper never describes issues in the loading of data. I assume this is a significant cost for the databases. Loading the data for Hadoop is much simpler . . . since it just reads text files, which is a common format.

From what I can gather, the database systems were optimized specifically for the tasks at hand, although this is not explicitly mentioned anywhere. For instance, the second task is a GROUP BY, and I suspect that the data is hash partitioned by the GROUP BY clause.

There are a few statements that I basically disagree with.

"Lastly, the reshuffle that occurs between the Map and Reduce tasks in MR is equivalent to a GROUP BY operation in SQL." The issue here at first seems like a technicality. In a relational database, an input row can only go into one group. MR can output multiple records in the map stage, so a single row can go into multiple "groups". This functionality is important for the word count example, which is the canonical MapReduce example. I find it interesting that this example is not included in the benchmark.

"Given this, parallel DBMSs provide the same computing model as MR, with the added benefit of using a declarative language (SQL)."“在这种情况下,并行dbms提供同样的计算模型,通过增加先生使用声明性语言的利益(SQL)。”This is not true in several respects. 这不是真的在几个方面。First, MapReduce does have associated projects for supporting declarative languages. 首先,MapReduce有相关计划支持的声明的语言。Second, in order for SQL to support the level of functionality that the authors claim, they need to use user defined functions. 第二,为了支持水平的SQL的功能,作者声称,他们需要使用用户定义的功能。Is that syntax declarative?那是语法的声明吗?

More importantly, though, is that the computing model really is not exactly the same. Well, with SQL extensions such as GROUPING SETs and window functions, the functionality does come close. But, consider the ways that you can add a row number to data (assuming that you have no row number function built-in) using MapReduce versus traditional SQL. Using MapReduce you can follow the two-phase program that I described in an earlier posting. With traditional SQL, you have to do a non-equi-self join. MapReduce has a much richer set of built-in functions and capabilities, simply because it uses java, an established programming language with many libraries.

On the other hand, MapReduce does not have a concept of "null" built-in (although users can define their own data types and semantics). And, MapReduce handles non-equijoins poorly, because the key is used to direct both tables to the same node. In effect, you have to limit the MapReduce job to one node. SQL can still parallelize such queries.

"[MapReduce] still requires user code to parse the value portion of the record if it contains multiple attributes." Well, parse is the wrong term, since a Writable class supports binary representations of data types. I describe how to create such types here.

I don't actually feel qualified to comment on many of the operational aspects of optimizing Hadoop code. I do note that the authors do not explain the main benefit of Vertica, which is the support of column partitioning. Each column is stored separately, which makes it possible to apply very strong compression algorithms to the data. In many cases, the Vertica data will fit in memory. This is a huge performance boost (and one that another vendor, ParAccel, takes advantage of).

In the end, the benchmark may be comparing the in-memory performance of a database to general performance for MapReduce. The benchmark may not be including the ETL time for loading the data, partitioning data, and building indexes. The benchmark may not have allocated optimal numbers of map and reduce jobs for the purpose. And, it is possible that the benchmark is unbiased and relational databases really are better.

A paper that leaves out the affiliations between its authors and the vendors used for a benchmark is only going to invite suspicion.

Saturday, January 2, 2010

Hadoop and MapReduce: Normalizing Data Structures

To set out to learn Hadoop and Map/Reduce, I tackled several different problems. The last of these problems is the challenge of normalizing data, a concept from the world of relational databases. The earlier problems were adding sequential row numbers and characterizing values in the data.

This posting describes data normalization, explains how I accomplished it in Hadoop/MapReduce, and some tricks in the code. I should emphasize here that the code is really "demonstration" code, meaning that I have not worked hard on being sure that it always works. My purpose is to demonstrate the idea of using Hadoop to do normalization, rather than producing 100% working code.


What is Normalization and Why Do We Want To Do It?

Data normalization is the process of extracting values from a single column and placing them in a reference table. The data used by Hadoop is typically unnormalized, meaning that data used in processing is in a single record, so there is no need to join in reference tables. In fact, doing a join is not obvious using the MapReduce primitives, although my understanding is that Hive and Pig -- two higher level languages based on MapReduce -- do incorporate this functionality.

Why would we want to normalize data? (This is a good place to plug my book Data Analysis Using SQL and Excel, which explains this concept in more detail in the first chapter.) In the relational world, the reason is something called "relational integrity", meaning that any particular value is stored in one, and only one, place. For instance, if the state of California were to change its name, we would not want to update every record from California. Instead, we'd rather go to the reference table and just change the name to the new name, and the data field contains a state id rather than the state name itself. Relational integrity is particularly important when data is being updated.

Why would we want to normalize data used by Hadoop? There are two reasons. The first is that we may be using Hadoop processing to load a relational database -- one that is already designed with appropriate reference tables. This is entirely reasonable; relational databases are an attractive way to "publish" results from complex data processing, since they are better for creating end-user reports and building interactive GUI interfaces.

The second reason is performance. Extracting long strings and putting them in a separate reference table can significantly reduce the storage requirements for the data files. By far, most of the space taken up in typical log files, for instance, consists of long URIs (what I used to call URLs). When processing the log files, we might want to extract some features from the URIs, but keeping the entire string just occupies a lot of space -- even in a compressed file.


The Process of Normalizing Data

Normalizing data starts with data structures. The input records are assumed to be in a delimited format, with the column names in the first row (or provided separately, although I haven't tested that portion of the code yet). In addition, there is a "master" id file that contains the following columns:
  • id -- a unique id for every value by column.
  • column name -- the name of the column.
  • value -- the value in the column.
  • count -- the total number of times the value has so far occurred.
This is a rudimentary reference file. I could imagine, for instance, having more information than just the count as summary information -- perhaps the first and last date when the value occurs, for instance.

What happens when we normalize data? Basically, we look through the data file to find new values in each column being normalized. We append these new values into the master id file, and then go back to the original data and replace the values with the ids.

Hadoop is a good platform for this for several reasons. First, because the data is often stored as text files, the values and the ids have the same type -- text strings. This means that the file structures remain the same. Second, Hadoop can process multiple columns at the same time. Third, Hadoop can use inexpensive clusters and free software for this task, rather than relying on databases and tools, which are often more expensive.

How To Normalize Data Using Hadoop/MapReduce

The normalization process has six steps. Most of these correspond to a single Map-Reduce pass.

Step 1: Extract the column value pairs from the original data.

This step explodes the data, by creating a new data set with multiple rows for each row in the original data. Each output row contains a column, a value, and the number of times the value appears in the data. Only columns being normalized are included in the output.

This step also saves the column names for the data file in a temporary file. I'll return to why this is needed in Step 6.
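
The sketch below is illustrative only -- the real implementation is in Normalize.java, and the configuration names and field layout here are assumptions. It shows the shape of the Step 1 map: one output record per normalized column, keyed by column name and value, with a count of 1 that later gets summed.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExplodeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text outKey = new Text();
    private String[] columnNames;
    private final Set<Integer> normalizeCols = new HashSet<Integer>();

    @Override
    protected void setup(Context context) {
        // illustrative: both settings would come from the job configuration
        columnNames = context.getConfiguration().getStrings("example.column.names");
        for (String idx : context.getConfiguration().getStrings("example.normalize.columns", new String[0])) {
            normalizeCols.add(Integer.valueOf(idx));
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t", -1);
        for (int i = 0; i < fields.length && i < columnNames.length; i++) {
            if (normalizeCols.contains(i)) {
                outKey.set(columnNames[i] + "\t" + fields[i]);
                context.write(outKey, ONE);   // counts are summed in the reduce
            }
        }
    }
}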

Step 2: Extract column-value Pairs Not In Master ID File

This step compares the column-value pairs produced in the first step with those in the master id file. This step is interesting, because it reads data from two different data source formats -- the master id file and the results from Step 1. Both sets of data files use the GenericRecord format.

To identify the master file, the map function looks at the original data to see whether "/master" appears in the path. Alternative methods would be to look at the GenericRecord that is created or to use MultipleInputs (which I didn't use because of a warning on Cloudera's web site).
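
A hedged sketch of that path-based test, assuming the standard FileSplit is what comes back from the context (the real code may organize this differently):

import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Look at the path of the split being processed to tell the two inputs apart.
public abstract class TwoSourceMapper<KI, VI, KO, VO> extends Mapper<KI, VI, KO, VO> {
    protected boolean isMasterInput;

    @Override
    protected void setup(Context context) {
        FileSplit split = (FileSplit) context.getInputSplit();
        isMasterInput = split.getPath().toString().contains("/master");
    }
}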


Step 3: Calculate the Maximum ID for Each Column in the Master File

This is a very simple Map-Reduce step that simply gets the maximum id for each column. New ids that are assigned will be assigned one more than this value.

This is an instance where I would very much like to have two different reduces following a map step. If this were possible, then I could combine this step with Step 2.


Step 4: Calculate a New ID for the Unmatched Values

This is a two step process that follows the mechanism for adding row numbers discussed in one of my earlier posts, with one small modification. The final result has the maximum id value from Step 3 added onto it, so the result is a new id rather than just a row number.


Step 5: Merge the New Ids with the Existing Master IDs

This step merges the results from Step 4 with the existing master id file. Currently, the results are placed into another directory. Eventually, they could simply override the master id file.

Because of the structure of the Hadoop file system, the merge could be as simple as copying the file with the new ids into the appropriate master id data space. However, this would result in an unbalanced master id file, which is probably not desirable for longer term processing.


Step 6: Replace the Values in the Original Data with IDs

This final step replaces the values with ids -- the actual normalization step. This is a two part process. The map phase of the first part takes both the original data and the master key file. All the column value pairs are exploded from the original data, as in Step 1, with the output consisting of:
  • key: <column name>:<column value>
  • value: <"expect"|"nomaster">, <partition id and row number>, <column number>
The first part ("expect" or "nomaster") is an indicator of whether this column should be normalized (that is, whether or not to expect a master id). 第一部分(“期望”或“nomaster”)是一个指标,此栏是否应规格化(即,是否期望一个主人身份)。The second field identifies the original data record, which is uniquely identified by the partition id and row number within that partition. 第二场识别原始数据记录,这是唯一识别分割数字id和排在隔断。The third is the column number in the row.第三列数中列支。

The master records are placed in the format:主记录将被置于格式:
  • key: <column name>:<column value>
  • value: "master", <id>
The reduce then reads through all the records for a given column-value combination. If one of them is a master, then it outputs the id for all records. Otherwise, it outputs the original value.

The last phase simply puts the records back together again, from their exploded form. The one trick here is that the metadata is read from a local file.


Tricks Used In This Code

The code is available in these files: Normalize.java, GenericRecordInputFormat.java, GenericRecord.java, and GenericRecordMetadata.java. This code uses several tricks along the way.

One trick that I use in Step 4, for the phase 1 map, makes the code more efficient. This phase of the computation extracts the maximum row number for each column. Instead of passing all the row numbers to a combine or reduce function, it saves them in a local hash-map data structure. I then use the cleanup() routine in the map function to output the maximum values.
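
A sketch of that cleanup() trick (the record layout here is invented, so treat it as a pattern rather than the actual Normalize.java code): nothing is written during map(), and the per-column maxima come out only when the task finishes.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxIdMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Map<String, Long> maxByColumn = new HashMap<String, Long>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // assumed layout: id <tab> column name <tab> value <tab> count
        String[] fields = value.toString().split("\t");
        String column = fields[1];
        long id = Long.parseLong(fields[0]);
        Long current = maxByColumn.get(column);
        if (current == null || id > current) {
            maxByColumn.put(column, id);   // keep the running maximum locally
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit one record per column at the end of the map task
        for (Map.Entry<String, Long> e : maxByColumn.entrySet()) {
            context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
        }
    }
}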

Often the master code needs to pass variables to the map/reduce jobs. The best way to accomplish this is by using the "set" mechanism in the Configuration object. This allows variables to be assigned a string name. The names of all the variables that I use are stored in constants that start with PARAMETER_, defined at the beginning of the Normalize class.

In some cases, I need to pass arrays in, for instance, when passing in the list of columns that are to be normalized. In this case, one variable gives the number of values ("normalize.usecolumns.numvals"). Then each value is stored in a variable such as "normalize.usecolumns.0" and "normalize.usecolumns.1" and so on.
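
Reading the "array" back out follows directly from that naming convention; here is a small sketch (the helper class is mine, not part of Normalize.java):

import org.apache.hadoop.conf.Configuration;

public final class ColumnParams {
    // rebuild the list of columns from the count plus one entry per index
    public static String[] getConfiguredColumns(Configuration conf) {
        int numvals = conf.getInt("normalize.usecolumns.numvals", 0);
        String[] columns = new String[numvals];
        for (int i = 0; i < numvals; i++) {
            columns[i] = conf.get("normalize.usecolumns." + i);
        }
        return columns;
    }
}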

Some of the important processing actually takes place in the master loop, where results are gathered and then passed to subsequent steps using this environment mechanism.

The idea behind the GenericRecord class is pretty powerful, with the column names at the top of the file. GenericRecords make it possible to read multiple types of input in the same map class, for instance, which is critical functionality for combining data from two different input streams.

However, the Map-Reduce framework does not really recognize these column names as being different, once generic records are placed in a sequence file. The metadata has to be passed somehow.

When the code itself generates the metadata, this is simple enough. A function is used to create the metadata, and this function is used in both the map and reduce phases.

A bigger problem arises with the original data. In particular, Step 6 of the above framework re-creates the original records, but it has lost the column names, which poses a conundrum. The solution is to save the original metadata in Step 1, which first reads the records. This metadata is then passed into Step 6.

In this code, this is handled by simply using a file. The first map partition of Step 1 writes this file (this partition is used to guarantee that the file is written exactly once). The last reduce in Step 6 then reads this file.

This mechanism works, but is not actually the preferred mechanism, because all the reduce tasks in Step 6 are competing to read the same file -- a bottleneck.

A better mechanism is for the master program to read the file and to place the contents in variables in the jar file passed to the map reduce tasks. Although I do this for other variables, I don't bother to do this for the file.

Posted by Gordon S. Linoff at 9:39 AM

Sunday, December 27, 2009

Hadoop and MapReduce: Characterizing Data

This posting describes using Hadoop and MapReduce to characterize data -- that is, to summarize the values in various columns to learn about the values in each column.

This post describes how to solve this problem using Hadoop. It also explains why Hadoop is better for this particular problem than SQL.

The code discussed in this post is available in these files: GenericRecordMetadata.java, GenericRecord.java, GenericRecordInputFormat.java, and Characterize.java. This work builds on the classes introduced in my previous post Hadoop and MapReduce: Method for Reading and Writing General Record Structures (the versions here fix some bugs in the earlier versions).

What Does This Code Do?

The purpose of this code is to provide summaries for data in a data file. Being Hadoop, the data is stored in a delimited text format, with one record per line, and the code uses GenericRecord to handle the specific data. The generic record classes are things that I wrote to handle this situation; the Apache java libraries apparently have other approaches to solving this problem.

The specific summaries for each column are:
  • Number of records.
  • Number of values.
  • Minimum and maximum values for string variables, along with the number of times the minimum and maximum values appear in the data.
  • Minimum and maximum lengths for string variables, along with the number of times these appear and an example of the value.
  • First, second, and third most common string values.
  • Number of times the column appears to be an integer.
  • Minimum and maximum values when treating the values as integers, along with the number of times that these appear.
  • Number of times the column appears to contain a real number.
  • Minimum and maximum values when treating the values as doubles, along with the number of times that these appear.
  • Count of negative, zero, and positive values.
  • Average value.
These summaries are arbitrary. The code should be readily extensible to other types and other summaries.

My ultimate intention is to use this code to easily characterize input and result files that I create in the process of writing Hadoop code.


Overview of the Code

The characterize problem is solved in two steps. The first creates a histogram of all the values in all the columns, and the second summarizes the histogram of values, which is handled by two passes of map reduce.

The histogram step takes files with the following format:
  • Key: undetermined
  • Values: text values separated by a delimiter (by default a tab)
(This is the GenericRecord format.)
The Map phase produces a file of the format:
  • Key: column name and column value, separated by a colon
  • Value: "1"
Combine and Reduce then add up the "1"s, producing a file of the format:
  • Key: column name
  • Value: column value separated by tab
Using a tab as a separator is a convenience, because this is also the default separator for the key.
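
Here is a sketch of the summing side of this pass, using LongWritable counts rather than the literal "1" strings described above; the real Characterize.java may differ in the details.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Add up the counts for each "<column name>:<column value>" key from the map.
public class HistogramReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        for (LongWritable v : values) {
            count += v.get();
        }
        context.write(key, new LongWritable(count));
    }
}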

The second phase of the Map/Reduce job takes the previous output and uses the reduce function to summarize all the different values in the histogram. This code is quite specific to the particular summaries. The GenericRecord format is quite useful because I can simply add new summaries in the code, without worrying about the layout of the records.

The code makes use of exception processing to handle particular data types. For instance, the following code block handles the integer summaries:

try {
    // try to interpret the value as an integer (actually a long)
    long intval = Long.parseLong(valstr);
    hasinteger = true;
    intvaluecount++;
    intrecordcount += Long.parseLong(val.get("count"));
}
catch (NumberFormatException e) {
    // not an integer -- we don't have to do anything here
}

This block tries to convert the value to an integer (actually to a long). When this works, then the code updates the various variables that characterize integer values. When this fails, the code continues working.

There is a similar block for real numbers, and I could imagine adding more such blocks for other formats, such as dates and times.

Why MapReduce Is Better Than SQL For This Task

Characterizing data is the process of summarizing data along each column, to get an idea of what is in the data. Normally, I think about data processing in terms of SQL (after all, my most recent book is Data Analysis Using SQL and Excel). SQL, however, is particularly poor for this purpose.

First, SQL has precious few functions for this task -- basically MIN(), MAX(), AVG() and judicious use of the CASE statement. Second, SQL generally has lousy support for string functions and inconsistent definitions for date and time functions across different databases.

Worse, though, is that traditional SQL can only summarize one column at a time. The traditional SQL approach would be to summarize each column individually in a query and then connect them using UNION ALL statements. The result is that the database has to do a full-table scan for each column.

Although not supported in all databases, SQL syntax does now support the GROUPING SETS keyword which helps potentially alleviate this problem. However, GROUPING SETS is messy, since the key columns each have to be in separate columns. That is, I want the results in the format "column name, column value". With GROUPING SETS, I get "column1, column2, ..., columnN", with NULLs for all unused columns, except for the one with a value.

The final problem with SQL occurs when the data starts out in text files. Much of the problem of characterizing and understanding the data happens outside the database during the load process.
Posted by Gordon S. Linoff at 12:48 PM

Friday, December 18, 2009

Hadoop and MapReduce: Method for Reading and Writing General Record Structures

I'm finally getting more comfortable with Hadoop and java, and I've decided to write a program that will characterize data in parallel files.

To be honest, I find that I am spending a lot of time writing new Writable and InputFormat classes, every time I want to do something. Every time I introduce a new data structure used by the Hadoop framework, I have to define two classes. Yucch!

So, I put together a simple class called GenericRecord that can store a set of column names (as strings) and a corresponding set of column values (as strings). These are stored in delimited files, and the various classes understand how to parse these files. In particular, the code can read any tab delimited file that has column names on the first row (and changing the delimiter should be easy). One nice aspect is the ability to use the GenericRecord as the output of a reduce function, which means that the number and names of the output can be specified in the code -- rather than in additional files with additional classes.

I wouldn't be surprised if similar code already exists with more functionality than the code I have here. This effort is also about my learning Hadoop.

This posting provides the code and explains important features on how it works. The code is available in these files: GenericRecord.java, GenericRecordMetadata.java, GenericRecordInputFormat.java, and GenericRecordTester.java.

What This Code Does

This code is analogous to the word count code, that must be familiar to anyone starting to learn MapReduce (since it seems to be the first example in all the documentation I've seen). Instead of counting words, this code counts the occurrence of values in the columns.

The code reads input files and produces output records with three columns:
  • A column name in the original data.
  • A value in the column.
  • The number of times the value appears.
Do note that for data with many unique values in many columns, the number of output records is likely to far exceed the number of input records. So, the output file can be bigger than the input file.

The input records are assumed to be in a text file with one record per row. The first row contains the names of the columns, delimited by a tab (although this could easily be changed to another delimiter). The rest of the rows contain values. Note that this assumes that the input files are all read from the beginning; that is, that a single input file is not split among multiple map tasks.

One irony of this code and the Hadoop framework is that the input files do not have to be in the same format. So, I could upload a bunch of different files, with different numbers of columns, and different column names, and run them all in parallel. I would have to be careful that the column names are all different, for this to work well.

Examples of such files are available on the companion page for my book Data Analysis Using SQL and Excel. These are small files by the standards of Hadoop (measured in megabytes) but quite sufficient for testing and demonstrating code.


Overview of Approach

There are four classes defined for this code:
  • GenericRecordMetadata stores the metadata (column names) for a record.
  • GenericRecord stores the values for a particular record.
  • GenericRecordInputFormat provides the interface for reading the data into Hadoop.
  • GenericRecordTester provides the functions for the MapReduce framework.
The metadata consists of the names of the columns, which can be accessed either by a column index or by a column name. The metadata has functions to translate a column name into a column index. Because it uses a HashMap, the functions should run quite fast, although they are not optimal in memory space. This is okay, because the metadata is stored only once, rather than once per row.

The generic record itself stores the data as an array of strings. It also contains a pointer to the metadata object, in order to fetch the names. The array of strings minimizes both memory overhead and time, but does require access using an integer. The other two classes are needed for the Hadoop framework.
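
A stripped-down sketch of the metadata idea (not the actual GenericRecordMetadata class): the column names are stored once, and a HashMap turns a name into the index used against the value array.

import java.util.HashMap;
import java.util.Map;

public class SimpleRecordMetadata {
    private final String[] columnNames;
    private final Map<String, Integer> nameToIndex = new HashMap<String, Integer>();

    public SimpleRecordMetadata(String[] columnNames) {
        this.columnNames = columnNames;
        for (int i = 0; i < columnNames.length; i++) {
            nameToIndex.put(columnNames[i], i);
        }
    }

    public String getName(int index) { return columnNames[index]; }
    public int getIndex(String name) { return nameToIndex.get(name); }
    public int getNumColumns()       { return columnNames.length; }
}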

One small challenge is getting this to work without repeating the metadata information for each row of data. This is handled by including the column names as the first row in any file created by the Hadoop framework, and not by putting the column names in the output for each row.


Setting Up The Metadata When Reading

The class GenericRecordInputFormat basically does all of its work in a private class called GenericRecordRecordReader. This class has two important functions: initialize() and nextKeyValue().

The initialize() function sets up the metadata, either by reading environment variables in the context object or by parsing the first line of the input file (depending on whether or not the environment variable genericrecord.numcolumns is defined). I haven't tested passing in the metadata using environment variables, because setting up the environment variables poses a challenge. These variables have to be set in the master routine in the configuration before the map function is called.

The nextKeyValue() function reads a line of the text file, parses it using the function split(), and sets the values in the line. The verification that the number of items read matches the number of expected items is handled in the function lineValue.set(), which raises an exception (currently unhandled) when there is a mismatch.
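
The parse-and-verify idea can be illustrated with a tiny standalone helper (the real check lives inside the GenericRecord set() method, not in a class like this):

public final class DelimitedLine {
    // split a tab-delimited line and complain if the field count is wrong
    public static String[] parse(String line, int expectedColumns) {
        String[] values = line.split("\t", -1);   // -1 keeps trailing empty fields
        if (values.length != expectedColumns) {
            throw new IllegalArgumentException(
                "expected " + expectedColumns + " fields but found " + values.length);
        }
        return values;
    }
}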


Setting Up The Metadata When Writing

Perhaps more interesting is the ability to set up the metadata dynamically when writing. This is handled mostly in the setup() function of the SplitReduce class, which sets up the metadata using various function calls.

Writing the column names out at the beginning of the results file uses a couple of tricks. First, this does not happen in the setup() function but rather in the reduce() function itself, for the simple reason that the latter handles IOException.

The second trick is that the metadata is written out by putting it into the values of a GenericRecord. This works because the values are all strings, and the record itself does not care if these are actually for the column names.

The third trick is to be very careful with the function GenericRecord.toString(). Each column is separated by a tab character, because the tab is used to separate the key from the value in the Hadoop framework. In the reduce output files, the key appears first (the name of the column in the original data), followed by a tab -- as put there by the Hadoop framework. Then, toString() adds the values separated by tabs. The result is a tab-delimited file that looks like column names and values, although the particular pieces are put there through different mechanisms. I imagine that there is a way to tell Hadoop to use a different character to separate the key and value, but I haven't researched this point.

The final trick is to be careful about the ordering of the columns. The code iterates through the values of the GenericRecord table manually using an index rather than a for-in loop. This is quite intentional, because it allows the code to control the order in which the columns appear -- which is presumably the original order in which they were defined. Using the for-in is also perfectly valid, but the columns may appear in a different order (which is fine, because the column names also appear in the same order).

The result of all this machinery is that the reduce function can now return values in a GenericRecord. And, I can specify these in the reduce function itself, without having to mess around with other classes. This is likely to be a big benefit as I attempt to develop more code using Hadoop.
Posted by Gordon S. Linoff at 6:24 PM

Tuesday, December 15, 2009

Hadoop 0.20: Creating Types

In various earlier posts, I wrote code to read and write zip code data (which happens to be part of the companion page to my book Data Analysis Using SQL and Excel). This provides sample data for use in my learning Hadoop and mapreduce.

Originally, I wrote the code using Hadoop 0.18, because I was using the Yahoo virtual machine. I have since switched to the Cloudera virtual machine, which runs the most recent version of Hadoop, V0.20.

I thought switching my code would be easy. The issue is less the difficulty of the switch than some nuances in Hadoop and java. This post explains some of the differences between the two versions, when adding a new type into the system. I explained my experience with the map, reduce, and job interface in another post.

The structure of the code is simple. I have a java file that implements a class called ZipCode, which contains the ZipCode interface with the Writable interface (which I include using import org.apache.hadoop.io.*). Another class called ZipCodeInputFormat implements the read/writable version so ZipCode can be used as input and output in MapReduce functions. The input format class uses another, private class called ZipCodeRecordReader, which does all the work. Because of the rules of java, these need to be in two different files, which have the same name as the class. The files are available in ZipCensus.java and ZipCensusInputFormat.java.

These files now use the Apache mapreduce interface rather than the mapred interface, so I must import the right packages into the java code:

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.InputSplit;


And then I had a problem when defining the ZipCodeInputFormat class using the code:

public class ZipCensusInputFormat extends FileInputFormat
{
    public RecordReader createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
        return new ZipCensusRecordReader();
    } // RecordReader

} // class ZipCensusInputFormat


The specific error given by Eclipse/Ganymede is: "The type org.apache.commons.logging.Log cannot be resolved. It is indirectly referenced from required .class files." This is a bug in Eclipse/Ganymede, because the code compiles and runs using javac/jar. At one point, I fixed this by including various Apache commons jars. However, since I didn't need them when compiling manually, I removed them from the Eclipse project.

The interface for the RecordReader class itself has changed. The definition for the class now looks like:

class ZipCensusRecordReader extends RecordReader

Previously, this used the syntax "implements" rather than "extends". For those familiar with java, this is the difference between an interface and an abstract class, a nuance I don't yet fully appreciate.

The new interface (no pun intended) includes two new functions, initialize() and cleanup(). I like this change, because it follows the same convention used for map and reduce classes.

As a result, I changed the constructor to take no arguments. This has moved to initialize(), which takes two arguments of type InputSplit and TaskAttemptContext. The purpose of this code is simply to skip the first line of the data file, which contains column names.

The most important function for the class is now called nextKeyValue() rather than next(). The new function takes no arguments, putting the results in local private variables accessed using getCurrentKey() and getCurrentValue(). The function next() took two arguments, one for the key and one for the value, although the results could be accessed using the same two functions.
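
To show the shape of the new interface, here is a skeletal reader of my own that simply delegates to the standard LineRecordReader and skips a header row; it is a generic illustration, not the ZipCensus code itself.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// No-argument construction, initialize(), nextKeyValue(), and
// getCurrentKey()/getCurrentValue() -- the 0.20 RecordReader contract.
public class SkipHeaderRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);
        // only the split that starts at byte 0 contains the column-name row
        if (((FileSplit) split).getStart() == 0) {
            delegate.nextKeyValue();
        }
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return delegate.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

    @Override
    public Text getCurrentValue() { return delegate.getCurrentValue(); }

    @Override
    public float getProgress() throws IOException { return delegate.getProgress(); }

    @Override
    public void close() throws IOException { delegate.close(); }
}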

Overall the changes are simple modifications to the interface, but they can be tricky for the new user. I did not find a simple explanation for the changes anywhere on the web; perhaps this posting will help someone else.
Posted by Gordon S. Linoff at 3:07 PM

Saturday, December 5, 2009

Hadoop and MapReduce: What Country is an IP Address in?

I have started using Hadoop to sessionize web log data. It has surprised me that there is not more written on this subject on the web, since I thought this was one of the more prevalent uses of Hadoop. Because I'm doing this work for a client, using Amazon EC2, I do not have sample web log data files to share.

One of the things that I want to do in the sessionization code is to include what country the user is in. Typically, the only source of location information in such logs is the IP address used for connecting to the internet. How can I look up the country the IP address is in?

This posting describes three things: the source of the IP geography information, new things that I'm learning about java, and how to do the lookup in Hadoop.


The Source of IP Geolocation Information

MaxMind is a company that specializes in geolocation data. I have no connection to MaxMind, other than a recommendation to use their software from someone at the client where I have been doing this work. There may be other companies with similar products.

One way they make money is by offering a product called GeoIP Country, which has very, very accurate information about the country where an IP is located (they also offer more detailed geographies, such as regions, states, and cities, but country is sufficient for my purposes). Their claim is that GeoIP Country is 99.8% accurate.

Although the paid version is quite reasonably priced, I am content to settle for the free version, called GeoLite Country, for which the claimed accuracy is 99.5%.

These products come in two parts. The first part is an interface, which is available for many languages, with the java version here. I assume the most recent version is the best, although I happen to be using an older version.

Both the free and paid versions use the same interface, which is highly convenient, in case I want to switch between them. The difference is the database, which is available from this download page. The paid version has more complete coverage and is updated more frequently.

The interface consists of two important components:
  • Creating a LookupService object, which is instantiated with an argument that names the database file.
  • Using LookupService.getCountry() to do the lookup.
Simple enough interface; how do we get it to work in java, and in particular, in java for Hadoop?
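Before worrying about Hadoop, here is a minimal standalone sketch of those two steps, assuming the legacy com.maxmind.geoip package; the database file name and the sample IP address are placeholders, not values from my project:

import java.io.IOException;

import com.maxmind.geoip.LookupService;

public class CountryLookupExample {
    public static void main(String[] args) throws IOException {
        // Step 1: create a LookupService, naming the database file.
        LookupService lookup = new LookupService("GeoIP.dat",
                LookupService.GEOIP_MEMORY_CACHE);

        // Step 2: look up the country for an IP address.
        System.out.println(lookup.getCountry("8.8.8.8").getCode());

        lookup.close();
    }
}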


New Things I've Learned About Java

As I mentioned a few weeks ago in my first post on learning Hadoop, I had never used java prior to this endeavor (although I am familiar with other object-oriented programming languages such as C++ and C#). I have been learning java on an "as needed" basis, which is perhaps not the most efficient way overall but has been the fastest way to get started.

When programming in java, there are two steps. I am using the javac command to compile code into class files. Then I'm using the jar command to create a jar file. I have been considering this the equivalent of "compiling and linking code", which also takes two steps.

However, the jar file is much more versatile than a regular executable image. In particular, I can put any files there. These files are then available in my application, although java calls them "resources" instead of "files". This will be very important in getting MaxMind's software to work with Hadoop. I can include the IP database in my application jar file, which is pretty cool.
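As a hedged sketch of what that looks like in practice: the resource path below is an assumption about where the database might sit inside the jar, and because LookupService wants an actual file name, the resource is copied out to a temporary file first.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ResourceExtractor {
    // Copies a file that was packaged into the jar ("a resource") out to a
    // temporary file, so that code expecting a file name can use it.
    public static File extractResource(String resourcePath) throws IOException {
        InputStream in = ResourceExtractor.class.getClassLoader()
                .getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IOException("resource not found in jar: " + resourcePath);
        }
        File tmp = File.createTempFile("geoip", ".dat");
        tmp.deleteOnExit();
        OutputStream out = new FileOutputStream(tmp);
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) > 0) {
            out.write(buffer, 0, n);
        }
        out.close();
        in.close();
        return tmp;
    }
}

Calling extractResource() with whatever path you used when building the jar then hands back a File whose name can be passed to the LookupService constructor.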

There is a little complexity, though, which involves the paths where things are located. When using hadoop, I have been using statements such as "org.apache.hadoop.mapreduce" without really understanding them. This statement brings in classes associated with the mapreduce package, because three things have happened:
  • The original work (at apache) was done in a directory structure that included ./org/apache/hadoop/mapreduce/.
  • The jar file was created in that (higher-level) directory. Note that this could be buried deep down in the directory hierarchy. Everything is relative to the directory where the jar file is created.
  • I am including that jar file explicitly in my javac command, using the -cp argument, which specifies a class path.
All of this worked without my having to understand it, because I had some examples of working code. The MaxMind code then poses a new problem. This is the first time that I have to get someone else's code to work. How do we do this?

First, after you uncompress their java code, copy the com directory to the place where you create your java jar file. Actually, you could just link the directories. Or, if you know what you are doing, then you may have another solution.

Next, for compiling the files, I modified the javac command line so it reads: "javac -cp .:/opt/hadoop/hadoop-0.20.1-core.jar:com/maxmind/geoip -d bin/ [subdirectory]/*.java" [note: there is a space after "bin/"].
  • Create the jar: "cd bin; jar cvf ../RowNumberTwoPass.jar */*; cd ..".
  • Run the code using the command: "hadoop jar RowNumberTwoPass.jar RowNumberTwoPass.RowNumberTwoPass". The first argument after "hadoop jar" is the jar file with the code. The second is the class and package where the main() function is located.
Although this seems a little bit complicated, it is only cumbersome the first time you run it. After that, you have the commands and running them again is simple.
Posted by Gordon S. Linoff at 9:49 PM  1 comments  Links to this post

Wednesday, November 25, 2009

Hadoop and MapReduce: A Parallel Program to Assign Row Numbers

This post discusses (and solves) the problem of assigning consecutive row numbers to data, with no holes. Along the way, it also introduces some key aspects of the Hadoop framework:
  • Using the FileSystem package to access HDFS (a much better approach than in my previous posting).
  • Reading configuration parameters in the Map function.
  • Passing parameters from the main program to the Map and Reduce functions.
  • Writing out intermediate results from the Map function.
These are all important pieces of functionality for using the Hadoop framework. In addition, I plan on using this technique for assigning unique ids to values in various columns.


The "Typical" Approach“典型的”方法

The "typical" approach is to serialize the problem, by creating a Reducer function that adds the row number. “典型的”方法是同步问题,创造了一个减速机功能,增加排号码。By limiting the framework to only a single reducer (using setNumReduceTasks(1) in the JobConf class), this outputs the row number.该框架通过限制只有单一的减速器(使用setNumReduceTasks(1)在JobConf类),该输出行数。

There are several problems with this solution. The biggest issue is, perhaps, aesthetic. Shouldn't a parallel framework, such as Hadoop, be able to solve such a simple problem? Enforced serialization is highly inefficient, since the value of Hadoop is in the parallel programming capabilities enabled when multiple copies of maps and reduces are running.

Another issue is the output file. Without some manual coding, the output is a single file, which may perhaps be local to a single cluster node (depending on how the file system is configured). This can slow down subsequent map reduce tasks that use the file.


An Alternative Fully Parallel Approach


There is a better way, a fully parallel approach that uses two passes through the Map-Reduce framework. Actually, the full data is only passed once through the framework, so this is a much more efficient alternative to the first approach.

Let me describe the approach using three passes through the data, since this makes for a simpler explanation (the actual implementation combines the first two steps).

The first pass through the data consists of a Map phase that assigns a new key to each row, and no Reduce phase. The key consists of two parts: the partition id and the row number within the partition.

The second pass counts the number of rows in each partition, by extracting the maximum row number for each partition key.

These counts are then combined to get cumulative sums of counts up to each partition. Although I could do this in the reduce step, I choose not to (which I'll explain below). Instead, I do the work in the main program.

The third pass adds the offset to the row number and outputs the results. Note that the number of map tasks in the first pass can be different from the number in subsequent passes, since the code always uses the original partition number for its calculations.


More Detail on the Approach -- Pass 1

The code is available in this file: RowNumberTwoPass.java. It contains one class with two Map phases and one Reduce phase. This code assumes that the data is stored in a text file. This assumption simplifies the code, because I do not have to introduce any auxiliary classes to read the data. However, the same technique would work for any data format.

The first map phase, NewKeyOutputMap, does two things. The simpler thing is to output the partition id and the row number within the partition for use in subsequent processing. The second is to save a copy of the data, with this key, for the second pass.

Assigning the Partition ID

How does any Map function figure out its partition id? The partition id is stored in the job configuration, and is accessed using the code:

    partitionid = conf.getInt("mapred.task.partition", 0);

In the version of Hadoop that I'm using (0.18.3, through the Yahoo virtual machine), the job configuration is only visible to the configure() function. This is an optional function that can be defined when extending the MapReduceBase class. It gets called once to initialize the environment. The configure() function takes one argument, the job configuration. I just store the result in a static variable local to the NewOutputKeyMap class.

In more recent versions of Hadoop, the configuration is available in the context argument to the map function.
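Here is a hedged sketch of that pattern under the 0.18-style mapred API, combining the configure() call with the partition-id-plus-row-number key described earlier; the class name and the tab-separated key format are my own illustration, not necessarily what the actual class in RowNumberTwoPass.java does:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PartitionKeyMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    // Set once per map task in configure() and reused by every map() call.
    private static int partitionid = 0;
    private long rowinpartition = 0;

    @Override
    public void configure(JobConf conf) {
        // The job configuration is only visible here, not in map() itself
        // (in Hadoop 0.18); newer versions expose it through the context.
        partitionid = conf.getInt("mapred.task.partition", 0);
    }

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        rowinpartition++;
        output.collect(new Text(partitionid + "\t" + rowinpartition), value);
    }
}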

Using Sequence Files in the Map Phase

The second task is to save the original rows with the new key values. For this, I need a sequence file. Or, more specifically, I need a different sequence file for each Map task. Incorporating the partition id into the file name accomplishes this.

Sequence files are data stores specific to the Hadoop framework, which contain key-value pairs. At first, I found them a bit confusing: did the term "sequence file" refer to a collection of files available to all map tasks or to a single instance of one of these files? In fact, the term refers to a single instance file. To continue processing, we will actually need a collection of sequence files, rather than a single "sequence file".

They are almost as simple to use as any other files, as the following code in the configure() function shows:

    FileSystem fs = FileSystem.get(conf);
    sfw = SequenceFile.createWriter(fs, conf,
        new Path(saverecordsdir + "/" + String.format("records%05d", partitionid.get())),
        Text.class, Text.class);

The first statement simply retrieves the appropriate file system for creating the file. The second statement uses the SequenceFile.createWriter() function to open the file and save the handle in the sfw variable. There are several versions of this function, with various additional options. I've chosen the simplest version. The specific file will go in the directory referred to by the variable saverecordsdir. This will contain a series of files with the names "records#####", where ##### is a five-digit, left-padded number.

This is all enclosed in try-catch logic to catch appropriate exceptions.

Later in the code, the map() writes to the sequence file using the logic:

    sfw.append(outkey, value);

Very simple!
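As an aside, a later pass (or a quick debugging program) can read one of these files back directly with SequenceFile.Reader. The sketch below is my own illustration -- the path is a placeholder, and the actual code may read the saved records through an input format instead:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDump {
    // Prints the key-value pairs stored in one sequence file; the Text/Text
    // types match the writer above.
    public static void dump(Configuration conf, String filename) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(filename), conf);
        Text key = new Text();
        Text value = new Text();
        // next() returns false when the file is exhausted.
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
        reader.close();
    }
}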

Pass 1: Reduce and Combine Functions

The purpose of the reduce function is to count the number of rows in each partition. Instead of counting, the function actually takes the maximum of the row numbers within each partition. By taking this approach, I can use the same function for both reducing and combining.

For efficiency purposes, the combine phase is very important to this operation. The way the problem is structured, the combine output should be a single record for each map instance -- and sending this data around for the reduce phase should incur very little overhead.
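Here is a hedged sketch of such a reduce/combine class, assuming the partition id is the key and the row number within the partition is the value (the actual key and value layout in RowNumberTwoPass.java may differ):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxRowNumberReduce extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
    // key = partition id, values = row numbers within that partition;
    // the maximum row number is also the row count for the partition.
    public void reduce(Text key, Iterator<LongWritable> values,
            OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        long max = 0;
        while (values.hasNext()) {
            long v = values.next().get();
            if (v > max) {
                max = v;
            }
        }
        output.collect(key, new LongWritable(max));
    }
}

Registering the same class through both setReducerClass() and setCombinerClass() on the JobConf is what lets the combine step collapse each map task's output down to a single record.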


More Detail on the Approach -- Offset Calculation and Pass 2

At the end of the first phase, the summary key result files contain a single row for each partition, giving the number of rows in that partition. For instance, from my small test data, the data looks like:


0    2265
1    2236
2    3

The first column is the partition id, the second is the count. The offset is the cumulative sum of previous values. So, I want this to be:


0    2265    0
1    2236    2265
2    3       4501

To accomplish this, I read the data in the main loop, after running the first job. The following loop in main() gets the results, does the calculation, and saves the results as parameters in the job configuration:

    int numvals = 0;
    long cumsum = 0;
    FileStatus[] files = fs.globStatus(new Path(keysummaryoutput + "/p*"));
    for (FileStatus fstat : files) {
        FSDataInputStream fsdis = fs.open(fstat.getPath());
        String line = "";
        while ((line = fsdis.readLine()) != null) {
            finalconf.set(PARAMETER_cumsum_nthvalue + numvals++, line + "\t" + cumsum);
            String[] vals = line.split("\t");
            cumsum += Long.parseLong(vals[1]);
        }
    }
    finalconf.setInt(PARAMETER_cumsum_numvals, numvals);

Perhaps the most interesting part of this code is the use of the function fs.globStatus() to get a list of HDFS files that match wildcards (in this case, anything that starts with "p" in the keysummaryoutput directory).


Conclusion

Parallel Map-Reduce is a powerful programming paradigm that makes it possible to solve many different types of problems using parallel dataflow constructs.

Some problems seem, at first sight, to be inherently serial. Appending a sequential row number onto each row is one of those problems. After all, don't you have to process the previous row to get the number for the next row? And isn't this a hallmark of inherently serial problems?

The answers to these questions are "no" and "not always". The algorithm described here should scale to very large data sizes and very large machine sizes. For large volumes of data, it is much, much more efficient than the serial version, since all processing is in parallel. That is almost true. The only "serial" part of the algorithm is the calculation of the offsets between the passes. However, this is such a small amount of data, relative to the overall data, that its effect on overall efficiency is negligible.

The offsets are passed into the second pass using the JobConfiguration structure. There are other ways of passing this data. One method would be to use the distributed data cache. However, I have not learned how to use this yet.

Another distribution method would be to do the calculations in the first pass reduce phase (by using only one reducer in this phase). The results would be in a file. This file could then be read by subsequent map tasks to extract the offset data. However, such an approach introduces a lot of contention, because suddenly there will be a host of tasks all trying to open the same file -- contention that can slow processing considerably.
Posted by Gordon S. Linoff at 3:18 PM  0 comments  Links to this post

Saturday, November 21, 2009

Hadoop and MapReduce: Controlling the Hadoop File System from the MapReduce Program

[The first comment explains that Hadoop really does have a supported interface to the hdfs file system, through the FileSystem package ("import org.apache.hadoop.fs.FileSystem"). Yeah! I knew such an interface should exist -- and even stumbled across it myself after this post. Unfortunately, there is not as simple an interface for the "cat" operation, but you can't have everything.]
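To make that comment concrete, here is a minimal sketch of doing the two operations this post wants -- removing the output directory and printing the results -- through the org.apache.hadoop.fs.FileSystem API. The output path is a placeholder, and this is an alternative to the plink-based workaround described below, not part of it:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelper {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path("wordcount-output");

        // The equivalent of "hadoop fs -rmr wordcount-output".
        if (fs.exists(output)) {
            fs.delete(output, true);   // true = recursive
        }

        // The equivalent of "hadoop fs -cat wordcount-output/part-*".
        FileStatus[] parts = fs.globStatus(new Path(output, "part-*"));
        if (parts != null) {
            for (FileStatus stat : parts) {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(stat.getPath())));
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
                in.close();
            }
        }
    }
}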

In my previous post, I explained some of the challenges in getting a Hadoop environment up and running. Since then, I have succeeded in using Hadoop both on my home machine and on Amazon EC2.

In my opinion, one of the major shortcomings of the programming framework is the lack of access to the HDFS file system from MapReduce programs. More concretely, if you have attempted to run the WordCount program, you may have noticed that you can run it once without a problem. The second time you get an error saying that the output files already exist.

What do you do? You go over to the machine running HDFS -- which may or may not be your development machine -- and you delete the files using the "hadoop fs -rmr" command. Can't java do this?

You may also have noticed that you cannot see the output. Files get created, somewhere. What fun. To see them, you need to use the "hadoop fs -cat" command. Can't java do this?

Why can't we create a simple WordCount program that can be run multiple times in a row, without error, and that prints out the results? And, to further this question, I want to do all the work in java. I don't want to work with an additional scripting language, since I already feel that I've downloaded way too many tools on my machine to get all this to work.

By the way, I feel that both of these are very, very reasonable requests, and the hadoop framework should support them. It does not. For those who debate whether hadoop is better or worse than parallel databases, recognize that the master process in parallel databases typically supports functionality similar to what I'm asking for here.

Why is this not easy? Java, Hadoop, and the operating systems seem to conspire to prevent this. But I like a challenge. This posting, which will be rather long, is going to explain my solution. Hey, I'll even include some code so other people don't have to suffer through the effort.

I want to do this on the configuration I'm running from home. This configuration consists of:
  • Windows Vista, running Eclipse
  • Ubuntu Linux virtual machine, courtesy of Yahoo!, running Hadoop 0.18
However, I also want the method to be general and work regardless of platform. So, I want it to work if I write the code directly on my virtual machine, or if I write the code on Amazon EC2. Or, if I decide to use Karmasphere instead of Eclipse to write the code, or if I just write the code in a Java IDE. In all honesty, I've only gotten the system to work on my particular configuration, but I think it would not be difficult to get it to work on Unix.


Overview of Solution


The overview of the solution is simple enough. I am going to do the following:
  • Create a command file called "myhadoop.bat" that I can call from java.
  • Write a class in java that will run this bat file with the arguments to do what I want.
Boy, that's simple. NOT!

Here is a sample of the problems:
  • Java has to call the batch file without any path. This is because Windows uses the backslash to separate directories whereas Unix uses forward slashes. I lose platform independence if I use full paths.
  • The batch file has to connect to a remote machine. Windows Vista does not have a command to do this. Unix uses the command "rsh".
  • The java method for executing commands (Runtime.getRuntime().exec()) does not execute batch files easily.
  • The java method for executing commands hangs after a few lines are output. And, the lines could be in either the standard output stream (stdout) or the error output stream (stderr), and it is not obvious how to read both of them at the same time.
This post is going to resolve these problems, step by step.


What You Need

To get started, you need to do a few things to your computer so everything will work.

First, install the program PuTTY (from here). Actually, choose the option for "A Windows installer for everything except PuTTYtel". You can accept all the defaults. As far as I know, this runs on all versions of Windows.

Next, you need to change the system path so it can find two things by default:
  • The PuTTY programs.
  • The batch file you are going to write.
The system path variable specifies where the operating system looks for executable files, when you have a command prompt, or when you execute a command from java.

Decide on the directory where you want the batch file. I chose "c:\users\gordon".

To change the system path, go to the "My Computer" or "Computer" icon on your desktop and right click to get "Properties", then choose "Advanced System Settings". Click on the "Environment Variables" button. And scroll down to find "Path" in the variables. Edit the "Path" variable.

BE VERY CAREFUL NOT TO DELETE THE PREVIOUS VALUES IN THE PATH VARIABLE!!! ONLY ADD ONTO THEM!!!

At the end of the path variable, I appended the following (without the double quotes): ";c:\Program Files (x86)\PuTTY\;c:\users\gordon". The part after the second semicolon should be where you want to put your batch file. The first part is where the putty commands are located (which may vary on different versions of Windows).

Then, I found that I had to reboot my machine in order for Eclipse to know about the new path. I speculate that this is because there is a java program running somewhere that picks up the path when it starts, and this is where Eclipse gets the path. If I'm correct, all that needs to be done is to restart that program. Rebooting the machine was easier than tracking down a simpler solution.


Test the Newly Installed Software

The equivalent of rsh in this environment is called plink. To see if things work, you need the following:
  • IP address of the other machine. On a Unix system, you can find this using either "ipconfig" or "ifconfig". In my case, the IP address is 192.168.65.128. This is the address of the virtual machine, but this should work even if you are connecting to a real machine.
  • The user name to login as. In my case, this is "hadoop-user", which is provided by the virtual machine.
  • The password. In my case, this is "hadoop".
Here is a test command to see if you get to the right machine:
  • plink -ssh -pw hadoop hadoop-user@192.168.65.128 hostname
If this works by returning the name of the machine you are connecting to, then everything is working correctly. In my case, it returns "hadoop-desk".

Since we are going to be connecting to the hadoop file system, we might as well test that as well. I noticed that the expected command:
  • plink -ssh -pw hadoop hadoop-user@192.168.65.128 hadoop fs -ls
does not work. This is because the Unix environment is not initializing the environment properly, so it cannot find the command. On the Yahoo! virtual machine, the initializations are in the ".profile" file. So, the correct command is:
  • plink -ssh -pw hadoop hadoop-user@192.168.65.128 source .profile; hadoop fs -ls
Voila! That magically seems to work, indicating that we can, indeed, connect to another machine and run the hadoop commands.


Write the Batch File

I call the batch file "myhadoop.bat". This file contains the following line:

"c:\Program Files (x86)\PuTTY\plink.exe" -ssh -pw %3 %2@%1 source .profile; hadoop fs %4 %5 %6 %7 %8 %9

This file takes the following arguments in the following order:
  • host ip address (or hostname, if it can be resolved)
  • user name
  • password
  • commands to be executed (in arguments %4 through %9)
Yes, the password is in clear text. If this is a problem, learn about PuTTY ssh with security and encryption.

You can test this batch file in the same way you tested plink.


Write a Java Class to Run the Batch File

This is more complicated than it should be for two reasons. First, the available exec() command does not execute batch files. So, you need to use "cmd /q /c myhadoop.bat" to run it. This invokes a command interpreter to run the command (the purpose of the "/c" option). It also does not echo the commands being run, courtesy of the "/q" option.

The more painful part is the issue with stdout and stderr. Windows blocks a process when either of these buffers is full. What that means is that your code hangs, without explanation, rhyme, or reason. This problem, as well as others, is explained and solved in this excellent article, When Runtime.exec() won't.

The solution is to create separate threads to read each of the streams. With the example from the article, this isn't so hard. It is available in this file: HadoopFS.java.

Let me explain a bit how this works. The class HadoopFS has four fields:
  • command is the command that is run.
  • exitvalue is the integer code returned by the running process. Typically, processes return 0 when they are successful and an error code otherwise.
  • stdout is a list of strings containing the standard output.
  • stderr is a list of strings containing the standard error.
Constructing an object requires a string. This is the part of the hadoop command that appears after the "fs". So, for "hadoop fs -ls", this would be "-ls". As you can see, this could be easily modified to run any command, either under Windows or on the remote box, but I'm limiting it to Hadoop fs commands.

This file also contains a private class called threadStreamReader. (Hmmm, I don't think I have the standard java capitalization down, since classes often start with capital letters.) This is quite similar to the StreamGobbler class in the above-mentioned article. The difference is that my class stores the strings in a data structure instead of writing them to the console.
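Since I am not reproducing HadoopFS.java here, the following is a hedged reconstruction of what such a stream-reading thread can look like; the names follow the description above, but this is not the actual code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

class ThreadStreamReader extends Thread {
    private final InputStream stream;
    // The lines read from the stream end up here instead of on the console.
    final List<String> lines = new ArrayList<String>();

    ThreadStreamReader(InputStream stream) {
        this.stream = stream;
    }

    public void run() {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

One of these threads is attached to process.getInputStream() and another to process.getErrorStream(), and both are started before the call to process.waitFor(), so neither buffer can fill up and block the child process.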


Using the HadoopFS Class

At the beginning of this posting, I said that I wanted to do two things: (1) delete the output files before running the Hadoop job and (2) output the results. The full example for the WordCount driver class is in this file:
WordCount.java.

To delete the output files, I use the following code before the job is run:

    HadoopFS hdfs_rmr = new HadoopFS("-rmr " + outputname);
    hdfs_rmr.callCommand();

I've put the name of the output files in the string outputname.

To show the results, I use:

    HadoopFS hdfs_cat = new HadoopFS("-cat " + outputname + "/*");
    hdfs_cat.callCommand();
    for (String line : hdfs_cat.stdout) {
        System.out.println(line);
    }

This is pretty simple and readable. More importantly, they seem to work.


Conclusion


The hadoop framework does not allow us to do some rather simple things. There are typically three computing environments when running parallel code -- the development environment, the master environment, and the grid environment. The master environment controls the grid, but does not provide useful functionality for the development environment. In particular, the master environment does not give the development environment critical access to the parallel distributed files.

I want to develop my code strictly in java, so I need more control over the environment. Fortunately, I can extend the environment to support the "hadoop fs" commands in the development environment. I believe this code could easily be extended for the Unix world (by writing appropriate "cmd" and "myhadoop.bat" files). This code would then be run in exactly the same way from the java MapReduce code.

This mechanism is going to prove much more powerful than merely affecting the aesthetics of the WordCount program. In the next post, I will probably explain how to use this method to return arbitrary data structures between MapReduce runs.
 
