Test and debug MapReduce

 3r33333. 3r3-31. At Rostelecom, we use Hadoop to store and process data downloaded from multiple sources using java applications. Now we have moved to the new version of hadoop with Kerberos Authentication. When moving, we encountered a number of problems, including the use of the YARN API. Using Hadoop with Kerberos Authentication deserves a separate article, and in this one we’ll talk about debugging Hadoop MapReduce. 3r3306.  3r33333. 3r3306.  3r33333. Test and debug MapReduce 3r3151. 3r3306.  3r33333. When performing tasks in a cluster, launching the debugger is complicated by the fact that we do not know which node will process one or another part of the input data, and we cannot configure our debugger in advance. 3r3306.  3r33333. 3r3306.  3r33333. You can use the time-tested `System.out.println (" message ")` . But how to analyze the output of `System.out.println (" message ")` scattered on these nodes? 3r3306.  3r33333. 3r3306.  3r33333. We can output messages to standard error. Everything written in stdout or stderr,
 3r33333. sent to the appropriate log file, which can be found on the web page of the extended information about the task or in the log files. 3r3306.  3r33333. 3r3306.  3r33333. We can also include debugging tools in the code, update task status messages, and use custom counters to help us understand the scale of the disaster. 3r3306.  3r33333. 3r3306.  3r33333. The Hadoop MapReduce application can be debugged in all three modes in which Hadoop can work:
 3r33333. 3r3306.  3r33333.
  •  3r33333.
  • standalone
     3r33333.  3r33333.
  • pseudo-distributed mode
     3r33333.  3r33333.
  • fully distributed
     3r33333.  3r33333.
3r3306.  3r33333. In more detail we will focus on the first two. 3r3306.  3r33333. 3r3306.  3r33333.

Pseudo-distributed mode 3r32r9595. 3r3306.  3r33333. Pseudo-distributed mode is used to simulate a real cluster. And it can be used for testing in an environment as close to productive as possible. In this mode, all Hadoop daemons will run on the same node! 3r3306.  3r33333. 3r3306.  3r33333. If you have a dev server or another sandbox (for example, a Virtual Machine with a customized development environment, such as Hortonworks Sanbox with HDP), then you can debug a control program using remote debugging tools. 3r3306.  3r33333. 3r3306.  3r33333. To start debugging, you need to set the value of the environment variable: 3r33434. ** [/b] YARN_OPTS ** . Below is an example. For convenience, you can create a startWordCount.sh file and add the necessary parameters to it to start the application. 3r3306.  3r33333. 3r3306.  3r33333.
    #! /bin /bash
3r33333. source /etc/hadoop/conf/yarn-env.sh
export YARN_OPTS = '- agentlib: jdwp = transport = dt_socket, server = y, suspend = y, address = $ 6000 {YARN_OPTS}'
3r33333. yarn jar wordcount-???.jar ru.rtc.example.WordCount /input /output
3r3306.  3r33333. Now, run the script `. /startWordCount.sh` , we will see message
 3r33333. 3r3306.  3r33333.
    Listening for transport dt_socket at address: 6000
3r3306.  3r33333. It remains to configure the IDE for remote debugging (remote debugging). I am using intellij IDEA. Go to the menu Run -> Edit Configurations Add a new configuration 3r-3234. ** [/b] Remote ** 3r3306.  3r33333. 3r3306.  3r33333. 3r3r166. 3r3306.  3r33333. 3r3306.  3r33333. Put a breakpoint in main and run. 3r3306.  3r33333. 3r3306.  3r33333. 3r3125. 3r3306.  3r33333. 3r3306.  3r33333. That's it, now we can debug the program as usual. 3r3306.  3r33333. 3r3306.  3r33333. ATTENTION. You need to make sure that you are working with the latest version of the source code. If not, you may have differences in the lines in which the debugger stops. 3r3306.  3r33333. 3r3306.  3r33333. In earlier versions of Hadoop, a special class was supplied that allowed you to rerun the failed task - isolationRunner. The data that caused the crash was saved to disk at the address specified in the Hadoop environment variable mapred.local.dir. Unfortunately, in the latest versions of Hadoop, this class is no longer available. 3r3306.  3r33333. 3r3306.  3r33333.

Standalone (local launch)

3r3306.  3r33333. Standalone is the standard mode in which Hadoop operates. It is suitable for debugging where HDFS is not used. With this debugging, you can use input and output through the local file system. Standalone mode is usually the fastest Hadoop mode, as it uses the local file system for all input and output data. 3r3306.  3r33333. 3r3306.  3r33333. As mentioned earlier, you can embed debugging tools in your code, for example, counters. Counters are defined by listing (3r3-3150. ** enum ** ) Java. The enumeration name defines the name of the group, and the enumeration fields define the names of the counters. The meter can be useful for assessing the problem, 3r3306.  3r33333. and can be used as an addition to debug output. 3r3306.  3r33333. 3r3306.  3r33333. Ad and use counter:
 3r33333. 3r3306.  3r33333.
    package ru.rt.example; 3r33333. 3r33333. import org.apache.hadoop.io.IntWritable; 3r33333. import org.apache.hadoop.io.LongWritable; 3r33333. import org.apache.hadoop.io.Text; 3r33333. import org.apache.hadoop.mapreduce.Mapper; 3r33333. 3r33333. 3r33333. public class Map extends Mapper    {
private Text word = new Text (); 3r33333. 3r33333. enum word {
TOTAL_WORD_COUNT, 3r3333319.}
3r33333. @Override
public void map (LongWritable key, Text value, Context context) {3r3333319. 3r33333. String[]stringArr = value.toString (). split (" s +"); 3r33333. 3r33333. for (String str: stringArr) {
word.set (str); 3r33333. context.getCounter (Word.TOTAL_WORD_COUNT) .increment (1); 3r33333.}
}
}
}
3r3306.  3r33333. For incrementing the counter, you need to use the method. ** increment (1) ** 3r3306.  3r33333. 3r3306.  3r33333.
  
context.getCounter (Word.TOTAL_WORD_COUNT) .increment (1); 3r33333.
3r3306.  3r33333. After successful completion of the MapReduce task at the end displays the values ​​of the counters. 3r3306.  3r33333. 3r3306.  3r33333.
    Shuffle Errors
BAD_ID = 0
CONNECTION = 0
IO_ERROR = 0
WRONG_LENGTH = 0 3r3333319. WRONG_MAP = 0
WRONG_REDUCE = 0 3r3333319. ru.rt.example.Map $ Word
TOTAL_WORD_COUNT = 655
3r3306.  3r33333. Erroneous data can be output to stderr or to stdout, or write output to hdfs using the class. ** MultipleOutputs ** for further analysis. The obtained data can be transferred to the application in standalone mode or when writing unit tests. 3r3306.  3r33333. 3r3306.  3r33333. Hadoop has the MRUnit library, which is used in conjunction with testing frameworks (for example, JUnit). When writing unit tests, we check that the output function produces the expected result. We use the MapDriver class from the MRUnit package, in whose properties we set the class under test. To do this, use the method `withMapper ()` input values ​​ `withInputValue ()` and the expected result is `withOutput ()` or `withMultiOutput ()` if multiple output is used. 3r3306.  3r33333. 3r3306.  3r33333. Here is our test. 3r3306.  3r33333. 3r3306.  3r33333.
    3r33333. package ru.rt.example; 3r33333. 3r33333. import org.apache.hadoop.io.IntWritable; 3r33333. import org.apache.hadoop.io.LongWritable; 3r33333. import org.apache.hadoop.io.Text; 3r33333. import org.apache.hadoop.mrunit.mapreduce.MapDriver; 3r33333. import org.apache.hadoop.mrunit.types.Pair; 3r33333. import org.junit.Before; 3r33333. import org.junit.Test; 3r33333. 3r33333. import java.io.IOException; 3r33333. 3r33333. public class TestWordCount {
3r33333. private MapDriver

3r3306.  3r33333.
Fully distributed mode (fully distributed mode) 3-333295. 3r3306.  3r33333. As the name suggests, this is a mode in which all the power of Hadoop is used. The launched program MapReduce can run on 1000 servers. It is always difficult to debug MapReduce, since you have Mappers running on different machines with different inputs. 3r3306.  3r33333. 3r3306.  3r33333. 3r3302. Conclusion 3r3303. 3r3306.  3r33333. As it turned out, testing MapReduce is not as simple as it seems at first glance. 3r3306.  3r33333. To save time searching for errors in MapReduce, I used all the listed methods and advise everyone to use them too. This is especially useful in the case of large installations, such as those that work at Rostelecom. 3r33333. 3r33333. 3r33333. 3r33333. 3r33312. ! function (e) {function t (t, n) {if (! (n in e)) {for (var r, a = e.document, i = a.scripts, o = i.length; o-- ;) if (-1! == i[o].src.indexOf (t)) {r = i[o]; break} if (! r) {r = a.createElement ("script"), r.type = "text /jаvascript", r.async =! ? r.defer =! ? r.src = t, r.charset = "UTF-8"; var d = function () {var e = a.getElementsByTagName ("script")[0]; e.parentNode.insertBefore (r, e)}; "[object Opera]" == e.opera? a.addEventListener? a.addEventListener ("DOMContentLoaded", d,! 1): e.attachEvent ("onload", d ): d ()}}} t ("//mediator.mail.ru/script/2820404/"""_mediator") () ();
3r33333. 3r33333. 3r33333. 3r33333. 3r33333. 3r33333.
+ 0 -

Add comment