Cookbook: MapReduce

Note: See the corresponding lecture notes about MapReduce. This page has cookbook recipes.

Kill a job

On delenn, first list the active jobs:

mapred job -list

Find yours, then kill it:

mapred job -kill <job-id>

Find out which file is being processed by map

From StackOverflow.

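// needs: import org.apache.hadoop.fs.Path;
//        import org.apache.hadoop.mapreduce.lib.input.FileSplit;
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException
{
    // The cast assumes a FileInputFormat-based job, whose splits are FileSplit objects.
    Path filePath = ((FileSplit) context.getInputSplit()).getPath();

    // full path:
    String filePathString = filePath.toString();

    // just file name:
    String fileName = filePath.getName();

    // ...
}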

Map over files recursively

To process every file in a directory and all of its subdirectories, turn on recursive input for the job:

// in main()

FileInputFormat.setInputDirRecursive(job, true);

Only map over files that match a certain regex

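Create a class that implements PathFilter. The class below reads its regular expression from the file.pattern configuration property, defaults to .* (match everything) when the property is unset, and always accepts directories: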

// from: https://hadoopi.wordpress.com/2013/07/29/hadoop-filter-input-files-used-for-mapreduce/
public static class RegexPathFilter extends Configured implements PathFilter {

    Pattern pattern;
    Configuration conf;
    FileSystem fs;

    @Override
    public boolean accept(Path path) {
        try {
            if (fs.isDirectory(path)) {
                return true;
            } else {
                Matcher m = pattern.matcher(path.toString());
                System.out.println("Does path " + path + " match "
                        + conf.get("file.pattern") + "? " + m.matches());
                return m.matches();
            }
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    @Override
    public void setConf(Configuration conf) {
        super.setConf(conf); // keep Configured's copy in sync so getConf() works
        this.conf = conf;
        if (conf != null) {
            try {
                fs = FileSystem.get(conf);
                if(conf.get("file.pattern") == null) {
                    conf.set("file.pattern", ".*");
                }
                pattern = Pattern.compile(conf.get("file.pattern"));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Use it like so:

// in main()
Configuration conf = new Configuration();
conf.set("file.pattern", ".*(Users\\.xml|postsanswers\\.txt)");
Job job = Job.getInstance(conf, "users reputation");

// ...
FileInputFormat.setInputPathFilter(job, RegexPathFilter.class);
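
The filter calls Matcher.matches(), so the pattern has to match the entire path string (scheme, host, and directories included), not just the file name. That is why the example pattern starts with .* and why a trailing suffix such as .bak is rejected. The following is a minimal sketch for checking a pattern before submitting a job. It uses only java.util.regex; the class name FilePatternCheck and the sample paths are hypothetical, made up for illustration.

```java
import java.util.regex.Pattern;

public class FilePatternCheck {
    // Same pattern string as passed to conf.set("file.pattern", ...) above.
    static final Pattern PATTERN =
        Pattern.compile(".*(Users\\.xml|postsanswers\\.txt)");

    // Mirrors the filter's test for files: the whole path string must match.
    static boolean accepts(String path) {
        return PATTERN.matcher(path).matches();
    }

    public static void main(String[] args) {
        // Hypothetical paths, for illustration only.
        String[] samples = {
            "hdfs://namenode/data/stackoverflow/Users.xml",
            "hdfs://namenode/data/stackoverflow/postsanswers.txt",
            "hdfs://namenode/data/stackoverflow/Users.xml.bak",
            "hdfs://namenode/data/stackoverflow/Votes.xml"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

With these sample paths, the first two are accepted and the last two are rejected.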

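Combine this technique with the ‘recursive’ technique above to filter files in subdirectories as well. The filter accepts every directory, so recursion can still descend into subdirectories whose names do not match the pattern.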
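
To preview which files a recursive, regex-filtered input would pick up, one option is a dry run against a local copy of the data. The sketch below is not Hadoop code: it uses java.nio.file.Files.walk as a stand-in for recursive input and applies the same whole-path regex test as the filter. The class name LocalInputPreview and its method are hypothetical, and local path strings will differ from HDFS URIs, so treat the result as an approximation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LocalInputPreview {
    // Walks root recursively and returns the regular files whose full path
    // string matches filePattern, sorted for stable output.
    static List<String> matchingFiles(Path root, String filePattern) throws IOException {
        Pattern pattern = Pattern.compile(filePattern);
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile) // directories are traversed, never returned
                .map(Path::toString)
                .filter(p -> pattern.matcher(p).matches())
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: LocalInputPreview <local-dir> <regex>");
            return;
        }
        // args[0]: local directory; args[1]: same regex as file.pattern
        for (String p : matchingFiles(Paths.get(args[0]), args[1])) {
            System.out.println(p);
        }
    }
}
```

Run it with a local directory and the same string you would store in file.pattern; every printed path is a file the pattern accepts somewhere under that directory.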

CINF 401 material by Joshua Eckroth is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Source code for this website is available on GitHub.