# Assignment 3: Data Joins in Hadoop (Solution)




This lab is designed to have you do a “data join” in Hadoop using two
single-file databases. You’ll be extending a Hadoop program to use
(possibly) `ChainMapper` and either `DataJoin` or a join-based solution
of your own design (which is what I would recommend).

You also have the option of using the “streaming” interface and Python code.

You start by modifying the starter code provided and extending it –
the extension will be to determine, for each patent, the number of
cited patents that originate from the same state. You will use the
description of the `acite75_99.txt` and `pat63_99.txt` files (see
the accompanying documentation).

## Steps & Directions

Retrieve the data files from Moodle and/or git. You only need the
`acite75_99.txt` and `pat63_99.txt` files. You should unzip the

data files and place them into your HDFS file system or in a local

directory if you’re using a local development machine.

The `acite75_99.txt` file contains a citation index, of the form

```
CITING,CITED
```

where both CITING and CITED are integers. Each line
indicates that patent number CITING cites patent CITED.
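If you take the Python streaming route, the parsing step for this file might look like the sketch below; the presence and exact shape of a header row is an assumption, so check it against the real file:

```python
def parse_citations(lines):
    """Yield (citing, cited) integer pairs from lines like '3858241,956203'.

    Skips blank lines and any line that doesn't start with a digit
    (e.g. an assumed CITING,CITED header row).
    """
    for line in lines:
        line = line.strip()
        if not line or not line[0].isdigit():
            continue
        citing, cited = line.split(",")
        yield int(citing), int(cited)
```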

The `pat63_99.txt` file contains the patent number, the (optional)
state in which the patent was filed, and the total number of citations.

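A first pass over `pat63_99.txt` can build a patent-to-state lookup. In the sketch below the state column index is a placeholder assumption; take the real position of the state field from the dataset documentation:

```python
STATE_COL = 5  # hypothetical column index for the state field; verify against the docs

def patent_states(lines, state_col=STATE_COL):
    """Map patent number -> state, skipping patents with no state recorded."""
    states = {}
    for line in lines:
        fields = line.strip().split(",")
        if not fields or not fields[0].isdigit():
            continue  # header or blank line
        state = fields[state_col].strip('"')
        if state:
            states[int(fields[0])] = state
    return states
```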

Your job is to augment the data in `pat63_99.txt` to include a column

indicating the number of patents cited that originate *from the same

state*. Obviously, this data can only be calculated for patents that

have originating state information and only for cited patents that

provide that information. You should generate a new file (possibly
multi-part) that contains the augmented information.

For example,

patent 6009554 (the last patent in pat63_99.txt) cited 9 patents. Those patents were awarded to people in

* NY,

* IL,

* Great Britain (no state),

* NY,

* NY,

* FL,

* NY,

* NY,

* NY.

For the first part, you would produce a new data file that updates the
original `pat63_99.txt` line for patent 6009554 by appending a final
column. The last value, `,6`, is the number of same-state citations
(the six `NY` citations above, which match the patent’s own state).

To do this, you will first need to do a “data join” of the citations and

the patent data – for each cited patent, you’ll need to determine the

state of the cited patent. You can then use that information to

produce the augmented patent information.
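In the actual Hadoop job the join happens through the shuffle: records from both files are keyed by patent number and the reducer pairs them up. The same logic as an in-memory Python sketch, where the input shapes (citation pairs and a patent-to-state dict) are assumptions for illustration:

```python
def join_cited_states(citations, states):
    """citations: iterable of (citing, cited) patent-number pairs.
    states: dict mapping patent number -> state (patents with no
    recorded state are simply absent).

    Yields (cited, cited_state, citing, citing_state) rows, one line of
    the intermediate table per citation; missing states come out as None.
    """
    for citing, cited in citations:
        yield (cited, states.get(cited), citing, states.get(citing))
```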

It’s useful to produce an intermediate table like

| CITED   | CITED_STATE | CITING  | CITING_STATE |
|---------|-------------|---------|--------------|
| 2134795 | None        | 5654603 | OH           |
| 2201699 | None        | 5654603 | OH           |
| 3031593 | None        | 5654603 | OH           |
| 3093764 | OH          | 5654603 | OH           |
| 3437858 | OH          | 5654603 | OH           |
| 3852137 | PA          | 5654603 | OH           |
| 3904724 | PA          | 5654603 | OH           |

This table says that patent `3852137` is from `PA` and `5654603` is from `OH`.

You would construct this for each cited patent. From this, it’s simple to determine
how many of a given patent’s citations come from its own state.
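The counting step then reduces to a small aggregation. A sketch over rows shaped like the intermediate table above, i.e. `(cited, cited_state, citing, citing_state)` tuples:

```python
from collections import Counter

def same_state_counts(joined_rows):
    """Count, per citing patent, how many of its citations share its state."""
    counts = Counter()
    for _cited, cited_state, citing, citing_state in joined_rows:
        # Patents with no recorded state (None) can never match.
        if cited_state is not None and cited_state == citing_state:
            counts[citing] += 1
    return dict(counts)
```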

You should construct all of this as a single “Job”, using `ChainMapper`
to orchestrate the multiple steps. There’s an example
on Stack Overflow, the Hadoop class files
give descriptions as well, and I’ve given you template code to start with.

You can also construct this using Python and I highly recommend doing that first.

## Input data files

Originally, the data files were part of the template git repo but that

causes problems with Github classroom. You should be able to download

or access the data files using the provided link, which will show them
to you in Google Drive.

## What to hand in

You should develop your Python and/or Java solutions in the individual subdirectories.

In each case, you should include a file that describes the steps of your solution.