Objective
In this assignment, you will work with a simple web crawler to measure and study the characteristics of selected USC school websites.
Preliminaries
To begin, we will make use of an existing open-source Java web crawler called crawler4j, which is hosted on GitHub. For complete details on downloading and compiling it, see:
https://github.com/yasserg/crawler4j
Your task is to configure and compile the crawler and then have it crawl one of the USC schools’ websites. In the interest of distributing the load evenly and not overloading the school servers, we have pre-assigned the school to be crawled according to your USC ID number, given in the table below.
The maximum number of pages to fetch can be set in crawler4j; it should be set to 5000 to ensure a reasonable execution time for this exercise. Likewise, the maximum crawl depth should be set to 5 to limit the extent of the crawl.
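For reference, a minimal controller sketch along the lines of the crawler4j README is shown below. The storage folder, seed URL, thread count, and the MyCrawler class name (your WebCrawler subclass, discussed later) are illustrative placeholders; only the two limits are prescribed by this assignment.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");  // scratch folder (illustrative)
        config.setMaxPagesToFetch(5000);             // required limit for this exercise
        config.setMaxDepthOfCrawling(5);             // required limit for this exercise

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://arch.usc.edu/");  // replace with your assigned root URL
        controller.start(MyCrawler.class, 7);        // MyCrawler is your WebCrawler subclass
    }
}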
You should crawl only the school website assigned to you, and your crawler should be configured so that it does not visit pages outside of the given school website!
USC ID ends with | School to crawl        | Root URL
01~14            | Architecture           | http://arch.usc.edu/
15~26            | Cinematic Arts         | http://cinema.usc.edu/
27~38            | Dornsife (College)     | http://dornsife.usc.edu/
39~49            | Gould (Law)            | http://gould.usc.edu/
50~60            | Keck (Medicine)        | http://keck.usc.edu/
61~71            | Marshall (Business)    | http://www.marshall.usc.edu/
72~78            | Viterbi (Engineering)  | http://viterbi.usc.edu/
79~88            | Price (Public Policy)  | http://priceschool.usc.edu/
89~00            | Social Work            | http://sowkweb.usc.edu/
Limit your crawler so it only downloads HTML, DOC, and PDF files. These files should be retained for use in exercises 3 and 4.
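As a sketch of how the school-only restriction and the file-type filtering might be combined, crawler4j lets you override shouldVisit. The root URL and the extension blacklist below are illustrative, not prescribed; because many HTML pages carry no file extension, it is usually easier to blacklist clearly unwanted types here and then check the actual content type again in visit() before saving a file.

import java.util.regex.Pattern;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    // Illustrative values: substitute your assigned school's root URL.
    private static final String ROOT_URL = "http://arch.usc.edu/";
    private static final Pattern DISALLOWED =
        Pattern.compile(".*\\.(css|js|json|gif|jpe?g|png|ico|mp[34]|avi|mov|zip|gz)$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Stay inside the school website and skip media/script files outright.
        return href.startsWith(ROOT_URL) && !DISALLOWED.matcher(href).matches();
    }
}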
Your first task is to enhance the crawler so it collects information about:
1. all URLs the crawler attempts to fetch, together with the resulting HTTP status code (fetch.csv);
2. all files the crawler successfully downloads, together with the size of the file in bytes, the number of outlinks found in it, and the resulting content type (visit.csv);
3. all URLs (including duplicates) discovered on the visited pages, each marked as residing within or outside the school website (urls.csv).

Note 1: you should modify the crawler so it outputs the above data into the three separate CSV files named above; we may use them for processing later.
Note 2: you should save all of the downloaded web pages, etc. for processing in the next exercise.
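A minimal sketch of recording one visit.csv row per downloaded page is given below. The column layout, the unsynchronized per-call append, and the file name handling are illustrative choices, not a prescribed format; a shared, synchronized writer would be more robust with multiple crawler threads.

import java.io.FileWriter;
import java.io.IOException;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

// Inside your WebCrawler subclass:
@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    int size = page.getContentData().length;              // downloaded size in bytes
    String type = page.getContentType() == null
            ? "unknown"
            : page.getContentType().split(";")[0];        // strip "; charset=..."
    int outlinks = 0;
    if (page.getParseData() instanceof HtmlParseData) {
        outlinks = ((HtmlParseData) page.getParseData()).getOutgoingUrls().size();
    }
    try (FileWriter out = new FileWriter("visit.csv", true)) { // append one row per page
        out.write(String.format("%s,%d,%d,%s%n", url, size, outlinks, type));
    } catch (IOException e) {
        System.err.println("Could not record " + url + ": " + e);
    }
}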
Based on the information recorded by the crawler in its output files, you are to collate the following statistics for a crawl of your designated school website:
Fetch Statistics:
o # fetches attempted: The total number of URLs that the crawler attempted to fetch. This is usually equal to the MAXPAGES setting if the crawler reached that limit, and less if the website is smaller than that.
o # fetches succeeded: The number of URLs that were successfully downloaded in their entirety, i.e. returning an HTTP status code of 2XX.
o # fetches failed or aborted: The number of fetches that failed for whatever reason, including, but not limited to: HTTP redirections (3XX), client errors (4XX), server errors (5XX) and other network-related errors.

Outgoing URLs:
o Total URLs extracted: The grand total number of URLs extracted from all visited pages.
o # unique URLs extracted: The number of unique URLs encountered by the crawler.
o # unique URLs within School: The number of unique URLs encountered that are associated with the school website, i.e. the URL begins with the given root URL of the school.
o # unique usc.edu URLs outside School: The number of unique usc.edu URLs encountered that were not from the school website.
o # unique URLs outside usc.edu: The number of all other unique URLs encountered.

Status Codes:
o The number of times each HTTP status code was encountered, including (but not limited to): 200, 301, 401, 402, 404, etc.
These statistics should be collated and submitted as a plain text file whose name is CrawlReport.txt, following the format given in Appendix A at the end of this document.
Computing Page Rank
For Assignment #3 you will need to compute the Page Rank of each page you download in this assignment. To do that you will need to keep a record of all URLs contained in a given page. Therefore you should also generate a csv file that includes every successfully downloaded html file, in column 1, and the outgoing URLs that were contained in the page, in subsequent columns 2, 3, 4, etc. Name the file pagerankdata.csv. You need not submit this file with your assignment #2, but you will need it when you get to assignment #3.
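A sketch of emitting one such row from the visit() callback follows; it reuses the imports from the earlier visit.csv sketch, and it ignores the possibility of commas inside URLs for brevity, which a real implementation would need to handle (e.g., by quoting).

// Inside visit(), after the page has been parsed as HTML -- a sketch only.
if (page.getParseData() instanceof HtmlParseData) {
    HtmlParseData html = (HtmlParseData) page.getParseData();
    StringBuilder row = new StringBuilder(page.getWebURL().getURL()); // column 1: this page
    for (WebURL out : html.getOutgoingUrls()) {
        row.append(',').append(out.getURL());                         // columns 2, 3, 4, ...
    }
    try (FileWriter w = new FileWriter("pagerankdata.csv", true)) {
        w.write(row.append(System.lineSeparator()).toString());
    } catch (IOException e) {
        System.err.println("Could not record outlinks for " + page.getWebURL().getURL());
    }
}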
Make sure you understand the crawler code and outputs before you commence collating these statistics.
All the information that you are required to collect can be derived by processing the crawler output.
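For example, assuming your fetch.csv ends each row with the status code (the layout is whatever you chose above), the status-code portion of the report could be tallied with a short post-processing program like this sketch:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;
import java.util.TreeMap;

public class Collate {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by status code
        try (BufferedReader in = new BufferedReader(new FileReader("fetch.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // The status code is assumed to be the last comma-separated field.
                String status = line.substring(line.lastIndexOf(',') + 1).trim();
                counts.merge(status, 1, Integer::sum);
            }
        }
        counts.forEach((code, n) -> System.out.println(code + ": " + n));
    }
}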
Note that http://usc.edu/foo and http://usc.edu/FOO are two distinct URLs and should be counted as such. The page served may be the same because some web servers treat the path portion of a URL case-insensitively; this is one of the reasons why deduplication is necessary in practice.
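To illustrate the distinction (the normalize helper below is hypothetical, not part of crawler4j): only the scheme and host of a URL are case-insensitive, so a set-based deduplication should lowercase just those parts and leave the path alone.

import java.net.URI;
import java.util.HashSet;
import java.util.Set;

public class UrlDedup {
    // Lowercase only the case-insensitive parts (scheme and host); keep the path as-is.
    static String normalize(String url) {
        URI u = URI.create(url);
        return u.getScheme().toLowerCase() + "://" + u.getHost().toLowerCase() + u.getRawPath();
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        seen.add(normalize("http://usc.edu/foo"));
        System.out.println(seen.contains(normalize("HTTP://USC.EDU/foo"))); // true: same URL
        System.out.println(seen.contains(normalize("http://usc.edu/FOO"))); // false: different path
    }
}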
Also check your Java version; the code includes more recent Java constructs such as the typed collection List<String>, which requires at least Java 1.5.0.
When you first run the crawler, you may see the following log4j warnings; they simply mean that logging has not been configured:

log4j:WARN No appenders could be found for logger
log4j:WARN Please initialize the log4j system properly.
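If you do want to see the crawler's log output, one quick fix (assuming the bundled log4j 1.x) is to install a default console appender once at startup:

import org.apache.log4j.BasicConfigurator;

// Call once at the start of main(), before creating the CrawlController:
BasicConfigurator.configure(); // installs a default console appender and silences the warning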
During the crawl you may occasionally see messages like:

INFO [Crawler 1] I/O exception (org.apache.http.NoHttpResponseException) caught when processing request: The target server failed to respond
INFO [Crawler 1] Retrying request
As indicated by the info message, the crawler will retry the fetch, so a few isolated occurrences of this message are not an issue. However, if the problem repeats persistently, the situation is not likely to improve if you continue hammering the server at the same frequency. Try giving the server more room to breathe:
config.setPolitenessDelay(2500); // CHANGE THIS TO 2500 OR HIGHER.
Certain pages can also trigger exceptions inside the crawler, for example:

java.lang.StringIndexOutOfBoundsException: String index out of range: -2
java.lang.NullPointerException: charsetName

These problems are few in number (compared to the entire crawl size), and for this exercise we're okay with it as long as the crawler skips the few problem cases, keeps crawling everything else, and terminates properly, as opposed to exiting with fatal errors.
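One way to achieve that, sketched below, is to guard your own per-page processing so that a single malformed page is logged and skipped rather than aborting the crawl; recordPage here is a hypothetical helper standing in for your CSV logging and file saving.

@Override
public void visit(Page page) {
    try {
        recordPage(page);   // hypothetical helper: your CSV logging and file saving
    } catch (RuntimeException e) {
        // Skip the occasional malformed page instead of letting it kill the crawl.
        System.err.println("Skipping " + page.getWebURL().getURL() + ": " + e);
    }
}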
You may also see the following SLF4J notice; it is harmless and can be ignored:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Submission
Collate your statistics into CrawlReport.txt (following Appendix A) and package it together with your three CSV output files into a single archive named crawl.zip. Use only standard zip format. Do NOT use other formats such as zipx, rar, ace, etc. The zip file will contain:
o CrawlReport.txt (the statistics file)
o fetch.csv
o visit.csv
o urls.csv
Submit the archive with:
$ submit -user csci572 -tag hw2 crawl.zip
Appendix A
Use the following format to tabulate the statistics that you collated based on the crawler outputs.
Note: The status codes and content types shown are only a sample. The status codes and content types that you encounter may vary, and should all be listed and reflected in your report. Do NOT lump everything else that is not in this sample under an “Other” heading. You may, however, exclude status codes and types for which you have a count of zero.
Warning: The files you submit will be used for automated grading. Failure to follow the format strictly may result in inability to grade your submission. Use plain text files only (*.txt). Do NOT use rich text format (RTF).
CrawlReport.txt
Name: Tommy Trojan
USC ID: 1234567890
School crawled: Architecture
Fetch Statistics
================
# fetches attempted:
# fetches succeeded:
# fetches failed or aborted:
Outgoing URLs:
==============
Total URLs extracted:
# unique URLs extracted:
# unique URLs within School:
# unique usc.edu URLs outside School:
# unique URLs outside usc.edu:
Status Codes:
=============
200 OK:
301 Moved Permanently:
File Sizes:
===========
< 1KB:
1KB ~ <10KB:
10KB ~ <100KB:
100KB ~ <1MB:
>= 1MB:
Content Types:
==============
text/html:
image/gif:
image/jpeg:
image/png:
application/pdf: