Description of Face Detection Package for Tasks 1 and 2.
The directory face_detection contains images and code that you will use in Task 1 and Task 2 for face detection evaluation. The entire directory can be downloaded as the ZIP file face_detection.zip.
The face_detection/images subdirectory contains 39 images that were selected from the public WIDERFace dataset. File face_detection/ground_truth.txt provides the ground truth on face locations for those images. For each image, the ground truth file contains the following information:
- A line specifying the directory and filename where that image is stored. For this assignment, you should ignore the directory, since all 39 images are stored in the same directory. For example, consider the first line in the file:
0--Parade/0_Parade_marchingband_1_641.jpg
From this line, we ignore the “0--Parade/” part, which specifies a directory. We keep the “0_Parade_marchingband_1_641.jpg” part, which is the image filename.
- A line specifying the number of faces in that image. For example, for the “0_Parade_marchingband_1_641.jpg” image, there is one face.
- For each face in the image, a line specifying the bounding box for that face. For example, for the “0_Parade_marchingband_1_641.jpg” image, the line
349 179 171 221 0 0 0 0 0 0
specifies a bounding box whose top-left corner is at column 349 and row 179, the width is 171 pixels, and the height is 221 pixels. We only care about those four numbers. The line has six additional numbers, that we don’t care about.
You are provided with a Matlab function read_ground_truth.m which reads this ground truth file. Notice that this function converts each bounding box to the format [top bottom left right], which differs from the format used in the text file.
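To make the two formats concrete, here is a small illustration of the conversion that read_ground_truth.m performs (this is a sketch of the mapping, assuming inclusive pixel coordinates; the exact off-by-one convention is defined by the provided function itself):

```matlab
% The text file stores [x y w h]: top-left column, top-left row, width, height.
% read_ground_truth.m returns [top bottom left right]. Using the example line:
x = 349; y = 179; w = 171; h = 221;
box = [y, y + h - 1, x, x + w - 1];   % [top bottom left right] = [179 399 349 519]
```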
You are also provided with a Matlab function template_detector.m, which implements a multiscale template detector based on normalized cross-correlation scores. The fourth argument to that function is the detection threshold. Any normalized cross-correlation scores below the threshold will NOT be included in the results. Note that scores above the threshold may still not be included in the results, if they are suppressed by nearby scores that are even higher.
Task 1 (20 points)
The code in the face_detection directory provides a Matlab function detection_accuracy.m, which you will need to complete. This function takes the following arguments:
- ground_truth_file. Here we will just use the face_detection/ground_truth.txt file that we have already described.
- template. This is an image (not a filename) of a face template. Examples of face templates that you can use are available in files average_face.png and average_face_cropped.png
- scales: an array specifying the scales at which we will do template search. You can create such an array using the make_scales_array.m function.
- detection_thr. This is the detection threshold that gets passed to the template_detector function.
- iou_thr. This is the intersection-over-union (IoU) threshold that you will use to decide if a detection result is correct or not, and whether a face was successfully detected or not.
Function detection_accuracy has the following return values:
- tp: This is the total number of true positives in the dataset. A detection result on an image is counted as a true positive if it has an IoU score ≥ iou_thr with at least one real face location in that image (where real face locations are obtained from the ground truth).
- fp: This is the total number of false positives in the dataset. A detection result on an image is counted as a false positive if it is not a true positive.
- fn: A real face location is counted as a false negative if it has an IoU score < iou_thr with all detection results for that image. In other words, if the ground truth specifies that there is a face at a specific location, and none of the detection results has a sufficiently high IoU score with that location, then that location is counted as a false negative. Intuitively, every face location that is specified in the ground truth produces either a true positive result or a false negative result.
To complete the detection_accuracy function and make it work, you need to implement a check_boxes function. This check_boxes function gets called at line 22 of the detection_accuracy function, and is used to determine the number of true positives, false positives, and false negatives corresponding to the detection results that we got for some image. The check_boxes function has the following specs:
- Input arguments:
- boxes: This is a matrix of N rows and 6 columns, where each row specifies the bounding box of a detection result. Each row has the format [top, bottom, left, right, score, scale]. In this function, the score and scale do not matter.
- ground_truth: This is a matrix of M rows and 4 columns, where each row specifies the bounding box of a real face location that is specified by the ground truth. Each row has the format [top, bottom, left, right].
- iou_thr: The IoU threshold that should be used to determine if a detection result matches a ground truth location or not.
- Return values:
- tp: The number of true positives in the detection results.
- fp: The number of false positives in the detection results.
- fn: The number of false negatives in the ground truth locations.
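The specs above can be sketched as follows. This is one possible implementation (not the official solution), with a hypothetical helper iou that computes intersection over union for two [top bottom left right] boxes:

```matlab
function [tp, fp, fn] = check_boxes(boxes, ground_truth, iou_thr)
% Sketch: count true positives, false positives, and false negatives
% for one image. Only columns 1-4 (top, bottom, left, right) of boxes matter.
    N = size(boxes, 1);
    M = size(ground_truth, 1);
    matched_gt = false(M, 1);   % ground truth locations matched by some detection
    tp = 0;
    for i = 1:N
        is_tp = false;
        for j = 1:M
            if iou(boxes(i, 1:4), ground_truth(j, 1:4)) >= iou_thr
                is_tp = true;
                matched_gt(j) = true;
            end
        end
        tp = tp + is_tp;
    end
    fp = N - tp;                % every detection is either a TP or an FP
    fn = sum(~matched_gt);      % ground truth faces missed by all detections
end

function score = iou(a, b)
% Intersection over union of two [top bottom left right] boxes,
% assuming inclusive integer pixel coordinates.
    inter_h = min(a(2), b(2)) - max(a(1), b(1)) + 1;
    inter_w = min(a(4), b(4)) - max(a(3), b(3)) + 1;
    inter = max(0, inter_h) * max(0, inter_w);
    area_a = (a(2) - a(1) + 1) * (a(4) - a(3) + 1);
    area_b = (b(2) - b(1) + 1) * (b(4) - b(3) + 1);
    score = inter / (area_a + area_b - inter);
end
```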
As a reminder, examples of face templates that you can use are available in files average_face.png and average_face_cropped.png.
On my desktop, it takes about 60-70 seconds to run the detection_accuracy function on all 39 images, when I set the argument scales equal to make_scales_array(1, 5, 1.1). For quicker testing, you can edit line 11 in the detection_accuracy function, so that the for loop does not go through the entire dataset.
Task 2 (10 points)
Suppose that you are going to use our template detector for a face detection software library that you will sell to customers. Your customers will be programmers, so they will be able to call the template detector with different detection thresholds. You need to decide if your face detection library should use the template in file average_face.png or the template in file average_face_cropped.png. Describe in text which choice you will make, and provide the justification for it. Your choice needs to be based on results that you get on the dataset we are using for Task 1. You do not have the option to include both templates; you need to choose one of them.
IMPORTANT: in order to get some more uniformity over possible answers, your answer should be based only on detection results using scales produced by make_scales_array(1, 5, 1.1).
Put your answer in a text or PDF file called task2.txt or task2.pdf, and include that file in your assignment5.zip submission.
Task 3 (25 points)
Write a function camera_matrix = perspective_calibration(correspondences) that returns the 3×4 camera matrix that maps 3D world coordinates to 2D pixel coordinates. The input argument correspondences is an N×5 matrix of double values. The n-th row of this matrix is of the format [x, y, z, u, v], where:
- (x, y, z) are the 3D coordinates of some point.
- (u, v) are the pixel coordinates where that point projects to in the image. Note that u is the horizontal coordinate (column number) and v is the vertical coordinate (row number).
Examples of correspondences that you can use are stored in files img1_wc1.txt (corresponding to image c1_img1.png shown above) and img1_wc2.txt (corresponding to image c2_img1.png shown above). Each of these two files shows the 3D world coordinates and 2D pixel locations of the 28 cube vertices visible on the corresponding image that is shown above.
IMPORTANT: the function should take as input a matrix (a 2D array), and NOT a filename. If you have file img1_wc1.txt in your current directory, and you want to load the data from that file to a matrix called correspondences, just type:
correspondences = load('img1_wc1.txt')
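One standard way to solve this task is the direct linear transform (DLT): each correspondence contributes two linear equations on the 12 entries of the camera matrix, and the solution is the null vector of the resulting system. The following is a sketch under that approach (assuming at least 6 correspondences, not all coplanar), not necessarily the only valid solution:

```matlab
function camera_matrix = perspective_calibration(correspondences)
% Sketch of a DLT solution. Each row [x y z u v] of correspondences gives
% two equations: p1.X - u*(p3.X) = 0 and p2.X - v*(p3.X) = 0, where
% p1, p2, p3 are the rows of the camera matrix and X = [x y z 1]'.
    N = size(correspondences, 1);
    A = zeros(2 * N, 12);
    for n = 1:N
        x = correspondences(n, 1); y = correspondences(n, 2);
        z = correspondences(n, 3);
        u = correspondences(n, 4); v = correspondences(n, 5);
        P = [x, y, z, 1];
        A(2*n - 1, :) = [P, zeros(1, 4), -u * P];
        A(2*n,     :) = [zeros(1, 4), P, -v * P];
    end
    % The camera matrix entries form the null vector of A, i.e. the
    % right singular vector with the smallest singular value.
    [~, ~, V] = svd(A);
    camera_matrix = reshape(V(:, end), 4, 3)';   % 12 entries -> 3x4 matrix
end
```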
Task 4 (30 points)
Write a function result = estimate_3D_point(c1, c2, u1, v1, u2, v2) that estimates the 3D world coordinates of a point, given two camera matrices and given the locations where that point projects on those two cameras.
Argument c1 is the 3×4 camera matrix of the first camera, and argument c2 is the 3×4 camera matrix of the second camera.
Argument u1 is the horizontal coordinate (column number) of the pixel where the 3D point projects in the first camera. Argument v1 is the vertical coordinate (row number) of the pixel where the 3D point projects in the first camera. Similarly, arguments u2 and v2 are the horizontal and vertical coordinates of the pixel where the 3D point projects in the second camera.
The result is a 3×1 column vector, whose format is [x, y, z]’, where (x, y, z) are the 3D coordinates we are looking for.
Some example data to experiment with are stored in file img2_c1c2.txt. Each row of that file is of the form [u1, v1, u2, v2], where (u1, v1) is a pixel location on image c1_img2.png shown above, and (u2, v2) is the corresponding pixel location on image c2_img2.png shown above. The 21 lines of that file contain the correspondences for each of the 21 visible cube vertices on those two images. To test your code, you can do these steps:
- Use your solution to Task 3 and the correspondences in file img1_wc1.txt to compute the camera matrix for the first camera.
- Use your solution to Task 3 and the correspondences in file img1_wc2.txt to compute the camera matrix for the second camera.
- Call your function using the two calibration matrices you just computed, and using one line from file img2_c1c2.txt to get your u1, v1, u2, v2 values.
For the 21 points described in img2_c1c2.txt, the correct 3D locations can be found in files img2_wc1.txt and img2_wc2.txt. Your result will not be identical to those locations, but it should be reasonably close if your solution is correct.
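A common approach here is linear triangulation: each camera gives two homogeneous linear equations on the unknown 3D point, and the four equations are solved jointly in the least-squares sense. The following is a sketch of that approach, not necessarily the only valid solution:

```matlab
function result = estimate_3D_point(c1, c2, u1, v1, u2, v2)
% Sketch of linear triangulation. For camera matrix c with rows r1, r2, r3
% and homogeneous 3D point X: u*(r3.X) - r1.X = 0 and v*(r3.X) - r2.X = 0.
    A = [u1 * c1(3, :) - c1(1, :);
         v1 * c1(3, :) - c1(2, :);
         u2 * c2(3, :) - c2(1, :);
         v2 * c2(3, :) - c2(2, :)];
    % Least-squares solution: right singular vector of the smallest
    % singular value.
    [~, ~, V] = svd(A);
    X = V(:, end);
    result = X(1:3) / X(4);   % convert from homogeneous to 3D coordinates
end
```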
Task 5 (15 points)
Write a function result = pinhole_location(correspondences) that estimates the pinhole location of a camera.
The input argument correspondences is of exactly the same format as in Task 3. As in Task 3, it should be an Nx5 matrix, NOT a filename. Each line of this argument (as in Task 3) is of the format [x, y, z, u, v], and describes the correspondence between a 3D world location and a 2D pixel.
The output should be a 3×1 column vector of the format [x, y, z]’, showing the 3D location of the pinhole of the camera.
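One way to approach this: first recover the camera matrix from the correspondences (as in Task 3), and then use the fact that the pinhole (camera center) C is the unique point satisfying P * [C; 1] = 0, i.e. the null vector of the 3×4 camera matrix. The sketch below assumes your Task 3 function perspective_calibration is on the Matlab path:

```matlab
function result = pinhole_location(correspondences)
% Sketch: the pinhole is the point that the camera matrix maps to the
% zero vector, i.e. the null vector of the 3x4 matrix P.
    P = perspective_calibration(correspondences);   % from Task 3
    [~, ~, V] = svd(P);
    C = V(:, end);            % 4x1 null vector of P (homogeneous coordinates)
    result = C(1:3) / C(4);   % 3x1 pinhole location [x, y, z]'
end
```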
How to submit
Submissions are only accepted via Blackboard. Submit a file called assignment5.zip, containing the following files:
- The Matlab source files implementing your solutions to the programming tasks.
- Any additional Matlab source files that are needed to run your code. If your code needs any code files available on the course website, please include those files with your submission.
- A README.txt file containing the name and UTA ID number of the student. No other information is needed for README.txt.
We try to automate the grading process as much as possible. Not complying precisely with the above instructions and naming conventions causes a significant waste of time during grading; points will be taken off for failure to comply, and/or you may receive a request to resubmit.
Please only include source code in your submissions. Do not include data files.
Code must run in Matlab version 2018b.
The submission should be a ZIP file. Any other file format will not be accepted.