CSC108H Assignment 2, Fall 2011
Image Search on the Web

Due: Thursday, 3 November 2012, 10:00 pm sharp

This assignment should be completed individually.

The purpose of this assignment is to give you practise using loops, strings, and lists, and designing helper functions. You will complete a program that does image search on the web --- a (very!) scaled down version of what tools like Google Images do.

1. Introduction

Web search

Most of us can't imagine getting through a day without searching the web. Have you thought about how a search engine like Google knows what web pages are relevant to your search term? It regularly "crawls" the web, following links from page to page, creating "indices" that store relationships between web pages and words on those pages. Then, when you search for "libya news", for instance, the search engine looks that term up in its indices and retrieves the list of associated web pages. There is a vast amount of interesting computer science that goes with all of this, including how to store and structure the indices so that searching them will be fast (even though the indices are enormous) and how to order the results so that the most relevant ones are presented first.

So far, we've been talking about searching through text. How about image search? HTML (Hypertext Markup Language) is the language used to tell your web browser things like which pieces of the text are items in a numbered list, what webpage a link should take you to, and so on. HTML also includes some notation for putting an image on a web page. Authors also have the option of including "alt" text (short for "alternative text") that describes the image. This text has many uses. For example, those who have images turned off in their browser can be shown the text instead, and visually impaired users can use software that reads the text to them. Your program is going to use the alt text to help users find images about desired topics.

A little bit about HTML

If you'd like to learn about HTML there are many online resources (such as this one), but the only HTML you need for this assignment is the tiny bit described here.

HTML is a "markup" language; we annotate (or mark up) the text to be displayed with additional information that tells a browser how it should be displayed. For instance, here is a small piece of HTML:

<ul>
   <li>butterscotch sauce</li>
   <li>whipped cream</li>
   <li>chopped pecans</li>
</ul>
    
and here is how it is displayed in your browser: The formatting of the HTML itself doesn't matter, since white space is ignored, but we use line breaks and tabs to make it easier for humans to read. What does matter are the "tags", which are the bits enclosed in angle brackets. The beginning and ending of an unordered list is marked with <ul> and </ul>. Similarly, <li> and </li> mark the beginning and ending of an item within a list. Some tags need to include additional information. For instance, to define a link such as this one to the 108 course website, you use the tag <a> (for "anchor") as follows:
    <a href="http://www.cdf.utoronto.ca/~csc108h/fall/">108 course website</a>
    
The text of the link ("108 course website") goes between the tags, but we also need to provide a URL. The URL is placed inside the opening tag, using href="http://www.cdf.utoronto.ca/~csc108h/fall/", or whatever URL the link should connect to. href is called an "attribute".

In summary, most tags have this general structure:

        <tag-name attribute-1="value-1" attribute-2="value-2" ... attribute-n="value-n">text</tag-name>
    
A tag can have no attributes at all, and like Python, the quotes around an attribute value can be single (') or double (") quotes. However, each pair of quotes must match. (You can't start an attribute with a single quote and end it with a double quote.) If a tag has multiple attributes, they can occur in any order, and there can be extra whitespace between the pieces of an img tag. For instance, this is valid:
        <student     college   ="St. Mike's" cgpa = '4.0'    credits="13.5"  >text</student>
    

Not the best 108 instructor For this assignment, the only HTML tag that matters is the img tag, which is used for including images in a web page. Here's what the HTML looks like for including a gratuitous image of one of the 108 instructors:

    <img src="dan_small.jpg" alt="Not the best 108 instructor" align='left'>
    
The img tag has several attributes, some of them optional. The only ones that matter for this assignment are src, a mandatory attribute which says where to find the image file, and alt, an attribute used to provide alternative text for the image. In the latest HTML specification, alt is mandatory, except in some very specific cases. However, there are many webpages containing img tags that don't have an alt attribute. Your program will ignore those img tags.

2. The Starter Code

Your image search program will begin at a given URL, "crawl" the web to find images, and build an index structure that records information images and their alt text. Then, it will ask the user to provide a filter describing the images that they want to know about and will provide a listing of the images that satisfy the filter.

We are providing crawl.py, which handles the interaction with the user and the crawling itself. The code in crawl.py is complete; you don't need to modify anything in it, and you won't be handing it in. We are also providing a2.py, which contains functions that crawl.py uses to build the index and do the filtering. The code in a2.py currently includes only definitions and docstrings for the functions. Your job is to replace each pass statement with code that implements that function.

Overview of the Starter Code

The __main__ section in crawl.py asks the user for a URL and calls function crawl to crawl the web starting at that location. If it were to continue following links with no limits, it would crawl the entire web, which would take ... a while. To avoid this, it asks for the maximum "depth" to crawl to, and instructs the crawl function to follow sequences of links that are at most that long. crawl returns a structure that records information about alt text and the images associated with that text. We are calling it an "image association list", and its structure is defined below. Some words, such as "photo" and "the" come up so frequently in alt tags that they don't provide any useful information. So the main block asks the user to choose a limit on the length of the image association lists to be built, and passes this limit on to function crawl.

After crawling, the program reports on the images found and the associations stored in the IAL. It then asks the user for a filter (see below for details) and reports the list of all images that satisfy the filter.

The main block does all the necessary input and output and uses three helper functions to handle most of the work:

Crawling the web: Just FYI

You don't need to understand our crawl function in order to do this assignment, but you may be curious. It uses a couple of clever techniques. The first is to keep a "worklist" -- a list of webpages that it needs to visit. Of course, at the beginning there is just one: the page where the user has indicated the crawling should begin. Whenever the crawl function reads a page, it finds all the links on it. To find those links, the function uses a module called re ("regular expression") to find the href tags. (Note that you are not allowed to use the re module in the code you write.) It adds each of these links to the list of webpages that it needs to visit. Each time through the main while loop, the function visits another webpage on the worklist and processes it as we just described.

In the list of pages to crawl, the code remembers not only the URLs, but also their depth -- how many links away they are from the webpage we started at. We need to remember the depth to cut off any branches that get too long. (Remember: we don't want to crawl the entire web.) Another problem the code avoids is looping branches: it's easy to find a sequence of links from webpage to webpage that takes you back to where you started. If the program did this, it could loop forever (except for the depth limit it imposes). To avoid wasting time on loops, the code keeps a list of webpages it has visited, and makes sure never to visit them again.

Programs like ours are a kind of "bot" or "web robot". Because they can create a heavy load on the web servers of popular websites, a protocol has evolved to give administrators a way to control this access: They can create files called robots.txt that say whether they welcome requests from bots and, if so, which pages may be crawled. Our function crawl respects this protocol. It uses a module called rp to look for a robots.txt file, and only reads a webpage if it is welcome to do so.

3. Your Job

Your job is to complete the functions in a2.py. These include functions process_page, cleanup, all_images and process_filter_description mentioned above, as well as a handful of others. Even if you design your code so one of the functions isn't actually called, you still must complete all the functions in the a2.py file. We will test each function independently.

You may not add any code to a2.py that is not a function. (In particular, you must not define any variables that are outside the functions you are writing, and you must not import additional modules in a2.py.)

Make sure you understand the docstrings of all of the function in a2.py before you begin, because some of the functions may prove to be great helpers to others. In addition, look for opportunities to design your own helper functions. Using helper functions is essential in order to avoid repetitive code and to make sure that each function is a small, cohesive piece of code. Part of your mark will be for good design and use of helpers.

Note that our function crawl is longer than most of the functions you should write. Generally, we like to see functions that are no more than a page long when printed, including comments. If you use helper functions well, this will happen naturally. We pulled out two parts of what crawl does by using helper functions. Although the code that is left is a little long, it is difficult to break down further without actually interfering with the reader's ability to see the algorithm. You should expect most of your functions to be quite a bit shorter than this, and you should give equally careful consideration to the use of helper functions.

Throughout the program, you may assume that all input is valid. For example, later in this handout we say that you may assume certain things will never occur on the web pages your code has to process. If those things did occur, that input would be invalid. We will not test your program on invalid input.

Below are some further requirements, as well as tips that will help you with your code.

The image association list

We define an association list to be a list of lists, such as

        [ [3, ["hello", 27.4, True]], 
          [7.76, ["wonder"]], 
          ["drama", [13, "comedy", "goodbye", 1]] ]
    
Each sublist has two elements: An image association list is an association list in which each key is a string and its associated list contains the names of image files that have that string in their img tag.
        [ ["madonna", ["img3541.jpg", "img1234.jpg"]], 
          ["mtv", ["img2999.jpg", "img1234.jpg", "gaga.JPG", "gaga22.JPG"] ]
    
Your code will build an image association list as the program crawls the web.

What are the keys in the image association list?

The alt text for an image is often a phrase or a whole sentence. Ideally, an image search engine would identify the words in the alt text, and associate each one with the image. But correctly finding the words in a string is actually somewhat difficult. To do it properly, you would have to deal with complexities involving punctuation and other things. In order to spare you that, we want you to break the alt text into pieces simply by calling split (with no arguments) on the full string. We also want you to convert all the keys to lowercase, to make it easier to do matching. For example, in this img tag

<img align=top src="photos/horton.JPG" alt="Image of instructor (Diane Horton)">
    
the pieces of the alt attribute would be what you get from calling split (with no arguments) on "Image of instructor (Diane Horton)" and then converting each result to lowercase: "image", "of", "instructor", "(diane", and "horton)". Notice that "(diane" and "horton)" are not proper words. Don't try to improve on that; just take what split gives you, or your code will be considered wrong by our autotester.

Rather than calling these pieces "words" (which they aren't necessarily), computer scientists use the term "tokens". Your code must make sure that when the program finds an img tag with alt text, each of the tokens in the alt text becomes a key in the image association list, and that the name of the image file is included in the list of image files associated with that key.

What are the image file names in the image association list?

For the name of the image file, just use the exact value of the src attribute in its img tag. Sometimes this will be a complete URL such as the one in this example:

    <img alt="ambassadors" src='http://web.cs.toronto.edu/Assets/DCS+Digital+Assets/ambassadors2011.jpg'>
    
The string 'http://web.cs.toronto.edu/Assets/DCS+Digital+Assets/ambassadors2011.jpg' can be cut and pasted unaltered into a web browser to look up the image. Other times, an img tag uses a relative address, such as in this example:
    <img align=top src="photos/horton.JPG" alt="Image of instructor (Diane Horton)">
    
The string 'photos/horton.JPG' is not a complete URL. One could convert relative addresses such as this into complete URLs, but don't try to do that: there are too many wrinkles that make it a bit difficult. More importantly, our autotesting code is expecting the unaltered src strings, and your code will be marked as incorrect if it does anything different.

Image association list example

If your program encounter these img tags

    <img align=top src="photos/horton.JPG" alt="Image of instructor (Diane Horton)">
    <img src="MissKate.JPG" alt="ballet instructor">
    <img alt="ambassadors" src='http://web.cs.toronto.edu/Assets/DCS+Digital+Assets/ambassadors2011.jpg'>
    
the image association list should be this list:
    [ ["image", ["photos/horton.JPG"]], 
      ["of", ["photos/horton.JPG"]],
      ["instructor", ["photos/horton.JPG", "MissKate.JPG"]],
      ["(diane", ["photos/horton.JPG"]],
      ["horton)", ["photos/horton.JPG"]],
      ["ballet", ["MissKate.JPG"]], 
      ["ambassadors", ['http://web.cs.toronto.edu/Assets/DCS+Digital+Assets/ambassadors2011.jpg'] ]
    
Work through this small example by hand and make sure that you generate the same image association list. (It doesn't matter whether your write a string with single or double quotes -- it is still the same string. Note that the Wing shell always shows strings with single quotes.)

Although you are likely to generate your image association list in the exact same order as above, we will accept two variations:

Hints for finding and pulling apart the img tags

Some of your code has the task of going through the text of a web page to find all img tags. If an img tag has an alt attribute, it must find the value of the alt attribute as well as the value of the src attribute so that it can update the image association list. These hints will help you:

The filtering language

We invented a language for filtering all the pictures your crawler finds. Here is a simple example:

    :MTV: and not :Madonna:
    
This "filter description" says the user wants every image that is associated with the string "MTV" and is not associated with "Madonna". (IMPORTANT: This is not the same as every image that is associated with the string "MTV" plus every image that is not associated with "Madonna"!) More generally, we define a filter description to be a string made up of one or more terms separated by "and". Each term is a sequence of characters surrounded by colons, with optionally the word "not" before it. We will not test your program on invalid filters.

Notice that this filter description contains some uppercase letters. But our image association lists only store keys that have been converted to lowercase. When you apply a filter description, your code must convert each term to lowercase so that it can do a better job of matching.

Hints for pulling apart and processing the filter description

The split method will be very helpful for pulling apart the pieces of a filter description. Remember that you can split based on a single character or a longer string.

As you identify each piece of the filter, you need to process it. Here's how. Start by making a list of every image in your entire image association list. You're going to pare it down to exactly the images the user wants: For each term in the filter description, take the current list of images and reduce it to include only those that satisfy that term -- either by being associated with the string in it, or not associated with the string in it (if the term has a "not" on the front).

You may assume that neither the string and nor the string not will occur inside the paired colons of a filter description. For example, you will never have to process this filter description

        :andrew: and not :utm: and :notification:
    
It's an unrealistic assumption, but it makes the function easier to write.

Summary of how to deal with uppercase and lowercase characters

As we've said, tag names and attribute names can contain uppercase and lowercase characters. This means that you need to recognize the tag names "img", "IMG", "iMG" and every other combination, and you need to recognize attribute names "src", "SRC", "sRc" etc., and "alt", "ALT", "alT" etc.

Alt text can also contain uppercase and lowercase characters. In order to make matching with filter terms easier, the function record_associations converts each token to lowercase before storing it in the image association list, and the function process_filter_description converts each term to lowercase before using it to filter the list of images.

URLs are case sensitive. For example, this is the URL of the course discussion board:
https://csc.cdf.toronto.edu/bb/YaBB.pl?board=CSC108H1F
but this version with different capitalization fails:
https://csc.cdf.toronto.edu/bb/Yabb.pl?board=CSC108H1F
This means that when you find an img tag and insert the value of the src attribute into the image association list, you must preserve the capitalization.

Testing your code

As before, we are providing a self-test module that determines whether your code works correctly in some basic situations. We will do a much more thorough job of testing your code once you hand it in, so be sure that you have thoroughly tested it yourself.

Since all of your code is bundled into functions and we wrote the main block, the self-test module will only test your functions individually, not the whole module. (Our final autotesting will do the same.) The easiest way for you to test your code is also to test your functions individually. We suggest that you make a separate test module for each function or perhaps group of related functions. The first thing your test module(s) will do is import your module a2. You'll need to construct values to pass in to your functions, such as valid img tags and valid filter descriptions. For test cases that involve a string with quotes inside it, you'll find it easiest to surround the whole string in triple quotes. In some cases you can create values for testing one function by calling one of your other functions, once that other function is working correctly.

When you test function process_page, you must pass in the text of a web page. You should definitely not make this up by hand! You can use module urllib to grab actual web pages from the web, or you can save webpages to your computer and read the contents of the file into a string. Here is an example of how you can read a webpage into a string.

4. Advice on How to Tackle this Assignment

You don't have to use or understand the code in crawl.py in order to do this assignment. We've given it to you just so that the code you will write can be plugged into it to do something really cool. All you need to focus on is section 3, "Your Job". Understanding that should be your first task. It's very detailed material, so you can't just read it once through. You'll need to read it very carefully and think through examples. You'll probably find some things you aren't sure about, which will lead to re-reading, more examples, and perhaps asking some questions. Do this before you jump into the code.

Your next task is to read the starter code and understand what the functions must do. Many of the docstrings will be a challenge to understand. As you read a docstring, you should immediately write down a specific example of a call to the function and figure out (1) if it returns a value, what value it should return, and (2) if it changes any arguments, what it should change them to. You can also examine the self-test module to see an example of a call to each function and its expected results. (The code looks complicated, but most of it is just messy strings that are used to give you good feedback on any errors. For each test case, focus on the assignment statements that set things up and the boolean expression right after the keyword assert that checks the correctness of what your function did.) Having a concrete example for each function will help ensure that you understand the docstring and are well prepared before you try to write the function body. If you take it a step further and write that example as code that will test your function, you will be in great shape when it's time to test your code!

Once you've done all this, you are ready to write code. Think about which of the functions might be useful helpers to other functions. Write these functions first.

5. Marking

These are the aspects of your work that we will focus on in the marking:

6. What to Hand In

The very last thing you do before submitting should be to run the self-test module one last time. Otherwise, you could make a small error in your final changes before submitting that causes your code to receive zero for correctness.

You must hand in your work electronically, using the MarkUs online system. Instructions for doing so are posted on the Assignments page of the course website.

For this assignment, hand in a single module called

This should contain our starter code plus all of your own code, including all helper functions that you have written. Make sure that you have not altered our code in any way except to add missing function bodies and additional helper functions.

Once you have submitted, be sure to check that you have submitted the correct version; new or missing files will not be accepted after the due date. Remember that spelling of filenames, including case, counts. If your file is not named exactly as above, your code will receive zero for correctness.

7. Some things you might enjoy thinking about

There are many connections between the concepts in this program and the discipline of computer science. You might enjoy to read more and think ahead to what you might learn in further courses.