Submission format: You should electronically submit (via Canvas) one file that should either be a Jupyter notebook or a zip file containing a Jupyter notebook and any other files that you want to include. The file name should contain your candidate number.

Marking and feedback: You will be told your mark and receive feedback via Canvas.

Weighting: This assessment is worth 50% of your overall mark for this module.

Overview

For this assessment you are asked to submit a Jupyter notebook that reports on what you have learnt about the complexity of different algorithms for finding the similarities of documents using a bag-of-words representation.

Scenario

Imagine, you have a collection of d (d > 10) documents each containing at least w (w > 50) English words. For testing purposes, you may wish to construct or collect such a collection. A document can be represented as a bag-of-words, i.e., as the set of the words it contains together with associated frequencies. This is most naturally stored using a Python dictionary – and this is called a sparse representation of the data. Alternatively, the dictionary can be converted to a vector (e.g., a numpy array) where there is a dimension for each of the V words in the English vocabulary. This is called a dense representation.

During lab sessions you will have been expected to have implemented various algorithms that will enable you to address each of the following five questions:

1. Present an analysis of the theoretical running time of Jaccard’s similarity measure applied to large documents represented as bags of words (in a Python dictionary). Test your analysis empirically by timing and plotting various calculations of Jaccard similarity on your computer. Estimate relevant constants for your implementation and computer.

2. What is the theoretical worst case running time of the cosine similarity measure applied to documents represented as (dense representation) vectors? Show that this is the case empirically. Estimate the constant for your implementation and computer. Compare using the implementation of the dot product in numpy with your own implementation.

3. Write a function which computes cosine similarity directly from sparse (dictionary) representations without converting them into dense (vector) representations. Test your function for correctness and compare its efficiency theoretically and empirically to (i) your previous implementation of the cosine similarity, and (ii) your implementation of Jaccard’s measure.

4. Write a function which computes all-pairs similarities for a collection of documents. The function should take a list of dictionaries (the document collection) and a parameter specifying the similarity measure to be used. What is the theoretical worst-case running time for computing all-pairs similarities? Does it matter what the similarity measure is? Can you give an estimate of how long it would take to compute all-pairs similarities for 200K documents for both measures? (Note: that whilst you should test your function for all-pairs similarities (with d > 10), you do not need to prove the theoretical worst case empirically or test with 200K documents!)

5. Write a function that implements all-pairs similarities for documents and uses some form of parallel computing, e.g. MapReduce. Make sure you test your function empirically for correctness and for efficiency. Investigate the number of parallel processes that gives optimal results for your implementation and computer.

Marking Criteria and Requirements

This coursework will be marked out of 100, based on the following criteria:

Overall quality of report: 10 marks

This concerns issues such as writing style, organisation of material, clarity of presentation, etc.

• Your report should be no more than 2000 words in length (excluding the content of graphs, tables, references and code). You should specify the length of your report.

• Your report must be a Jupyter notebook. It is not acceptable to submit a pdf or a Word document instead – if you do then your coursework will not be marked.

• You should use a formal writing style.

• A good way to organise the report would be to have seven sections: an introductory section; a section for each of the 5 questions above; and a conclusions / summary section.

• Use subsections with meaningful headings.

Quality of code: 25 marks

This concerns issues such as correctness and clarity of code.

• Since you are submitting a Jupiter notebook, it is very easy for us to check that your code actually works. Make sure it does. If it requires any special libraries to be installed you should make this clear.

• You are encouraged to use library methods other than for the particular algorithms being investigated.

• You should use functions (or classes) to modularize your code.

• You should use comments where appropriate in your code.

• You should test functions thoroughly. Make sure you state whether there are particular inputs for which your functions will not behave as expected.

• The Jupyter notebook / report that you submit should include all of the Python code you have written or adapted i.e., all code that you have used in your experiments other than library methods.

Technical understanding: 15 marks

This concerns how well you have explained each of the algorithms that you have used in your experiments.

• Algorithms should be explained in sufficient detail that a well-educated data scientist who does not know these particular algorithms will be able to understand them.

• Do not just state the theoretical complexity of algorithms – make sure you discuss how the structure of the algorithm leads to this theoretical result.

• Do not just repeat details directly from the lecture notes or other sources. Explanations should be in your own words and contain examples from your experiments.

Quality of analysis: 50 marks (10 marks for each of the 5 questions)

This concerns the quality of your answers to each of the questions.

• Make sure you answer each question fully. If you are unable to do so, be explicit about it.

• You should use an appropriate experimental method to determine values empirically.

• You should use graphs where appropriate in the presentation of your results.

• You should discuss any variations between theory and empirical findings.

• Do not give values to more significant figures than justified.