Imagine, you have a collection of d (d > 10) documents each containing at least w (w > 50) English words. For testing purposes, you may wish to construct or collect such a collection

Algorithmic Data Science

Scenario

Imagine, you have a collection of d (d > 10) documents each containing at least w (w > 50) English words. For testing purposes, you may wish to construct or collect such a collection. A document can be represented as a bag of words, i.e., as a set of words, it contains together with associated frequencies.

This is most naturally stored using a Python dictionary – and this is called a sparse representation of the data. Alternatively, the dictionary can be converted to a vector (e.g., a NumPy array) where there is a dimension for each of the V words in the English vocabulary.

For this assessment, you are asked to submit a Jupyter notebook that reports on what you have learned about the complexity of different algorithms for finding the similarities of documents using a bag-of-words representation.