Arrays/Java/Finding Duplicates: Difference between revisions

Latest revision as of 02:39, 6 June 2017

Questions

Finding one duplicate

Question C-3.17) A is an array of size n>=2 containing integers from 1 to n-1, inclusive, one of which is repeated. Describe an algorithm for finding the integer in A that is repeated.

A special case of this question is when array A is sorted, since then we know that each array element should correspond to its own index plus 1, and we can do a binary search to locate the array index that does not match the criteria. In this case, the runtime is O(log n).

Finding multiple duplicates

Question C-3.18) Let B be an array of size n>=6 containing integers from 1 to n-5, inclusive. Five of the integers are repeated. Describe an algorithm for finding the five integers in B that are repeated.

Solution approach

There are a few different solution approaches, depending on the problem constraints:

Is the array sorted, or unsorted?
What if we have a one-time O(n) cost of construction?
Are we after a space-efficient or operationally-efficient algorithm?
What kind of data - primitive types or objects?
Do we have to use an array? Why not use a collections object?

(Square-bracket indexing is nice to have.)

Single element case

The cost of finding duplicates depends on problem parameters and tradeoffs. Assuming we want the absolute fastest algorithm, we are looking for a constant-time or logarithmic algorithm. (There are no O(1) algorithms here.)

Log time algorithms:

O(n log n) one-time cost to sort
O(log n) lookup time in a binary tree (less memory)
O(1) lookup time in a hash table (more memory)

In the particular case given, of the integers 1 through n corresponding to indices 0 through (n-1), we are looking for an element that meets a particular criteria: normally in binary sort the criteria is a (greater than/less than/equal to) condition. Here, the criteria is the place where the value of element i (at zero-indexed i) changes from i+1 to i. The binary search algorithm can be outfitted to find this change point, using its head, tail, and middle pointers.

Linear time algorithms:

O(n) add each element to a set, checking membership each time. O(1) lookup for membership in a set, iterating one item at a time, leads to O(n) time.

One-time expense algorithms:

O(1) lookup for O(n) cost of building a hash table, at a potentially high memory cost.
O(log n) binary search lookup for O(n log n) cost of sorting into a binary tree or similar structure, more efficient memory usage.

If we are going to be moving items in and out, what's the best approach? It all depends... Let's explore.

NOTE: Remember the ghost log. The cost of doing something with frequency of 1/j for j=1 to n is the harmonic number, which is O(log n).

Java Implementation:

Use a HashMap, with the key being the element in the array, the value being an integer counter.

Multi element case

In the multi-element duplicate case:

Code

@@ Line 24: / Line 24: @@
 ===Single element case===
-The fastest case is using a sorted structure, such as a binary search tree, to constructor (or reconstruct) the object.
+The cost of finding duplicates depends on problem parameters and tradeoffs. Assuming we want the absolute fastest algorithm, we are looking for a constant-time or logarithmic algorithm. (There are no O(1) algorithms here.)
-To check for a duplicate:
+'''Log time algorithms:'''
+* O(n log n) one-time cost to sort
+* O(log n) lookup time in a binary tree (less memory)
+* O(1) lookup time in a hash table (more memory)
-Using arrays:
+In the particular case given, of the integers 1 through n corresponding to indices 0 through (n-1), we are looking for an element that meets a particular criteria: normally in binary sort the criteria is a (greater than/less than/equal to) condition. Here, the criteria is the place where the value of element i (at zero-indexed i) changes from i+1 to i. The binary search algorithm can be outfitted to find this change point, using its head, tail, and middle pointers.
-Sorted array, single element duplicate:
+'''Linear time algorithms:'''
-* O(log n) - each element's index correlates to the element's value, so use a binary search to find the location where the correlation is off
+* O(n) add each element to a set, checking membership each time. O(1) lookup for membership in a set, iterating one item at a time, leads to O(n) time.
-Unsorted array, single element duplicate:
+'''One-time expense algorithms:'''
-* O(n) - to iterate through each element and check for membership in a set: if not a member then add it, otherwise it's a duplicate element.
+* O(1) lookup for O(n) cost of building a hash table, at a potentially high memory cost.
+* O(log n) binary search lookup for O(n log n) cost of sorting into a binary tree or similar structure, more efficient memory usage.
-Using trees:
+If we are going to be moving items in and out, what's the best approach? It all depends... Let's explore.
-* If we can build a search tree filled with references, the cost of constructing the tree is O(n), with each node being composed of a fixed number of pointers (to parent/child nodes).
-* Underlying data can be anything - a bunch of indexes in an array, or items in a queue, or pointers to locations in memory, etc., but must be accessible in O(1) time.
-To get a slightly more expensive answer but make it reusable,
+NOTE: Remember the ghost log. The cost of doing something with frequency of 1/j for j=1 to n is the harmonic number, which is O(log n).
-* O(n log n) - build a binary tree or other sorted data structure. Comoparable to cost of one-time sort. Heap sort is basically constructing a binary tree and doing a binary tree sort.
-* One-time cost, it is now sorted, and can do O(log n) searches. Like binary searches.
-Add entries to a counts hashtable. O(1) lookups for each element in hash table. sum them up.
+'''Java Implementation:'''
-* O(n) to add each element to a hashtable and get back a counter that is incremented.
+* Use a HashMap, with the key being the element in the array, the value being an integer counter.
-In-place sort and binary search for duplicates:
-* O(n log n) - to sort the array, O(n log n), then costs O(log n) to find duplicates, sequentially, using binary search
 ===Multi element case===