Solving the DNA Splicer
In Challenge 5 of the 2018 Tuenti Challenge, we are given DNA strands (strings containing a mix of
aAcCgGtT), and asked to find how we can recombine the strands into two matching sequences. Not all of the strands need to be in the final identical sequences. For example, given the input
aacc GgTTC aaccG gTTC aaaa cccc gttc, we need to discover that the sequence
aaccGgTTC can be built twice:
aaccG+gTTC. We then return the positions of the parts we used to create the two identical sequences, in this case 1, 2, 3 and 4.
This is an interesting problem, and I encourage you to take a few minutes to think about how you would solve this before reading on.
The maximum number of parts (18) combined with the time constraints rule out the option of enumerating all the possible combinations of parts. We need to find a slightly more efficient algorihtm.
Let’s start by assuming a certain part, call it X, is the first part of one of the final sequences. Then we know that there must be another part, Y, where one of the following is true:
- Y is longer than X, and starts with all the characters in X, or
- Y is shorter than X, and all its characters match with the first characters of X, or
- Y is exactly equal to X, in both length and value
If there is no such part Y, our hypothesis is invalidated, and X cannot be the start of the final sequence. If there is a part Y, we can now ignore all the matching characters in X and Y, and continue. In our example above, let’s say we found X =
aacc and Y =
aaccG. The remaining characters that did not match, in this case
G, now becomes the new X, and we repeat the procedure from before, remembering that we can’t use part X or Y anymore. We know that there must be another part where one of the 3 cases from above is true. If there is not, we backtrack until we find another X, Y pair that has not yet been considered.
If we ever hit case 3 (i.e. X and Y match exactly) we must have two identical sequences, and we are done! If we consider every possible starting part as X, then we must have covered all the possible combinations. We just need to keep track of the parts we used to get to case 3, and then return those parts as the answer. The following Python code implements this algorithm. (You can ignore the telnet-related code: it is for interacting with the challenge server.)
from __future__ import print_function, unicode_literals, division from collections import namedtuple, defaultdict, deque import itertools import telnetlib def find_identical_dna_sequences(parts): """ Recursively check string concatenations and eliminate impossible branches quickly. """ used = [0 for i in range(len(parts))] for i, part in enumerate(parts): X = part Y = "" used[i] = 1 ans = _find_matching_parts(X, Y, parts, used) if ans is not None: return ",".join([str(i+1) for i in range(len(used)) if used[i]]) used[i] = 0 return None def _find_matching_parts(X, Y, parts, used): # case 3 if X == Y: return used if len(Y) > len(X): X, Y = Y, X diff = X[len(Y):] for i, part in enumerate(parts): if used[i]: continue # case 1 and 2 if part.startswith(diff) or diff.startswith(part): used[i] = 1 ans = _find_matching_parts(X, Y+part, parts, used) if ans is not None: return ans used[i] = 0 return None tn = telnetlib.Telnet("188.8.131.52", port="3241") tn.read_until('> Please, provide "TEST" or "SUBMIT"'.encode('ascii')) tn.write("SUBMIT\n".encode('ascii')) s = tn.read_until("\n".encode('ascii')) while s: s = s.decode('ascii').strip() if not s or s.startswith(">"): print("reading", s) s = tn.read_until("\n".encode('ascii')) continue ans = find_identical_dna_sequences(s.split(" ")) print(ans) tn.write((ans + "\n").encode('ascii')) s = tn.read_until("\n".encode('ascii'))
Though it was not necessary in this instance, we could have improved the algorithm by using a Radix Tree-like structure to more efficiently find parts with matching prefixes.