Parallelization and Hadoop

The short question: Has anyone used Mathematica in conjunction with Hadoop, and does Mathematica's built-in parallelization play well with Hadoop?

The long version:

So I have a Mathematica program that I would like to do the following:

1. I have a kernel that does some initial computations and produces some sets of equations, which it outputs as context files to a bucket of some kind. The way it does this is essentially by searching a binary tree until it either a) finds a solution, b) finds a contradiction and thus prunes that branch, or c) can't solve it for some reason.
2. I have several remote kernels running which monitor this directory and pick out context files (which are essentially sets of equations) to try to solve. If they succeed, they throw the solutions that they've found into a bucket for solutions. If they produce more equations to be solved, I want them to put them into context files and put them back in the original bucket. If they fail for some reason (which is for all intents and purposes saying that the algorithm I just used to try to solve them did not work), I want them to save the context they are working on as is and put it in a separate bucket that I somehow mark "hard".
3. I want to have certain kernels which are marked to look into the "hard" bucket and try more intensive algorithms for solving them. I would like them to do this in an "intelligent" way, whatever that ends up being.
4. I produce new sets of context files for computation by recursing further down the tree. I would like to (somehow) treat my bucket as a priority queue so that context files generated at greater depth are given priority over those closer to the root.
5. When all is said and done and I have (hopefully) produced all of the sets of solutions that I can to this system of equations, I want to have a kernel that goes through the sets of solutions and determines which of them are equivalent.
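
To make the priority-queue idea above concrete, here is a minimal sketch in Python (used only as a neutral illustration, not as part of the existing Mathematica code). It assumes each context file is tagged with its tree depth when it is written; the class name `DepthPriorityBucket` and the file names are hypothetical, and the bucket is held in memory rather than on disk.

```python
import heapq
import itertools


class DepthPriorityBucket:
    """In-memory stand-in for the bucket: deeper contexts are handed out first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker, so entries at the same
        # depth come out in the order they were added.
        self._counter = itertools.count()

    def put(self, depth, context_name):
        # heapq is a min-heap, so the depth is negated to pop the deepest first.
        heapq.heappush(self._heap, (-depth, next(self._counter), context_name))

    def get(self):
        neg_depth, _, context_name = heapq.heappop(self._heap)
        return -neg_depth, context_name

    def __len__(self):
        return len(self._heap)


# Hypothetical context files, tagged with the depth at which they were produced.
bucket = DepthPriorityBucket()
bucket.put(1, "root_left.mx")
bucket.put(3, "deep_a.mx")
bucket.put(2, "mid.mx")
bucket.put(3, "deep_b.mx")

order = [bucket.get() for _ in range(len(bucket))]
```

On a shared directory, one way to get a similar effect without a central queue might be to encode the depth in the file name and have each worker sort the directory listing before choosing a file, though that is also only an assumption about how the bucket could be laid out.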

We (since this is certainly not a one-person effort) have been looking at using the parallelization capabilities built into Mathematica for this task. One advantage is that when I initialize remote kernels, Mathematica is supposed to have a means of making sure that the context running in each kernel has the appropriate definitions.

There are a few apparent problems we have identified. One is handling the file distribution: ideally we would like to make sure that two kernels are never trying to solve the equations in the same context at the same time. Another is that at some point all of the remote kernels will be doing disk reads and writes in the same directory, which would probably be bad. Additionally, as it stands right now, the only way we can think of to do this with the built-in Mathematica parallelization requires that all communication go through the original kernel which spawned the process, whereas we would like to decentralize the algorithm to make it as modular as possible. Finally, not all of the problems that this software is used to solve are beyond the reach of a single kernel; however, as it stands, running the current version on a single kernel still requires treating the program as parallelized.
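
One common way to keep two workers from picking up the same context file is to have each worker "claim" a file by renaming it out of the bucket before reading it. The sketch below illustrates that idea in Python; it is not the existing implementation. The function name `claim_context`, the directory names, and the file name are all hypothetical, and it assumes the bucket and the claimed directory live on the same local or POSIX-like filesystem, where a rename is atomic. Network filesystems may not give the same guarantee.

```python
import os
import tempfile


def claim_context(bucket_dir, claimed_dir, filename):
    """Try to claim one context file by moving it out of the bucket.

    Returns the new path on success, or None if the file is already gone,
    for example because another worker claimed it first.
    """
    source = os.path.join(bucket_dir, filename)
    target = os.path.join(claimed_dir, filename)
    try:
        # Assumption: both directories are on the same filesystem, so the
        # rename either fully succeeds or fails; no worker sees a half move.
        os.rename(source, target)
    except FileNotFoundError:
        return None
    return target


# Small demonstration in a temporary directory with a placeholder file.
root = tempfile.mkdtemp()
bucket_dir = os.path.join(root, "bucket")
claimed_dir = os.path.join(root, "claimed")
os.makedirs(bucket_dir)
os.makedirs(claimed_dir)
with open(os.path.join(bucket_dir, "context_0001.mx"), "w") as handle:
    handle.write("placeholder")

first = claim_context(bucket_dir, claimed_dir, "context_0001.mx")   # succeeds
second = claim_context(bucket_dir, claimed_dir, "context_0001.mx")  # file already claimed
```

Mathematica has a `RenameFile` function that could play the same role, so this pattern would not by itself require abandoning the existing kernels. It does nothing, however, about the load of many kernels reading and writing one directory.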

I am basically familiar with Hadoop, its distributed file system (HDFS), and the MapReduce paradigm that it uses. As I see it, steps 1-4 above could be considered the map step of an algorithm and step 5 could be a reduce step. Additionally, HDFS seems like it would provide a solution to the file system problems.
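
For the final step to fit a reduce phase, equivalent solutions would need to map to the same key so that they end up grouped together. The Python sketch below shows that grouping under a deliberately simple assumption: a solution is a list of (variable, value) pairs, and two solutions are equivalent when they contain the same pairs in any order. The real equivalence test is presumably more involved, and `canonical_form` and `group_equivalent` are hypothetical names.

```python
from collections import defaultdict


def canonical_form(solution):
    # Assumption: sorting the (variable, value) pairs gives a key that is
    # identical for all solutions considered equivalent.
    return tuple(sorted(solution))


def group_equivalent(solutions):
    """Group solution indices by canonical key, as a reduce step would."""
    groups = defaultdict(list)
    for index, solution in enumerate(solutions):
        groups[canonical_form(solution)].append(index)
    return sorted(groups.values())


# Hypothetical solutions; the first two differ only in ordering.
solutions = [
    [("x", 1), ("y", 2)],
    [("y", 2), ("x", 1)],
    [("x", 0), ("y", 3)],
]
classes = group_equivalent(solutions)
```

If equivalence can only be decided by comparing solutions pairwise, with no canonical key available, the step maps less naturally onto MapReduce.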

The potential problem with Hadoop is how to implement access through Mathematica. I have run across the HadoopLink (https://github.com/shadanan/HadoopLink) project, and it seems like it could provide some of the framework we desire. However, some of us have already done work on implementing a solution using the Mathematica parallelization functions, and it is highly desirable not to have to abandon this code, especially since it's what's likely to be optimized for doing parallelization with Mathematica.

My questions are as follows:

1. As above, has anyone had any experience with trying to get these two things to work together, and if so, was it worth it?
2. Does anyone have any experience with the HadoopLink project, and is it compatible with the Mathematica parallelization? I'm emailing the GitHub project owner, but there are also three other forks of the project out there.
3. Would this be killing a fly with a bazooka, or using a 40 lb sledge where a 12 oz claw hammer would do? Part of the reason for investigating this is that we anticipate having access to a moderate number (a few dozen) of machines on which we can run kernels. On the one hand, we would like not to get stuck thinking too small and have to implement something new all over again as we scale up; on the other hand, we don't want to waste time trying to anticipate problems it was never reasonable to expect in the first place.
4. Is there a better way to do this? At this point, things are rather exploratory and I have no problem with taking new suggestions.

Thanks!
