Working on Distributed Processing

Dec 13, 2021

My biggest bugbear is that as the network has grown, so has the time - and cost - of data processing.

Each morning, we process as many as 25,000 individual potential meteors from around 100 stations, spanning the last two nights (this is necessary to catch any late-arriving data from the night before last). That sounds like a lot, but it's actually only around 120-150 per station per night, though if we have a clear night during a meteor shower it can be much higher, often as many as 500.

Processing this data can take a while - for example the 2021 Perseid dataset from the night of the peak took 13 hours to process - so I am looking at ways to improve this.

Data processing consists of three parts:

  • identifying pairs or tuples of detections which are the same meteor seen by different cameras.

  • calculating the trajectory for each set of data. This uses a Monte Carlo model to find a best fit to the data.

  • creating a web page for each confirmed match.


Creating the web pages takes 10-15s per confirmed match, so further work is needed there, though it's not the top priority.

Pairing up the meteors is pretty quick, so I'm happy with that.

However, calculating a trajectory takes about 45-60s per match. That might not sound like much, but when you have 700 potential matches to check, it adds up. The Monte Carlo model is already parallelised on a per-match basis, running on ten to fourteen cores if available, so there's not much scope for improvement there.
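To illustrate what "parallelised on a per-match basis" means in practice, here's a minimal sketch, assuming the Monte Carlo samples for a single match are independent. The function names and numbers are placeholders, not the solver's actual code:

```python
import multiprocessing as mp
import random

def run_one_sample(seed):
    """One Monte Carlo realisation for a single match: perturb the
    observations within their estimated errors and re-fit the trajectory.
    (Placeholder: here we just return a random 'fitted' value.)"""
    rng = random.Random(seed)
    return rng.gauss(100.0, 5.0)

def solve_match(n_samples=200, n_cores=10):
    """Spread the Monte Carlo samples for one match across several cores,
    then combine them into a single best estimate."""
    with mp.Pool(processes=n_cores) as pool:
        fits = pool.map(run_one_sample, range(n_samples))
    return sum(fits) / len(fits)

if __name__ == "__main__":
    print(solve_match())
```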

So I've been looking at the possibility of distributed processing.

The idea, suggested to me by Richard Bassom, would be to pair up the meteors on the server and save the candidates. A process could then be run on multiple client computers that collected a candidate, solved it and posted the results back to the server, then fetched another candidate and so on till there were none left to examine. If we could use the Raspberry Pis, then we could have up to 100 nodes running in parallel and even if each pi took ten times as long, we'd finish in a tenth of the time (and at much lower cost). But it might also be possible to use serverless compute on AWS, or containers, or even just distribute the load amongst multiple AWS compute nodes.
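On the client side, each worker would be little more than a loop like the sketch below. This is only an illustration: the server URL, the /next-candidate and /results endpoints, and the solve_trajectory stub are hypothetical stand-ins, not an existing API.

```python
import requests

SERVER = "https://example.org/api"   # hypothetical coordination server

def solve_trajectory(candidate):
    """Placeholder for running the Monte Carlo solver on this node."""
    return {"candidate_id": candidate.get("id"), "status": "solved"}

def worker_loop():
    while True:
        # ask the server for the next unsolved candidate
        resp = requests.get(f"{SERVER}/next-candidate", timeout=30)
        if resp.status_code == 204:          # no candidates left to examine
            break
        candidate = resp.json()
        solution = solve_trajectory(candidate)
        # post the solution back, then immediately ask for more work
        requests.post(f"{SERVER}/results", json=solution, timeout=30)

if __name__ == "__main__":
    worker_loop()
```

The arithmetic is encouraging: 700 candidates at 45-60s each is roughly 9-12 hours on a single machine, but spread across around 100 Pis that each take ten times as long per match, the same queue would clear in about an hour.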

So, the game's afoot...
