Everyone presented their latest work, and discussed the latest image datasets from AstraZeneca and Vironova. During the software workshop session, we discussed linking the HASTE cloud pipeline to the Vironova MiniTEM.
Thanks to: Carolina Wählby, Ola Spjuth, Andreas Hellander, Ida-Maria Sintorn, Alan Sabirsh, Ernst Ahlberg Helgee, Johan Karlsson, Håkan Wieslander, Philip Harrison, Salman Toor, Ben Blamey, Håkan Öhrn, Markus M. Hilscher, Niharika Gauraha, Magnus Larsson, Oliver Stein, Andy Ishak
HASTE has been featured in ‘Framtidens Forskning’: “As more and more instruments are generating more and more data, we need new methods to not completely drown in data volumes. Our tools make it possible to know in advance where to focus the analysis, which greatly reduces time-consuming and streamlines resource usage” said Prof. Carolina Wählby, Principal Investigator for the HASTE project. Read the full article.
The HASTE team is pleased to announce a new publication on the arXiv pre-print service: ‘Apache Spark Streaming and HarmonicIO: A Performance and Architecture Comparison’. We performed a benchmark analysis to compare two stream processing frameworks: the popular Apache Spark framework, widely used in industry, and our own framework, HarmonicIO (presented this summer at IEEE Cloud 2018 in San Francisco).
Previous studies have demonstrated that Apache Spark, Flink and related frameworks can perform stream processing at very high frequencies, but they tend to focus on small messages with a computationally light ‘map’ stage for each message, a common enterprise use case (for example, processing JSON documents). In academic HPC contexts, we often want to analyze larger messages, with more CPU-intensive computations. Our study adds to these benchmarks by broadening the domain to include such processing loads: larger messages (leading to network-bound throughput) and computationally intensive map stages (leading to CPU-bound throughput). The aim is to evaluate how applicable these frameworks are to scientific computing applications.
We find that relative performance varies considerably across this domain, with the chosen means of stream source integration having a big impact. Most interestingly, we find that Spark performs very well for large (~10 MB) and small (~1 KB) message sizes, but for medium-sized messages it can be outperformed by HarmonicIO in some configurations. These message sizes are relevant to HASTE, because such file sizes are typical of microscopy applications.
We offer recommendations for choosing and configuring the frameworks, and present a benchmarking toolset developed for this study.
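To make the distinction between network-bound and CPU-bound throughput concrete, here is a minimal back-of-envelope sketch. It is not the benchmarking toolset from the paper: the function name, the 10 Gbit/s link and the 8 worker cores are all hypothetical values chosen for illustration.

```python
def max_message_rate(message_bytes, cpu_seconds_per_message,
                     link_bytes_per_second=1.25e9, worker_cores=8):
    """Return (upper bound on messages per second, limiting resource).

    Assumed defaults (hypothetical, not taken from the study):
    a 10 Gbit/s link (1.25e9 bytes/s) and 8 worker cores.
    """
    # The network can deliver at most this many messages per second.
    network_limit = link_bytes_per_second / message_bytes

    # The workers can process at most this many messages per second.
    if cpu_seconds_per_message == 0:
        cpu_limit = float("inf")
    else:
        cpu_limit = worker_cores / cpu_seconds_per_message

    if network_limit <= cpu_limit:
        return network_limit, "network"
    return cpu_limit, "cpu"


# Large messages with a trivial map stage: the link is the bottleneck.
print(max_message_rate(10_000_000, 0.0))
# Small messages with a heavy map stage: the cores are the bottleneck.
print(max_message_rate(1_000, 1.0))
```

A real framework adds its own overheads on top of these ideal bounds, which is what a benchmark such as the one described above sets out to measure.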
We had a successful project meeting in Uppsala/Stockholm last month: Håkan Wieslander presented his latest research on image feature analysis, Phil Harrison his latest conformal prediction models, and Niharika Gauraha her work on SVM+, while Ben Blamey demonstrated the prototype HASTE pipeline. Alan Sabirsh and Johan Karlsson explained a little more about their work at AstraZeneca.
On day 2, we visited Vironova in Stockholm, and were treated to a hands-on demo of their MiniTEM electron microscope – and discussed plans for the next project phase.
Oliver’s MSc thesis will investigate intelligent ways to manage and position Docker containers in a VM environment, in order to use physical resources more efficiently while maintaining performance. The controller system will be developed in coordination with the HarmonicIO streaming framework used in HASTE, which will support automatic scaling of the containers in the system and allow the design to be evaluated with a real use case.
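As an illustration of what positioning containers to save physical resources can mean, the sketch below uses first-fit decreasing, a classic bin-packing heuristic. This is an assumption made for the example, not necessarily the approach the thesis will take. The function name and the millicore figures are hypothetical.

```python
def first_fit_decreasing(container_millicores, vm_capacity=1000):
    """Place containers on as few VMs as the heuristic can manage.

    container_millicores: CPU demand of each container (integers).
    vm_capacity: CPU capacity of each VM, in the same unit.
    Returns a list of VMs, each a list of the demands placed on it.
    """
    vms = []
    # Placing the largest containers first tends to waste less space.
    for demand in sorted(container_millicores, reverse=True):
        if demand > vm_capacity:
            raise ValueError("container does not fit on any VM")
        for vm in vms:
            # Use the first VM that still has room for this container.
            if sum(vm) + demand <= vm_capacity:
                vm.append(demand)
                break
        else:
            # No existing VM has room, so start a new one.
            vms.append([demand])
    return vms


placement = first_fit_decreasing([500, 700, 300, 200, 300])
print(len(placement), placement)
```

Here five containers fit on two fully used VMs. A production controller would also have to account for memory, changing load and the cost of moving containers.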
Discovering new drugs is becoming more costly. Lars Carlsson gave a presentation, ‘Machine Learning For Smarter Drug Discovery’, at RISE SICS Data Science & AI Day on Nov 28, 2017, with examples of how AstraZeneca is trying to improve the drug discovery phases through the use of machine learning.
We welcome Phil Harrison as a new PhD student in the Spjuth lab. Phil obtained his first PhD in marine biology in 2006, studying the population dynamics of grey seals. Between 2006 and 2016 he undertook several research projects modelling wildlife populations and analysing trends in biodiversity. In the HASTE project, Phil will develop machine learning methods for online, large-scale analysis of microscopy image data based on statistical learning, including e.g. conformal prediction and probabilistic prediction.
We are nearing the end of an intensive recruitment period, looking for excellent established and emergent scientists to help us realize the goals of this interdisciplinary project.
This week we are very pleased to welcome Dr. Ben Blamey to the team. He will work in the Hellander lab, in close collaboration with Dr. Salman Toor, and focus on computer science challenges in designing and developing smart and efficient systems for managing scientific data, and image data in particular, in distributed computing infrastructure such as hybrid and fog cloud.
With a background in machine learning and natural language processing research, and in developing cloud infrastructure services in both academia and industry, Dr. Blamey brings critical experience to the team.
In the featured image Dr. Blamey (right) is busy discussing a potential design of an intelligent system to manage information hierarchies in distributed environments with Dr. Toor (left).
We are happy to welcome Håkan Wieslander to the team and to PhD education at the Department of Information Technology, Uppsala University!
Håkan grew up in Lund, Sweden, and moved to Uppsala in 2011 to study Engineering Physics. In 2017 he obtained a master's degree in computational science. His MSc thesis was about classification of malignant cells using deep learning.
About the PhD project within HASTE:
Collection of large amounts of data often results in high-quality, highly informative data intermixed with data that is either of poor quality or of little interest in relation to the question at hand. Wieslander’s thesis work will focus on development of computationally inexpensive measurements that will identify non-informative data early on in the analysis process; either online at data collection, or off-line prior to full data analysis. The challenge is to use minimal computational time and power to extract a broad range of informative measurements from spatial-, temporal-, and multi-parametric image data, useful as input for conformal predictions and efficient enough to work well in a streaming setting.
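As a rough illustration of a computationally inexpensive measurement used to discard non-informative data in a streaming setting, the sketch below filters images by the variance of their pixel intensities. This is an assumed stand-in, not one of the measurements the thesis will develop. The function names and the threshold are hypothetical.

```python
from statistics import pvariance


def intensity_variance(image):
    """Variance of pixel intensities: near zero for blank or uniform frames."""
    pixels = [pixel for row in image for pixel in row]
    return pvariance(pixels)


def keep_informative(images, min_variance=10.0):
    """Lazily yield only images whose variance reaches the threshold.

    Works on any iterable, so it can sit in front of a more expensive
    analysis step and process images one at a time as they arrive.
    """
    for image in images:
        if intensity_variance(image) >= min_variance:
            yield image


blank = [[5, 5], [5, 5]]          # uniform frame, no structure
textured = [[0, 255], [255, 0]]   # high-contrast frame

kept = list(keep_informative([blank, textured]))
print(len(kept))
```

A single statistic like this is far too crude for real microscopy data. The point is only that a cheap per-image score can decide early whether an image is passed on for full analysis.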