Meeting Summary and Notes

EXECUTIVE SUMMARY

The “Big Data Problems in Radio Astronomy” meeting was held from 9:30am – 4:00pm, Friday 3rd February 2017, at the Mellon Institute of Carnegie Mellon University (CMU). There were approximately 30 attendees, from the following places: various computer science-related departments at CMU; the CMU Physics Department and Statistics Department; the Pittsburgh Supercomputing Center (PSC); the West Virginia University (WVU) Physics and Astronomy Department; the WVU Computer Science and Electrical Engineering Department; the Green Bank Observatory (GBO); and the Google Pittsburgh Office. The format consisted of a series of topical presentations in the morning and early afternoon (to be posted online soon), followed by a session of round-table discussions to identify key topics of interest and areas of potential collaboration. A set of shared notes (appended below) was taken in real time.

The meeting was very productive, and it was quite apparent that there is both the interest and the opportunity to develop potentially very fruitful collaborations. A number of compelling Use Cases were identified for further study. Unfortunately, the early departure of a few key CMU computer scientists, and a lack of time at the end of the meeting, prevented us from firming up the details in real time. Accordingly, we make the following recommendations:

  1. We should retain the “bigdata” mailing list, and continue our discussions offline.
  2. We propose a second meeting, to be held at WVU, approximately three to six months from now (June – August). We should strive to have firmer ideas formulated in advance of that meeting (provisional organizer Sarah Burke-Spolaor).
  3. At this point, the most promising areas for collaboration are:
    1. Processing GBT Argus spectral line data through the GBT data analysis pipeline, which is already installed at the PSC (Nichol Cunningham, GBO; Joel Welling, PSC).
    2. Investigating potentially more effective artificial intelligence approaches that can be guided by our knowledge of the problem domain, e.g. the precise dispersion relationship that pulsars and transient sources must obey (see below). This might be informed by recent, major advances in machine vision (team to be decided).
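
For reference, the dispersion relationship mentioned in the second area is the standard cold-plasma delay between two observing frequencies (DM is the dispersion measure, in pc cm^-3):

    delay ≈ 4.15 ms × DM × [ (ν_lo / GHz)^-2 - (ν_hi / GHz)^-2 ]

Any broadband signal that does not follow this quadratic frequency sweep can, in principle, be rejected as interference.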

Richard Prestage, Rachel Mandelbaum, Mark Whitehead (post-facto, self-appointed Scientific Organizing Committee), 9th February 2017.

 

REAL-TIME NOTES

(Almost exclusively compiled by Mark Whitehead)

Introduction

Richard (GBO)

  • Single Dish versus Interferometer (relevant for processing requirements).
  • GBT, VLA, SKA example data rates all qualify as Big Data.
  • The GBT currently decimates its data, but retaining the full data stream could improve science results.
  • Interested in collaborating on optimizing system engineering, faster data reduction capabilities, and more sophisticated information extraction.
  • (Dan) Estimated that the GBO data link could sustain roughly 30 PB/yr of data throughput.
  • (Mike Levine) PSC has approximately 20 PB of data storage available.
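
As a back-of-envelope check (not discussed at the meeting), a sustained 30 PB/yr corresponds to

    30×10^15 bytes / 3.15×10^7 s ≈ 0.95 GB/s ≈ 7.6 Gb/s,

so the 20 PB at PSC would hold roughly eight months of such a stream.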

Big Data Challenges in Radio Astronomy

DJ Pisano (WVU)

  • Processing VLA spectral line data (CHILES)
  • 130 TB; 30 h to reduce 6 h of data, which is currently throttling data acquisition.
  • Custom pipeline requires human intervention to assess data quality.
  • The data structure is complex; visualization requires averaging.
  • Calibration is also complex (including flagging for interference), and assessing calibration quality is challenging.
  • Data quality assessment (DQA) requires averaging the data because the existing software is too slow for current data set sizes (see the averaging sketch after this list).
  • DQA requires visualizing multidimensional data, primarily to eliminate RFI.
  • Imaging is accomplished on AWS.
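
To illustrate the kind of averaging used for quick-look DQA, here is a minimal Python sketch; the array shapes, averaging factors, and outlier threshold are illustrative assumptions, and this is not the CHILES pipeline itself.

    import numpy as np

    # Hypothetical visibility amplitudes: (n_time, n_baseline, n_channel).
    rng = np.random.default_rng(0)
    vis_amp = rng.rayleigh(1.0, size=(600, 351, 2048))

    def bin_axis(a, axis, factor):
        """Average an array along one axis in blocks of `factor` samples."""
        a = np.moveaxis(a, axis, -1)
        n = (a.shape[-1] // factor) * factor
        a = a[..., :n].reshape(*a.shape[:-1], n // factor, factor).mean(axis=-1)
        return np.moveaxis(a, -1, axis)

    # Average down in time (x10) and frequency (x8) for a quick look.
    quick = bin_axis(bin_axis(vis_amp, axis=0, factor=10), axis=2, factor=8)

    # Crude quality indicator: fraction of averaged samples far from the median.
    med = np.median(quick)
    mad = np.median(np.abs(quick - med))
    outlier_fraction = np.mean(np.abs(quick - med) > 5 * 1.4826 * mad)
    print(quick.shape, f"outlier fraction ~ {outlier_fraction:.3%}")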

Nichol Cunningham (GBO)

  • Also analyzing spectral-line data; the issue is extracting useful inferences from the data.
  • Spectral line cube: two spatial dimensions, one frequency dimension.
  • Described the data structure (pixels) and the typical data size and processing time for a single spectral line. Data are typically smoothed so that cubes can be processed more quickly.
  • Relies on moment maps, which results in less ‘data fidelity’ but is tied to the entire data processing approach (?). (A moment-map sketch follows this list.)
  • Making inferences about the physics requires multiple lines, multiple wavelengths over many observed regions.
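
For reference, a minimal Python sketch of the moment maps mentioned above; the cube shape, units, and noise threshold are illustrative (real cubes would be read from FITS files).

    import numpy as np

    # Hypothetical spectral-line cube: (n_chan, n_y, n_x), brightness in K,
    # with a matching velocity axis in km/s.
    rng = np.random.default_rng(1)
    n_chan, n_y, n_x = 128, 64, 64
    cube = rng.normal(0.0, 0.1, size=(n_chan, n_y, n_x))
    velocity = np.linspace(-32.0, 32.0, n_chan)          # km/s
    dv = velocity[1] - velocity[0]

    # Moment 0: velocity-integrated intensity (K km/s) per pixel.
    mom0 = cube.sum(axis=0) * dv

    # Moment 1: intensity-weighted mean velocity (km/s), computed only where
    # the integrated emission is significant.
    total = cube.sum(axis=0)
    mask = np.abs(total) > 3 * 0.1 * np.sqrt(n_chan)     # crude noise cut
    mom1 = np.where(mask, (cube * velocity[:, None, None]).sum(axis=0) / total, np.nan)

    # Collapsing 128 channels into two 2-D maps is fast but discards
    # line-profile detail, which is the 'data fidelity' trade-off noted above.
    print(mom0.shape, mom1.shape)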

Ryan Lynch (GBO/NANOGrav)

  • Continuum/Pulsar
  • MSP observations/science requires fast sampling, high spectral resolution, and searching each pixel independently.
  • Tension between observation duration and the computational complexity of the data processing.
  • Processing is embarrassingly parallel; NANOGrav uses supercomputing resources at McGill.
  • Real-time searches required …
  • Pulsar candidates have so far been vetted by people; it would be helpful if that process could be automated. The current process relies on statistical measures to detect candidates; new algorithms or ML could help here.
  • It is useful to reanalyze data sets, which requires long-term storage.
  • Pulsar timing calculations are also computationally challenging. There are other figures of merit that are not calculated because the computations are currently infeasible. (A sketch of the basic search follows this list.)
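
A minimal Python sketch of the standard dedisperse-and-FFT periodicity search that such pipelines parallelize over trial DMs and sky positions; the parameters are toy values, and production searches use dedicated packages (e.g. PRESTO).

    import numpy as np

    # Toy filterbank block: (n_chan, n_samp) of detected power, 64 us sampling.
    n_chan, n_samp, tsamp = 64, 2**16, 64e-6
    freqs = np.linspace(1.5, 1.1, n_chan)                # GHz, high to low
    rng = np.random.default_rng(2)
    data = rng.normal(0.0, 1.0, size=(n_chan, n_samp))

    def dedisperse(data, freqs, dm, tsamp):
        """Incoherent dedispersion: shift each channel by the cold-plasma delay
        dt = 4.15 ms * DM * (f^-2 - f_ref^-2), with f in GHz and DM in pc cm^-3."""
        delays = 4.15e-3 * dm * (freqs**-2 - freqs[0]**-2)   # seconds
        shifts = np.round(delays / tsamp).astype(int)
        series = np.zeros(data.shape[1])
        for channel, shift in zip(data, shifts):
            series += np.roll(channel, -shift)
        return series

    # One DM trial; a real search loops over thousands of trial DMs in parallel.
    timeseries = dedisperse(data, freqs, dm=50.0, tsamp=tsamp)

    # FFT periodicity search: look for significant peaks in the power spectrum.
    power = np.abs(np.fft.rfft(timeseries - timeseries.mean()))**2
    power /= np.median(power)
    spin_freq = np.fft.rfftfreq(n_samp, d=tsamp)
    best = np.argmax(power[1:]) + 1
    print(f"strongest candidate: {spin_freq[best]:.3f} Hz at {power[best]:.1f}x median power")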

Richard Prestage (GBO), channeling Dan Werthimer

  • SETI Big data with the GBT et al
  • SETI@GBT 2-8 PB/year of candidate data (not full data)
  • Analysis pipeline includes a series of filters, primarily composed of RFI rejection and pattern detection, to produce a set of candidates for follow-up observation and data processing.
  • The largest problem is the large number of false positives; the goal is to avoid throwing away genuine signals during the filtering stages.
  • Real-time, automated RFI mitigation should be part of the observatory’s operations.
  • Limited by algorithms and processing capability.
  • Approaches have been prototyped to resolve this (e.g. using reference antennae), but nothing has been implemented for production observing systems. (A sketch of the reference-antenna idea follows this list.)
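
A minimal Python sketch of the reference-antenna idea mentioned above; the array shapes and threshold are illustrative assumptions, not a description of any deployed system.

    import numpy as np

    # Hypothetical detected-power data, (n_time, n_chan), from the science beam
    # and from an off-axis reference antenna that sees terrestrial RFI but
    # (ideally) no astronomical signal.
    rng = np.random.default_rng(3)
    n_time, n_chan = 4096, 512
    science = rng.normal(10.0, 1.0, size=(n_time, n_chan))
    reference = rng.normal(1.0, 0.1, size=(n_time, n_chan))

    # Flag any time/frequency cell where the reference antenna shows a strong
    # excess over its own per-channel baseline statistics; genuine sky signals
    # should be absent from the reference.
    baseline = np.median(reference, axis=0)
    mad = np.median(np.abs(reference - baseline), axis=0)
    rfi_mask = reference > baseline + 6 * 1.4826 * mad

    clean = np.where(rfi_mask, np.nan, science)
    print(f"flagged {rfi_mask.mean():.2%} of time/frequency cells")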

Maura McLaughlin (WVU), channeling Sarah Burke-Spolaor (WVU)

  • FRBs – Big Data, SC and ML
  • Need distance estimates and localization to learn about various physics.
  • Limited by the large number of false alarms (RFI) embedded in large datasets (tens of TB to 0.5 PB for about 650 hours of non-continuous observations over a year).
  • Interested in ML to differentiate FRB from noise.
  • Localization also constrains FRB sources and makes it easier to eliminate RFI.
  • Interested in a real-time data pipeline that would eliminate the need to save large data sets.
  • ML could help classify signals as FRB, RFI (current and future), unknown, etc.
  • FRBs have a distinct, unique(?) DSP signature (see the worked example below).
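
The “distinct DSP signature” is the quadratic dispersion sweep. As an illustrative example (numbers chosen arbitrarily), a burst with DM = 500 pc cm^-3 arrives at 1.2 GHz roughly

    Δt ≈ 4.15 ms × 500 × (1.2^-2 - 1.5^-2) ≈ 0.52 s

later than at 1.5 GHz, whereas impulsive terrestrial RFI generally shows no such frequency-dependent delay; this is what makes automated discrimination tractable.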

Jeff Peterson (CMU)

  • FRB Detection with HIRAX and array-based systems.
  • Good localization is an inherent aspect of this detection system.
  • The project is not saving (Exabyte) data sizes due to funding limitations. Should it?
  • DSP-based, specialized hardware forms the basis of the data processing, constrained by power and Moore’s Law limitations.
  • Must have automated (ML) algorithms to find and reject RFI and make detections at these data set sizes (possible: exabytes/year; actual: 8 TB/day). How can we cost-effectively save and analyze 1 exabyte/yr? Should we? (Decided probably not.)
  • Wants fast search algorithms to find accelerated pulsars.
  • Interested in a sustained pipeline of generations of ASIC processors.

Aaron Ewall-Wice (MIT), 21cm Intensity Mapping (HERA)

  • Roughly 80% of the observable universe remains unobserved.
  • During the cosmic dark ages and cosmic dawn, neutral hydrogen was present, and detections of the emission from this gas are used to map its distribution. This permits probing some of the otherwise unobserved universe.
  • Expecting data rates of 324 TB/day; about 14 PB are needed to detect the signal.
  • Data processing challenges include foregrounds, calibration, RFI flagging, antenna-antenna correlation coordinate transforms, map/model fitting.
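
(Back-of-envelope, not stated at the meeting: at 324 TB/day, accumulating the ~14 PB needed for a detection corresponds to roughly 14,000 TB / 324 TB per day ≈ 43 days of recorded data.)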

Big Data and Computer Science Solutions

Mike Levine, Joel Welling, J. Ray Scott (PSC)

  • Big Data Capabilities
    • File System Support for (geographically) distributed systems
      • Data Exacell (NSF DIBBS, Cyberinfrastructure)
      • SLASH2 file system
      • System goal is to reduce complexity associated with storing and finding large datasets that may be geographically distributed.
      • Relies on metadata and collection of servers to simulate a single namespace.
      • Includes support for redundancy, error detection/correction and data geo-location (i.e. provide good local read-only performance over wide area).
      • SLASH2/Lustre are comparable.
    • PSC goal is to support scientific achievement.
      • Provides data storage, compute resources, computation support, networking infrastructure.
    • Domain/General HPC expertise
    • Collaboration
      • Advanced User Support Grant (20% PSC FTE for 6mo)
      • Proposals
  • Big Data Experience
    • Multiple deployments of ~PB I/O servers (SLASH2).
    • Deployed SLASH2 at GBO, with the aim of distributing data between UVa, the GBT, and PSC.
    • GBT Mapping Pipeline, Electron Microscopy

Katerina Goseva-Popstojanova (WVU, CS-EE)

  • Goal: automatic ML-based identification and classification of single-pulse pulsar candidates (dispersed pulse groups).
  • Started with pre-processed data, which was then passed through identification (a custom(?) recursive peak-identification algorithm) and classification (six standard supervised ML algorithms) stages.
  • It is tedious to create training data sets for supervised ML algorithms.
  • Found that the Random Forest (RF) ensemble tree learner provided the best overall performance (a minimal sketch follows this list).
  • Further data analysis using RF permitted selection of the classifiers used to search for pulsars, which were also applied to the training data.
  • The observation data products consisted of a huge number of files and required pre-processing to distill data products appropriate for ML analysis. What are the best practices for data-pipeline analytics?
  • This domain is highly parallelizable, which enables options such as Hadoop, Spark, etc.
  • Planning to use online streaming ML algorithms for pseudo-real-time classification.
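
A minimal scikit-learn sketch of the supervised-classification stage described above; the feature set, labels, and sizes are placeholders, not the actual WVU pipeline.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Hypothetical candidate features from a single-pulse search, e.g. peak S/N,
    # best-fit DM, pulse width, number of grouped events, etc.
    rng = np.random.default_rng(4)
    n_cand, n_feat = 5000, 6
    X = rng.normal(size=(n_cand, n_feat))
    y = rng.integers(0, 2, size=n_cand)      # 1 = pulsar candidate, 0 = RFI/noise

    # Random Forest was reported as the best-performing of the learners tried.
    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                 random_state=0, n_jobs=-1)
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(f"cross-validated F1: {scores.mean():.2f} +/- {scores.std():.2f}")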

Dan Dennison (Google)

  • Big Data Processing on Google Cloud Platform
  • Suite of Google Tools: GFS, MapReduce, BigTable, TensorFlow, Flume, Millwheel, F1, etc.
  • These tools map to ML, DataFlow, DataStore, Storage, DataProc, Bigtable, Cloud Storage functionality.
  • Intended to play well with other Google/Non-google open source projects, especially w.r.t. containers. Aims to reduce effort to set up infrastructure so most effort goes into solving domain-specific problems.
  • Apache Beam Vision – aimed at pipeline developers, SDK writers, and runner writers (a minimal pipeline sketch follows this list).
  • App Engine can be used to train AIs.
  • $30B investment near term in data centers located where science happens.
  • Grant program to make Google Cloud (GC) available to students in a course setting (Education).
  • Collaboration with NSF (Research), BIGDATA, deadlines in March.
  • Collaboration with Internet2 is in the works (perhaps agreement in next few months). Google-I2 webinar (Feb 16 @ 1PM EST)
  • Question (Rachel): What resources are available to match problems with ML techniques? TODO: Dave Anderson to try to share a useable flowchart.
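
A minimal Apache Beam (Python SDK) sketch of the kind of candidate-filtering pipeline the Beam model targets; the record format, bucket paths, and thresholds are hypothetical.

    import apache_beam as beam

    # Hypothetical text records, one candidate per line: "candidate_id,dm,snr".
    def parse(line):
        cid, dm, snr = line.split(",")
        return {"id": cid, "dm": float(dm), "snr": float(snr)}

    with beam.Pipeline() as p:   # defaults to the local DirectRunner
        (p
         | "Read"   >> beam.io.ReadFromText("gs://example-bucket/candidates.csv")
         | "Parse"  >> beam.Map(parse)
         | "Filter" >> beam.Filter(lambda c: c["snr"] >= 8.0 and c["dm"] > 2.0)
         | "Format" >> beam.Map(lambda c: "{},{},{}".format(c["id"], c["dm"], c["snr"]))
         | "Write"  >> beam.io.WriteToText("gs://example-bucket/strong_candidates"))

The same pipeline code can be pointed at Cloud Dataflow (or another runner) through pipeline options, without code changes.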

Andrew Moore (CMU CS)

  • Expertise and Projects at CMU CS and ECE
  • Interests: Large Scale ML and Large Scale statistics
  • Data Science Stack (ascending):
    • New hardware (e.g. ASICs, BigRAM)
    • Middleware (Spark Stack, Container Models, Optimizations for pipeline processing and searches)
    • Mathematical and Statistical tools emerging to support/augment approaches to the problems we’ve discussed today.
  • Aim to leave here today with explicit use cases for some of the problems we have discussed. It is easier to achieve collaborations with organizations that want to show the benefits of exascale/petascale applications.

Review and Next Steps

Questions

  • How do we collaborate with Google? TODO: Dan will refer Richard to someone who can address this.
  • Is anyone researching ways to methodically provision systems? Dan and Mike indicate the market (e.g. pricing) is too volatile to make this a methodical process. Extended discussion ensued.
  • DJ asked about costs for Google Cloud computing. Dan responded that Google plans to offer a discount, which is distinct from the grant options Dan discussed in his talk. Research outreach is in the works, but details are TBD.
  • DJ: AWS charges for processing and data egress; does Google take the same approach? Dan googled commercial rates; it sounds like a similar operations model to AWS, except that Google has significantly cheaper rates for warm, cold, and less-redundant storage.

Agreements

  • Nichol: Planning to work with PSC on reducing the time it takes to process Argus data.

Potential Areas of Collaboration:

  • File systems (distributed, larger…)
  • Machine learning (will Katerina share her training / test data set?). Extractor…
  • Getting GBO problems / software running on PSC / in the Google Cloud?
  • Data Visualization / information extraction

Use Cases (Mark’s Version)

  • GBO Real Time RFI Flagging/Mitigation
  • Automated VLA Spectral Line DQA
  • Novel Argus Data Processing Approach
  • Algorithm Development for Pulsar Candidate Detection
  • Automated FRB Localization and Classification
  • HIRAX Automated RFI Detection/Mitigation
  • HERA Big Data Computing

Use Cases (Speakers’ Versions)

  • DJ: Automated RFI flagging/mitigation (calibration versus target data analysis for imaging), using a pipeline model.
  • Nichol: Needs a (mathematically, statistically) novel Argus data-processing approach.
  • Ryan: Advanced RFI mitigation, as close to the detector as possible.