Scalability and performance testing of an R-GMA based Grid job monitoring system for CMS data production
Experiments such as CMS (Compact Muon Solenoid, at CERN) have enormous computing requirements for both simulation and subsequent analysis of the recorded data.
Within CMS, BOSS [1,2] was developed as a job monitoring framework within the context of local batch farms. Deployment of BOSS onto the Grid would be problematic, as it requires direct access to the DBMS from running jobs, raising concerns about network access, firewalls, and the distribution of DBMS access credentials to remote sites. We therefore investigated using R-GMA [3] to transport BOSS' monitoring messages from jobs running across LCG testbeds back to a database local to the user.
We have written bossmin (C++), a BOSS "emulator" that publishes into R-GMA simple monitoring messages corresponding to a single test job, and bossminj (Java), which comprises a CMS job simulator and message publisher together with a corresponding archiver that logs messages received via R-GMA into a local database. Each bossminj "simulation" task can masquerade as a large number of individual CMS production jobs ("simjobs"), allowing us to stress the R-GMA framework without using significant CPU resources at the remote sites.
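To illustrate the shape of the publisher side, the sketch below shows a simjob-style loop that formats each monitoring message as an SQL INSERT tuple, which is how R-GMA's relational model accepts data from producers. The table and column names (JobMessages, jobId, messageId, payload) and the publish callback are illustrative assumptions only, not the actual bossminj code or R-GMA API.

    // Hypothetical sketch of a bossminj-style simjob publisher. R-GMA producers
    // publish tuples as SQL INSERT statements; the table/column names and the
    // publish callback below are illustrative assumptions, not the real code.
    import java.util.function.Consumer;

    public class SimJobSketch implements Runnable {
        private final String jobId;
        private final int nMessages;
        private final Consumer<String> publish;   // wraps whatever hands tuples to R-GMA

        public SimJobSketch(String jobId, int nMessages, Consumer<String> publish) {
            this.jobId = jobId;
            this.nMessages = nMessages;
            this.publish = publish;
        }

        @Override
        public void run() {
            for (int i = 0; i < nMessages; i++) {
                // Each monitoring message becomes one tuple in a (hypothetical) table.
                String tuple = String.format(
                    "INSERT INTO JobMessages (jobId, messageId, payload) "
                    + "VALUES ('%s', %d, 'step %d done')", jobId, i, i);
                publish.accept(tuple);
                try {
                    Thread.sleep(1000);           // simulate work between messages
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

Running many such loops as threads inside one Grid job is what lets a single bossminj task stand in for a large batch of simjobs without consuming significant CPU at the remote site.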
By comparing the messages submitted to R-GMA by the remote bossminj instances (logged within text files returned via the usual Grid job sandbox mechanism) with those received from R-GMA and stored in the local BOSS database, we were able to assess the performance and scalability of the R-GMA framework.
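The cross-check itself amounts to bookkeeping: count what each job logged as published against what arrived in the local database. A minimal sketch follows, assuming a one-line-per-message log format and a JobMessages table in the local BOSS database; the log format, JDBC URL and schema are hypothetical stand-ins for the actual files and tables in the archives listed below.

    // Hypothetical sketch of the published-vs-received cross-check. The log-file
    // format ("PUBLISHED ..." lines), JDBC URL and table/column names are
    // assumptions for illustration; the real files and schema are in the archive.
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.stream.Stream;

    public class MessageAudit {
        public static void main(String[] args) throws Exception {
            String jobId = args[0];

            // Messages the remote job logged as handed to R-GMA (sandbox log file).
            long published;
            try (Stream<String> lines = Files.lines(Paths.get(args[1]))) {
                published = lines.filter(l -> l.startsWith("PUBLISHED ")).count();
            }

            // Messages that actually reached the local BOSS database.
            try (Connection db = DriverManager.getConnection(
                     "jdbc:mysql://localhost/boss", "boss", "secret");
                 PreparedStatement st = db.prepareStatement(
                     "SELECT COUNT(*) FROM JobMessages WHERE jobId = ?")) {
                st.setString(1, jobId);
                try (ResultSet rs = st.executeQuery()) {
                    rs.next();
                    long received = rs.getLong(1);
                    System.out.printf("%s: published=%d received=%d missing=%d%n",
                                      jobId, published, received, published - received);
                }
            }
        }
    }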
Tests on a dedicated testbed in 2003 initially struggled to monitor 400 jobs [4,5], but after improvements to both the code and the infrastructure the framework was able to monitor 6000 virtual jobs [6].
In October 2005 we tracked 1,000 simultaneously running virtual jobs [7,8] across the LCG 2.6.0 Grid for 6 hours. Of the 23,000 simjobs submitted, 14,000 (61%) ran at a remote site, of which 13,683 (98%) transferred all of their messages into our local database. Every one of the 1,017,052 individual messages logged as published into R-GMA was also transferred successfully.
CounterDemo is a simplified demonstration of message publishing with R-GMA.
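The CounterDemo code itself is in the archive below; as a rough indication of the kind of check such a demo performs, the sketch here publishes an incrementing counter through a stand-in channel and verifies on the consumer side that the sequence arrives complete and in order. The BlockingQueue is only a placeholder for the R-GMA producer/consumer pair, not the actual CounterDemo implementation.

    // Minimal counter-style publish/consume check. The BlockingQueue is a
    // stand-in for the R-GMA transport; this illustrates the test pattern,
    // not the actual CounterDemo code.
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class CounterSketch {
        public static void main(String[] args) throws InterruptedException {
            final int n = 100;
            BlockingQueue<Integer> channel = new LinkedBlockingQueue<>();

            // Publisher: emit an incrementing counter.
            Thread publisher = new Thread(() -> {
                for (int i = 0; i < n; i++) {
                    channel.add(i);
                }
            });
            publisher.start();

            // Consumer: confirm every value arrives, with no gaps or reordering.
            for (int expected = 0; expected < n; expected++) {
                int got = channel.take();
                if (got != expected) {
                    System.err.println("expected " + expected + " but received " + got);
                }
            }
            publisher.join();
            System.out.println("all " + n + " counter values received");
        }
    }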
Material
bossmin_v2.1.zip (3/10/2003): bossmin (v2.1) - BOSS emulator (for R-GMA 3.2.22, for testing basic R-GMA functionality).
bossmin_v2.3.zip (7/10/2004): bossmin (v2.3) - BOSS emulator (for LCG 2.2.0 testbed, for testing basic R-GMA functionality).
bossminj-NSS05.zip (11/11/2005): R-GMA/BOSS tests for IEEE papers (NSS '05 version, for LCG 2.6.0).
CounterDemo_v1.0.zip (28/08/2003): CounterDemo (v1.0) - demo/test.
BOSSRGMAtestResults03.zip (3/03/2004): Output files from Grid submissions.
ee_results.tar.gz (27/09/2005): Output files from Grid submissions.
Young-rgma_res.zip (10/11/2006): Output files from Grid submissions.
young.HistTable.sql.gz (23/01/2008): SQL dump of R-GMA HistoryProducer DB table.
young.LPTable.sql.gz (23/01/2008): SQL dump of R-GMA LatestProducer DB table.
Acknowledgements
Henry Nebrensky wrote bossmin (emulating BOSS' job wrapper) and CounterDemo.
Paul Kyberd and Henry Nebrensky wrote bossminj.
Henry Nebrensky submitted the jobs to the Grid, monitored their progress and tabulated the results.
bossmin and CounterDemo are distributed as Open Source under the terms of the EU DataGrid Software License. bossmin, bossminj and CounterDemo were first made publicly available on the WWW in 2003.
Jobs were submitted to the CMS/LCG0, LCG 2.2.0 and LCG 2.6.0 Grid testbeds. The R-GMA project, as well as this work itself, were supported by GridPP [9] in the UK. Many individuals helped by supporting the underlying Grid and R-GMA frameworks [4-8].
Disclaimer
This data is provided in the form of log files and database dumps as saved to disk over a decade ago (timestamps listed above). Supporting information is mostly from memory.
References
1. C. Grandi and A. Renzi: "Object Based System for Batch Job Submission and Monitoring (BOSS)" CMS Note 2003/005 (2003)
2. C. Grandi: "BOSS: a tool for batch job monitoring and book-keeping" in CHEP03 - Computing in High Energy and Nuclear Physics, La Jolla, California USA; Conference record THET001 (2003)
3. A. Cooke et al.: "R-GMA: First results after deployment" in CHEP03 - Computing in High Energy and Nuclear Physics, La Jolla, California USA; Conference record MOET004 (2003)
4. D. Bonacorsi et al.: "Scalability tests of R-GMA based grid job monitoring system for CMS Monte Carlo data production" in IEEE Nuclear Science Symposium/Medical Imaging Conference, Portland, Oregon USA; Conference Record 3 pp.1630-1632. DOI: 10.1109/NSSMIC.2003.1352190 (2003)
5. D. Bonacorsi et al.: "Scalability tests of R-GMA-based grid job monitoring system for CMS Monte Carlo data production" IEEE Transactions on Nuclear Science, 51(6) pp.3026-3029. DOI: 10.1109/TNS.2004.839094 (2004)
6. R. Byrom et al.: "Performance of R-GMA based grid job monitoring system for CMS data production" in IEEE Nuclear Science Symposium/Medical Imaging Conference, Rome, Italy; Conference Record 4 pp.2033-2037. DOI: 10.1109/NSSMIC.2004.1462663 (2004)
7. R. Byrom et al.: "Performance of R-GMA for monitoring grid jobs for CMS data production" in IEEE Nuclear Science Symposium/Medical Imaging Conference, Fajardo, Puerto Rico; Conference Record pp.860-864. DOI: 10.1109/NSSMIC.2005.1596391 (2005)
8. R. Byrom et al.: "Performance of R-GMA for monitoring grid jobs for CMS data production" poster shown at IEEE Nuclear Science Symposium/Medical Imaging Conference, Fajardo, Puerto Rico, 23rd – 29th October 2005. [BURA]
9. The GridPP Collaboration: “GridPP: development of the UK computing Grid for particle physics” Journal of Physics G: Nuclear and Particle Physics, 32(1) pp. N1-N20. DOI: 10.1088/0954-3899/32/1/N01 (2006)