Monday, August 6, 2018

73.2

list_replicas()
- Both the pilot and the rucio server were found to call list_replicas(). The duplicate call is unnecessary and has increased the load on the rucio servers: with the ongoing migration to rucio as sitemover, rucio calls list_replicas() for each input file download. A quick fix is to use the --pfn option with rucio download, which prevents rucio from also calling list_replicas(). This does, however, bypass useful features including fallbacks, so a better solution is being implemented on the rucio side (it will add a locally cached metadata file). That solution will require another pilot update during the next couple of weeks.

Wrong error message
- The error message "Payload exceeded max allowed memory" was overwritten by the error message for a kill signal, so the kill signal message was the one that ended up on the monitor page for the failed job. This should now be fixed. Reported by R. Walker.

Thursday, July 12, 2018

73.1

A new pilot version has been released with a minor update:
In debug mode, the pilot now scans for the latest updated payload log file and sends its tail with each heartbeat (every five minutes). Requested by R. Walker.
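
The scan-and-tail step could look roughly like the sketch below. This is not the pilot's actual code: the function names, the "*.log" pattern and the 20-line limit are assumptions made for illustration.

```python
import glob
import os


def find_latest_log(workdir, pattern="*.log"):
    """Return the most recently modified file matching pattern, or None.

    The "*.log" pattern is an assumption; the real payload log names may differ.
    """
    candidates = glob.glob(os.path.join(workdir, pattern))
    if not candidates:
        return None
    # The file with the largest modification time is the latest updated one
    return max(candidates, key=os.path.getmtime)


def tail(path, max_lines=20):
    """Return the last max_lines lines of a text file as a single string."""
    with open(path, "r") as f:
        lines = f.readlines()
    return "".join(lines[-max_lines:])
```

A heartbeat handler would call find_latest_log() on the payload work directory and, if a file is found, attach tail() of that file to the update it sends.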

Thursday, July 5, 2018

73.0

Containers
- No formal container development in Pilot 1 - all container testing is done with Pilot 2
- The new pilot instruction arriving with the job parameters (--containerimage ) is removed from the job parameters if it is present (i.e. it is only acted on in Pilot 2)

Pilot timing
- Added on-the-fly measurement of CPU consumption time
- Pilot now reports this timing in job updates
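
One way to measure CPU consumption while a process is still running is to read its utime and stime counters from /proc/<pid>/stat on Linux. The sketch below illustrates that idea only; it is not necessarily how the pilot does it, the function names are hypothetical, and it covers a single process rather than a whole process tree.

```python
import os


def cpu_seconds_from_stat(stat_line, ticks_per_second):
    """Return user + system CPU seconds from the content of /proc/<pid>/stat."""
    # The second field (comm) may contain spaces and parentheses,
    # so split on the last ')' before splitting on whitespace.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime is field 14 and stime is field 15
    utime = int(fields[11])
    stime = int(fields[12])
    return (utime + stime) / float(ticks_per_second)


def cpu_seconds(pid):
    """Read the current CPU consumption time of a running process (Linux only)."""
    with open("/proc/%d/stat" % pid) as f:
        return cpu_seconds_from_stat(f.read(), os.sysconf("SC_CLK_TCK"))
```

Calling cpu_seconds() at each job update gives a running total that can be reported before the payload has finished.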

Tracing
- Removed any present escape characters from stateReason
- Now reporting localSite properly in traces

Google updates
- Added https:// as approved protocol for direct access
- Added escape character for &, needed for turls

LAPP debugging
- Added detailed rucio output to log

Event service
- Now using killpg instead of kill, to include child processes in time-outs
- Now allowing ES merge jobs to select closest inputs
- On-the-fly CPU consumption time also reported for ES jobs

Contributions from W. Guan, M. Lassnig, N. Magini, P. Nilsson.

Friday, May 18, 2018

72.11

Rucio copytool update (from M. Lassnig):
- Removed troublesome API fallback
- Added -v option for more verbose output, requested by Stephane Jezequel

Google testing (from M. Lassnig):
- Added https protocol to schemas used in replica resolution algorithm

Bug fix:
- Changed logger.warning -> pUtil.tolog in detect_client_location(), reported by Javier Sanchez Martinez

Tuesday, May 8, 2018

72.10

The pilot has been updated for an issue seen (at least) at QMUL with metadata containing garbage data. Requested by R. Walker.

Wednesday, May 2, 2018

72.9

Fix to size of workdir not within allowed limit
- The size of the workdir is now remeasured after removing the input files, to fix a bad log message and especially to avoid sending a wrong value to job metrics
- Requested by R. Walker
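
The remeasurement can be pictured as in the sketch below. It assumes the input files sit directly in the work directory, and the helper names are hypothetical rather than taken from the pilot.

```python
import os


def dir_size(path):
    """Return the total size in bytes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symbolic links so linked files are not counted
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def remove_inputs_and_remeasure(workdir, input_files):
    """Delete the given input files, then measure the work directory again."""
    for name in input_files:
        file_path = os.path.join(workdir, name)
        if os.path.exists(file_path):
            os.remove(file_path)
    # Measuring after the removal avoids reporting a stale, too-large value
    return dir_size(workdir)
```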

Zipmapping
- After pilot+server+rucio updates, Archive_tf can now be used as intended (zips input files, server knows about metadata already)
- First we tested skipping Archive_tf, since it packed the files before the pilot could create metadata; then we let the pilot create the archive. It turned out that the server couldn’t handle the additional metadata, since it believed it came from skipped input files (old mechanism)

Fix for Harvester kill worker
- Requested by F. Lin

Using rucio API for stage-out when rucio upload fails
- Previous problems with exceptions in the client have been resolved (by T. Wegner)

Event service
- Removed postExec AthenaMP option for PoolFileCatalog

Fix added after the ADC Weekly meeting:
- Now setting proper pilot error code in case of lsm stage-in failures due to checksum verification issues.

Code contributions from W. Guan, T. Javurek, A. Anisenkov, P. Nilsson.

Thursday, April 5, 2018

72.8

A new pilot version was just released with the following changes (presented in the ADC Weekly last week):

* Simplification of the PoolFileCatalog
- Removal of LFN from PFC
- Problem with derivations when having many input files
- Requested by R. Walker

* Memory monitor setup
- Added missing platform string for Nordugrid
- Requested by D. Cameron

* GFAL-copy update
- Support of dynafed+cloud ddm endpoint was added

* More accurate HH:MM time calculation for the proxy validation command "grid-proxy-info -exists -valid HH:MM"
- Previous version failed on short time limits
- Note: arcproxy is used for most sites
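
As an illustration of the HH:MM conversion, the sketch below turns a time limit in seconds into the string passed to -valid. The function names are hypothetical, and rounding up to the next full minute is an assumption chosen so that a short limit does not collapse to 00:00; the note above does not say how the pilot handles this.

```python
def to_hhmm(seconds):
    """Convert a non-negative number of seconds to an HH:MM string."""
    if seconds < 0:
        raise ValueError("time limit must not be negative")
    # Round up to whole minutes (assumption), then split into hours and minutes
    total_minutes = (int(seconds) + 59) // 60
    hours, minutes = divmod(total_minutes, 60)
    return "%02d:%02d" % (hours, minutes)


def proxy_validation_command(seconds):
    """Build the proxy validation command for the given time limit."""
    return "grid-proxy-info -exists -valid %s" % to_hhmm(seconds)
```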

* Event service update
- Now using different message server after Prefetcher restart (in case previous instance not killed properly)
- Note: problem fixed properly by adding session id to subprocess.Popen command to allow for killing entire process group with os.killpg()
- This fix was also applied for non-event service case
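
The session id approach can be sketched as follows. This is a minimal illustration using preexec_fn=os.setsid and is not the pilot's exact code; the helper names are hypothetical.

```python
import os
import signal
import subprocess


def start_in_new_session(cmd):
    """Start a shell command as the leader of a new session and process group."""
    # os.setsid runs in the child before exec, so the command and every
    # process it spawns share a process group separate from the caller's
    return subprocess.Popen(cmd, shell=True, preexec_fn=os.setsid)


def kill_process_group(proc, sig=signal.SIGKILL):
    """Send sig to the whole process group of proc, then reap the process."""
    try:
        os.killpg(os.getpgid(proc.pid), sig)
    except OSError:
        # The process group is already gone
        pass
    proc.wait()
```

Because the signal goes to the process group rather than to a single pid, children started by the payload are terminated along with it.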

Additional update added since the ADC Weekly meeting last week:

* Container update
- Now using x86_64-centos6.img for SLC6 platforms as default image

Code contributions from A. Bogdanchikov, N. Magini, P. Nilsson.