Project Description

WebFamulus is a web application that uses the Quartz scheduler framework for process automation. It provides a web interface to schedule, monitor, and interact with Java processes that are configured as 'jobs'. Access to user and administrative interfaces is based on permissions and group ownership. Likewise, the configuration of jobs, alerts, performance monitoring and statistics is based on group ownership and permissions granted to an individual user. The application is currently under development and a beta version is available to selected clients.


Typical applications for the Quartz framework are network and hardware monitors, database maintenance, reporting, etc. Tedious, time-consuming tasks can run repeatedly on sophisticated schedules that take into account selected dates and weekdays, black-out times, holidays, etc. The field of application, however, is limited only by what the programming language can provide. Quartz makes it possible to schedule and execute automatically just about anything that can be programmed in Java.


In addition, WebFamulus supports executing jobs in sequences, where each step can (optionally) depend on the exit status of the preceding job. Furthermore, the application is backed by an extensive database that lets non-administrative users create, configure, interact with, and monitor their own jobs, alerts, and reports on-line or via electronic mail.
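The idea of chaining jobs on the exit status of their predecessor can be sketched in a few lines. This is not WebFamulus code: the Step class, the run_chain function, and the convention that a step names the exit status it requires are assumptions made purely for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One job in a sequence (hypothetical structure, not a WebFamulus type)."""
    name: str
    command: List[str]
    # Exit status the preceding job must have returned for this step to run.
    # None means the step runs regardless of the predecessor's exit status.
    required_status: Optional[int] = None


def run_chain(steps: List[Step]) -> List[str]:
    """Run steps in order and return the names of the steps that were executed."""
    executed: List[str] = []
    previous_status: Optional[int] = None
    for step in steps:
        condition_set = step.required_status is not None and previous_status is not None
        if condition_set and previous_status != step.required_status:
            break  # the predecessor's exit status does not allow this step
        previous_status = subprocess.run(step.command).returncode
        executed.append(step.name)
    return executed


if __name__ == "__main__":
    # Two stand-in jobs: one exits with status 0, the other with status 3.
    succeed = [sys.executable, "-c", "raise SystemExit(0)"]
    fail = [sys.executable, "-c", "raise SystemExit(3)"]
    chain = [Step("check", succeed), Step("repair", fail, 0), Step("report", succeed, 0)]
    print(run_chain(chain))
```

In the demo, 'check' and 'repair' run, and 'report' is skipped because 'repair' exits with status 3 instead of the required 0.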


2013-12-03

1. SMART-enabled hard drives and smartmontools

SMART (Self-Monitoring, Analysis and Reporting Technology) is a monitoring system for hard disks devised by manufacturers to detect functional problems and warn of impending failures. SMART-enabled hard disks provide indicators (attribute values) of their present working state and perform self-tests. If the indicators stay above their specified threshold values and the self-tests complete without error, the drive passes the overall-health assessment. If an attribute value drops to or below its threshold value, a drive failure is predicted.

smartmontools

smartmontools is an open-source utility suite (available at SourceForge) for interacting with SMART-enabled devices. It consists of two programs, smartd and smartctl. smartd is a daemon that logs SMART attribute values periodically. It can be configured to issue email alerts if an attribute crosses a threshold value. smartctl is a program to enable and disable SMART functionality, to query the working state of a hard drive, and to initiate self-tests.

The working state of SMART-enabled hard disks can be queried with the smartctl tool that is part of smartmontools. Command-line options determine which information about the capabilities and the working state of the device is reported. The application is run from the command line and writes its report to STDOUT, which in turn can be captured in a file. The following is a sample output of smartctl with all output options enabled (option -x):

1.
smartctl 5.40 2010-10-16 r3189 [x86_64-apple-darwin10.5.0] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family:     Fujitsu MHW2 BH series
Device Model:     FUJITSU MHW2120BH
Serial Number:    NZ0ST6B288UD
Firmware Version: 00810013
User Capacity:    120,034,123,776 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Thu Nov 20 07:16:00 2011 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
2.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 578) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  82) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
3.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   046    Pre-fail  Always       -       152969
  2 Throughput_Performance  0x0005   100   100   030    Pre-fail  Offline      -       32440320
  3 Spin_Up_Time            0x0003   100   100   025    Pre-fail  Always       -       1
  4 Start_Stop_Count        0x0032   095   095   000    Old_age   Always       -       21119
  5 Reallocated_Sector_Ct   0x0033   100   100   024    Pre-fail  Always       -       0 (2000, 0)
  7 Seek_Error_Rate         0x000f   100   100   047    Pre-fail  Always       -       3012
  8 Seek_Time_Performance   0x0005   100   100   019    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       12792
 10 Spin_Retry_Count        0x0013   100   100   020    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       7616
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       54
193 Load_Cycle_Count        0x0032   082   082   000    Old_age   Always       -       365307
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       36 (Min/Max 12/51)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1871
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0 (0, 6389)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000f   100   100   060    Pre-fail  Always       -       12160
203 Run_Out_Cancel          0x0002   100   100   000    Old_age   Always       -       3732279524976
240 Head_Flying_Hours       0x003e   200   200   000    Old_age   Always       -       0
4.
General Purpose Logging (GPL) feature set supported
ATA_READ_LOG_EXT (addr=0x00:0x00, page=0, n=1) failed: 48-bit ATA commands not supported
Read GP Log Directory failed.
SMART Log Directory Version 1 [multi-sector log support]
SMART Log at address 0x00 has 1 sectors [Log Directory]
SMART Log at address 0x01 has 1 sectors [Summary SMART error log]
SMART Log at address 0x02 has 51 sectors [Comprehensive SMART error log]
SMART Log at address 0x03 has 64 sectors [Ext. Comprehensive SMART error log]
SMART Log at address 0x06 has 1 sectors [SMART self-test log]
SMART Log at address 0x07 has 2 sectors [Extended self-test log]
SMART Log at address 0x09 has 1 sectors [Selective self-test log]
SMART Log at address 0x10 has 1 sectors [NCQ Command Error]
SMART Log at address 0x11 has 1 sectors [SATA Phy Event Counters]
SMART Log at address 0x80 has 16 sectors [Host vendor specific log]
SMART Log at address 0x81 has 16 sectors [Host vendor specific log]
SMART Log at address 0x82 has 16 sectors [Host vendor specific log]
SMART Log at address 0x83 has 16 sectors [Host vendor specific log]
SMART Log at address 0x84 has 16 sectors [Host vendor specific log]
SMART Log at address 0x85 has 16 sectors [Host vendor specific log]
SMART Log at address 0x86 has 16 sectors [Host vendor specific log]
SMART Log at address 0x87 has 16 sectors [Host vendor specific log]
SMART Log at address 0x88 has 16 sectors [Host vendor specific log]
SMART Log at address 0x89 has 16 sectors [Host vendor specific log]
SMART Log at address 0x8a has 16 sectors [Host vendor specific log]
SMART Log at address 0x8b has 16 sectors [Host vendor specific log]
SMART Log at address 0x8c has 16 sectors [Host vendor specific log]
SMART Log at address 0x8d has 16 sectors [Host vendor specific log]
SMART Log at address 0x8e has 16 sectors [Host vendor specific log]
SMART Log at address 0x8f has 16 sectors [Host vendor specific log]
SMART Log at address 0x90 has 16 sectors [Host vendor specific log]
SMART Log at address 0x91 has 16 sectors [Host vendor specific log]
SMART Log at address 0x92 has 16 sectors [Host vendor specific log]
SMART Log at address 0x93 has 16 sectors [Host vendor specific log]
SMART Log at address 0x94 has 16 sectors [Host vendor specific log]
SMART Log at address 0x95 has 16 sectors [Host vendor specific log]
SMART Log at address 0x96 has 16 sectors [Host vendor specific log]
SMART Log at address 0x97 has 16 sectors [Host vendor specific log]
SMART Log at address 0x98 has 16 sectors [Host vendor specific log]
SMART Log at address 0x99 has 16 sectors [Host vendor specific log]
SMART Log at address 0x9a has 16 sectors [Host vendor specific log]
SMART Log at address 0x9b has 16 sectors [Host vendor specific log]
SMART Log at address 0x9c has 16 sectors [Host vendor specific log]
SMART Log at address 0x9d has 16 sectors [Host vendor specific log]
SMART Log at address 0x9e has 16 sectors [Host vendor specific log]
SMART Log at address 0x9f has 16 sectors [Host vendor specific log]
SMART Log at address 0xa1 has 1 sectors [Device vendor specific log]
SMART Extended Comprehensive Error Log (GP Log 0x03) not supported
SMART Error Log Version: 1
No Errors Logged
5.
SMART Extended Self-test Log (GP Log 0x07) not supported
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12791         -
# 2  Short offline       Completed without error       00%     12780         -
# 3  Short offline       Aborted by host               80%     12780         -
# 4  Extended offline    Completed without error       00%     12766         -
# 5  Short offline       Completed without error       00%     12764         -
# 6  Short offline       Completed without error       00%     12740         -
# 7  Short offline       Completed without error       00%     12729         -
# 8  Short offline       Completed without error       00%     12683         -
# 9  Short offline       Completed without error       00%     12682         -
#10  Short offline       Completed without error       00%     12626         -
#11  Short offline       Completed without error       00%     12566         -
#12  Short offline       Completed without error       00%     12459         -
#13  Extended offline    Completed without error       00%     12456         -
#14  Short offline       Completed without error       00%     12454         -
#15  Short offline       Completed without error       00%     12251         -
#16  Short offline       Completed without error       00%     12241         -
#17  Short offline       Completed without error       00%     12190         -
#18  Short offline       Completed without error       00%     12151         -
#19  Extended offline    Completed without error       00%     12143         -
#20  Extended offline    Aborted by host               30%     12141         -
#21  Extended offline    Completed without error       00%     12137         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
6.
SCT Status Version:                  2
SCT Version (vendor specific):       0 (0x0000)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                 36 Celsius
Power Cycle Max Temperature:         37 Celsius
Lifetime    Max Temperature:         51 Celsius
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      5/60 Celsius
Min/Max Temperature Limit:           -40/65 Celsius
Temperature History Size (Index):    128 (26)
Index    Estimated Time   Temperature Celsius
  27    2011-01-20 05:09    39  ********************
 ...    ..(  4 skipped).    ..  ********************
  32    2011-01-20 05:14    39  ********************
  33    2011-01-20 05:15    38  *******************
 ...    ..( 12 skipped).    ..  *******************
  46    2011-01-20 05:28    38  *******************
  47    2011-01-20 05:29    37  ******************
 ...    ..(  5 skipped).    ..  ******************
  53    2011-01-20 05:35    37  ******************
  54    2011-01-20 05:36    38  *******************
  55    2011-01-20 05:37    38  *******************
  56    2011-01-20 05:38    37  ******************
 ...    ..(  3 skipped).    ..  ******************
  60    2011-01-20 05:42    37  ******************
  61    2011-01-20 05:43     ?  -
  62    2011-01-20 05:44    37  ******************
  63    2011-01-20 05:45     ?  -
  64    2011-01-20 05:46    35  ****************
 ...    ..(  5 skipped).    ..  ****************
  70    2011-01-20 05:52    35  ****************
  71    2011-01-20 05:53     ?  -
  72    2011-01-20 05:54    20  *
  73    2011-01-20 05:55    21  **
  74    2011-01-20 05:56    22  ***
  75    2011-01-20 05:57    23  ****
  76    2011-01-20 05:58    23  ****
  77    2011-01-20 05:59    24  *****
  78    2011-01-20 06:00    24  *****
  79    2011-01-20 06:01    25  ******
  80    2011-01-20 06:02    25  ******
  81    2011-01-20 06:03    25  ******
  82    2011-01-20 06:04    26  *******
  83    2011-01-20 06:05    26  *******
  84    2011-01-20 06:06    27  ********
  85    2011-01-20 06:07    27  ********
  86    2011-01-20 06:08    28  *********
  87    2011-01-20 06:09    28  *********
  88    2011-01-20 06:10    28  *********
  89    2011-01-20 06:11    29  **********
  90    2011-01-20 06:12    29  **********
  91    2011-01-20 06:13    29  **********
  92    2011-01-20 06:14    30  ***********
  93    2011-01-20 06:15     ?  -
  94    2011-01-20 06:16    18  -
  95    2011-01-20 06:17    19  -
  96    2011-01-20 06:18    20  *
  97    2011-01-20 06:19    21  **
  98    2011-01-20 06:20    22  ***
  99    2011-01-20 06:21    23  ****
 100    2011-01-20 06:22    23  ****
 101    2011-01-20 06:23    24  *****
 102    2011-01-20 06:24    25  ******
 103    2011-01-20 06:25    25  ******
 104    2011-01-20 06:26    25  ******
 105    2011-01-20 06:27    26  *******
 106    2011-01-20 06:28    27  ********
 107    2011-01-20 06:29    28  *********
 108    2011-01-20 06:30    29  **********
 109    2011-01-20 06:31    30  ***********
 110    2011-01-20 06:32    30  ***********
 111    2011-01-20 06:33    30  ***********
 112    2011-01-20 06:34    31  ************
 ...    ..(  2 skipped).    ..  ************
 115    2011-01-20 06:37    31  ************
 116    2011-01-20 06:38    32  *************
 117    2011-01-20 06:39    32  *************
 118    2011-01-20 06:40    33  **************
 ...    ..(  6 skipped).    ..  **************
 125    2011-01-20 06:47    33  **************
 126    2011-01-20 06:48    34  ***************
 127    2011-01-20 06:49    34  ***************
   0    2011-01-20 06:50    34  ***************
   1    2011-01-20 06:51    35  ****************
 ...    ..(  9 skipped).    ..  ****************
  11    2011-01-20 07:01    35  ****************
  12    2011-01-20 07:02    36  *****************
  13    2011-01-20 07:03    35  ****************
  14    2011-01-20 07:04    36  *****************
  15    2011-01-20 07:05    35  ****************
  16    2011-01-20 07:06    36  *****************
 ...    ..(  8 skipped).    ..  *****************
  25    2011-01-20 07:15    36  *****************
  26    2011-01-20 07:16    37  ******************
Error SMART WRITE LOG does not return COUNT and LBA_LOW register
Warning: device does not support SCT (Get) Error Recovery Control command
SATA Phy Event Counters (GP Log 0x11) not supported
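A report like the one above can be collected from a script by running smartctl and writing its STDOUT to a file. The following is a minimal sketch: the function names and the device path /dev/disk0 are assumptions for illustration, smartctl must be installed, and it usually needs administrative privileges to query a drive.

```python
import subprocess
from pathlib import Path
from typing import List


def smartctl_command(device: str) -> List[str]:
    """Build the smartctl call with all output options enabled (-x)."""
    return ["smartctl", "-x", device]


def capture_report(device: str, target: Path) -> int:
    """Run smartctl, write its STDOUT to `target`, and return the exit status."""
    result = subprocess.run(smartctl_command(device), capture_output=True, text=True)
    target.write_text(result.stdout)
    return result.returncode


if __name__ == "__main__":
    # /dev/disk0 is a placeholder; use the device name of the drive to query.
    status = capture_report("/dev/disk0", Path("smartctl-report.txt"))
    print("smartctl exit status:", status)
```

The exit status is returned rather than checked, because smartctl reports drive conditions through it as well and a non-zero value does not necessarily mean that the command itself failed.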

The output of smartctl -x is divided into six sections. The first two (marked above as 1 and 2, respectively) provide information about the device: the manufacturer, model and device identifiers, which standards and protocols are supported, etc. Most important for the health assessment of the device is the first data line of section 2, which indicates whether the device PASSED or FAILED the test. (See above: 'SMART overall-health self-assessment test result: PASSED'.) This is the only standard element a device must support in order to be compliant with the SMART standard.
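Because the overall-health line has a fixed wording, it can be pulled out of a captured report with a single regular expression. A minimal sketch follows; the function name overall_health is an assumption for illustration.

```python
import re
from typing import Optional

# Matches the health line as it appears in the sample report above.
HEALTH_PATTERN = re.compile(
    r"SMART overall-health self-assessment test result:\s*(\w+)"
)


def overall_health(report: str) -> Optional[str]:
    """Return the health verdict (e.g. 'PASSED') or None if the line is missing."""
    match = HEALTH_PATTERN.search(report)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "=== START OF READ SMART DATA SECTION ===\n"
        "SMART overall-health self-assessment test result: PASSED\n"
    )
    print(overall_health(sample))
```

Returning None when the line is absent lets the caller distinguish a drive that reports a verdict from a report that could not be read at all.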

SMART attributes (section 3 above), on the other hand, are not part of a standard but are vendor-specific additions. The lower-numbered attributes are more likely to be found than the higher ones. Attribute values, too, are vendor-specific rather than standard representations. Consequently, the meaning of the values cannot be generalized. Columns VALUE, WORST, and THRESH most often indicate the present value, the worst value recorded in the lifetime of the device, and a threshold that triggers an attribute-failure flag when the value drops to or below it, respectively. However, attributes that function as counters most often do not have meaningful threshold values (THRESH reads 000), and sometimes 'WORST' values are identical to those of the 'VALUE' column and move up and down. The reason for this seems to be that SMART attributes are not standardized and need to be evaluated according to vendor-specific criteria.
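The attribute table can be turned into records for further processing. The sketch below assumes the ten-column layout of the sample output and follows the smartmontools convention that an attribute counts as failed when its normalized VALUE is less than or equal to THRESH. The Attribute class and the parse_attributes function are names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attribute:
    ident: int
    name: str
    flag: str
    value: int
    worst: int
    thresh: int
    kind: str         # TYPE column: Pre-fail or Old_age
    updated: str      # UPDATED column: Always or Offline
    when_failed: str  # WHEN_FAILED column: '-' if the attribute never failed
    raw: str          # RAW_VALUE column, kept as text because its format varies

    def at_or_below_threshold(self) -> bool:
        """True if the normalized value has reached the threshold."""
        return self.value <= self.thresh


def parse_attributes(lines: List[str]) -> List[Attribute]:
    """Parse attribute rows; header and unrelated lines are skipped."""
    attributes = []
    for line in lines:
        # Split into at most ten fields so that a raw value such as
        # '0 (2000, 0)' stays together in the last field.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not all(field.isdigit() for field in fields[3:6]):
            continue
        attributes.append(Attribute(
            int(fields[0]), fields[1], fields[2],
            int(fields[3]), int(fields[4]), int(fields[5]),
            fields[6], fields[7], fields[8], fields[9].strip(),
        ))
    return attributes
```

With the rows of the sample output, no attribute reaches its threshold. Counters with a THRESH of 000 can never do so, since their normalized value stays above zero.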

smartmontools can be configured to send emails when one or more of the attributes are failing. (smartd runs as a daemon that polls a device every 30 minutes and logs attribute changes.) If an attribute is failing, column WHEN_FAILED will read FAILING_NOW and thus indicate that a serious condition exists. If the value returns to the normal working range, the column will read In_the_past.
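A script that inspects a captured report can look for these markers in the WHEN_FAILED column of the attribute table, the ninth column in the sample output. The following sketch assumes that column layout; the function name failure_states is chosen for illustration.

```python
from typing import Dict, List


def failure_states(lines: List[str]) -> Dict[str, str]:
    """Map attribute names to their WHEN_FAILED entry, omitting '-' entries."""
    states = {}
    for line in lines:
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten fields.
        if len(fields) >= 10 and fields[0].isdigit() and fields[8] != "-":
            states[fields[1]] = fields[8]
    return states


def failing_now(lines: List[str]) -> List[str]:
    """Names of attributes that are currently marked FAILING_NOW."""
    return [name for name, state in failure_states(lines).items()
            if state == "FAILING_NOW"]
```

For the sample output above both functions return empty results, since every row shows '-' in the WHEN_FAILED column.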

SMART Tests

SMART-enabled hard drives offer a number of self-tests that can be initiated via smartctl. The most commonly available and useful tests are the short and the extended offline test. A short offline test checks the electrical and mechanical operation and the read performance of the drive. It also tests sections of the drive's surface. An extended test checks the entire surface of the drive. Some SMART attributes are updated only during offline self-tests. (See column UPDATED of the SMART attribute section: Offline indicates that the value is updated only during a test.) A short test completes within a few minutes; an extended test takes considerably longer, depending on the size of the drive. The General SMART Values in the smartctl output (see section 2 above) indicate the time needed for test completion as recommended polling times.
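Both steps, starting a test and finding out how long to wait for it, can be scripted. The sketch below builds the smartctl call for a self-test (option -t with the test names short and long) and reads the recommended polling times from a captured report. The function names and the mapping of 'extended' to the smartctl test name 'long' are choices made for illustration.

```python
import re
from typing import Dict, List

# smartctl calls the extended self-test 'long'.
TEST_NAMES = {"short": "short", "extended": "long"}

POLLING_PATTERN = re.compile(
    r"(\w+) self-test routine\s+recommended polling time:\s*\(\s*(\d+)\)\s*minutes"
)


def self_test_command(device: str, kind: str) -> List[str]:
    """Build the smartctl call that starts a short or extended self-test."""
    return ["smartctl", "-t", TEST_NAMES[kind], device]


def polling_minutes(report: str) -> Dict[str, int]:
    """Map test kinds (lower case) to their recommended polling time in minutes."""
    return {kind.lower(): int(minutes)
            for kind, minutes in POLLING_PATTERN.findall(report)}
```

For the sample report, polling_minutes yields 2 minutes for the short and conveyance tests and 82 minutes for the extended test, which a scheduler could use to decide when to query the result.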

The results of self-tests appear in their own section of the smartctl output (see section 5 above). The test type, the completion status of the test, the start time (measured in lifetime hours), and the location of the first error, if any, are recorded. Note that a test may be aborted if performance demands (a high volume of reads/writes) intervene.
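The self-test log lends itself to line-by-line parsing. The following sketch assumes the row layout of the sample output (entry number, two-word test description, status, remaining percentage, lifetime hours, LBA of the first error); the SelfTest type and the function names are chosen for illustration.

```python
import re
from typing import List, NamedTuple


class SelfTest(NamedTuple):
    number: int
    description: str
    status: str
    remaining_percent: int
    lifetime_hours: int
    first_error_lba: str  # '-' when no error location was recorded


ROW_PATTERN = re.compile(
    r"^#\s*(\d+)\s+(\S+ \S+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_self_tests(lines: List[str]) -> List[SelfTest]:
    """Parse self-test log rows; lines that are not log rows are skipped."""
    tests = []
    for line in lines:
        match = ROW_PATTERN.match(line.strip())
        if match:
            number, description, status, remaining, hours, lba = match.groups()
            tests.append(SelfTest(int(number), description, status,
                                  int(remaining), int(hours), lba))
    return tests


def not_completed(tests: List[SelfTest]) -> List[SelfTest]:
    """Tests whose status is anything other than 'Completed without error'."""
    return [test for test in tests if test.status != "Completed without error"]
```

Applied to the sample log, not_completed returns entries 3 and 20, the two tests that were aborted by the host.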

Not all hard drives support extended temperature logs as in the example above (see section 6 above, SCT Status). Drives that do support them log temperatures at regular time intervals. Sampling is most often done in minutes, but other time intervals are possible, too. The header of the temperature log indicates the sampling period, present and past registered temperatures, as well as maximum and minimum working temperatures. SCT logs as in the example, obtained by running smartctl and collecting its output at regular time intervals, make it possible to obtain a complete picture of a drive's working temperatures.
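To build such a picture from collected reports, the temperature history rows can be parsed into timestamped readings. This is a minimal sketch that assumes the row layout of the sample output (index, date, time, temperature); rows with an unknown temperature ('?') and the '... skipped' placeholder rows are ignored. The function names are chosen for illustration.

```python
import re
from datetime import datetime
from typing import List, Tuple

ROW_PATTERN = re.compile(
    r"^\s*\d+\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2})\s+(-?\d+)\b"
)

Reading = Tuple[datetime, int]


def parse_temperatures(lines: List[str]) -> List[Reading]:
    """Return (timestamp, degrees Celsius) pairs for rows with a known temperature."""
    readings = []
    for line in lines:
        match = ROW_PATTERN.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M")
            readings.append((stamp, int(match.group(2))))
    return readings


def temperature_range(readings: List[Reading]) -> Tuple[int, int]:
    """Lowest and highest temperature among the readings."""
    temperatures = [temperature for _, temperature in readings]
    return min(temperatures), max(temperatures)
```

Ignoring the placeholder rows does not change the range, because smartctl uses them only for runs of consecutive entries with the same temperature as the neighbouring rows.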
