Project Description

WebFamulus is a web application that uses the Quartz scheduler framework for process automation. It provides a web interface to schedule, monitor, and interact with Java processes that are configured as 'jobs'. Access to user and administrative interfaces is based on permissions and group ownership. Likewise, the configuration of jobs, alerts, performance monitoring, and statistics is based on group ownership and the permissions granted to an individual user. The application is currently under development, and a beta version is available to selected clients.


Typical applications for the Quartz framework are network and hardware monitors, database maintenance, reporting, etc. Tedious, time-consuming tasks can run repeatedly, following sophisticated schedules that take into account selected dates and weekdays, black-out times, holidays, etc. The field of application, however, is limited only by what the programming language can provide. Quartz makes it possible to schedule and automatically execute just about anything that can be programmed in Java.


In addition, WebFamulus allows jobs to be executed as sequences in which each step can (optionally) depend on the exit status of the preceding job. Furthermore, the application is backed by an extensive database that lets non-administrative users create, configure, interact with, and monitor their own jobs, alerts, and reports on-line or via electronic mail.
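The idea of chaining jobs on the preceding job's exit status can be illustrated with a minimal sketch. This is not WebFamulus or Quartz code: the `Job` type, the `run_sequence` function, and the `check_previous` flag are hypothetical names chosen for illustration, and the sketch assumes that an exit status of 0 means success.

```python
from typing import Callable, List, Tuple

# Hypothetical job type: a callable that returns an exit status (0 = success).
Job = Callable[[], int]


def run_sequence(steps: List[Tuple[str, Job, bool]]) -> List[Tuple[str, int]]:
    """Run jobs in order and return (name, exit_status) for each job that ran.

    Each step is (name, job, check_previous). When check_previous is True and
    the preceding job exited with a non-zero status, the sequence stops.
    When check_previous is False, the preceding status is not evaluated.
    """
    results: List[Tuple[str, int]] = []
    previous_status = 0
    for name, job, check_previous in steps:
        if check_previous and previous_status != 0:
            break  # preceding job failed and this step asked for evaluation
        previous_status = job()
        results.append((name, previous_status))
    return results
```

A sequence such as `[("collect", collect_job, False), ("parse", parse_job, True)]` would then run the parser only if the collector returned 0.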


2013-12-06

4. Using Quartz to Monitor SMART Hard Drives as an Automated Service

Processing Steps

Using the Quartz Scheduler to monitor SMART-enabled hard drives makes it possible to automate the entire process from data generation and collection to reporting. Key advantages are minimal user intervention and the compilation of a consistent long-term data set. Quartz provides a powerful and highly flexible scheduling framework that can execute hundreds or thousands of computing tasks repeatedly without user intervention. Once set up, a schedule will run until the scheduler is interrupted or shut down. This is especially advantageous for recurring and repetitive tasks such as monitoring hardware. When the data gathered by such tasks is not only evaluated for possible errors or faulty behavior but also stored on a long-term basis, it is possible to compile a record of 'normal operation' from which newly occurring anomalies stand out more clearly.

SMART data is generated, of course, by running smartctl with the appropriate options on the host machine (see 1. below). If the host machine is accessible via SSH, WebFamulus can execute smartctl itself and capture the output. Alternatively, a script (Python, Perl, etc.) installed on the host machine and configured to be executed by a utility program such as cron can run smartctl and forward the output via email or file upload to WebFamulus.
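A client-side script of the kind described above might look roughly like the following sketch. It assumes that smartctl is on the PATH and that the script runs with sufficient privileges. The function names and the injectable `run` parameter are illustrative choices, not part of WebFamulus, and forwarding the output (email or upload) is left out.

```python
import subprocess
from typing import Callable, List


def smartctl_command(device: str) -> List[str]:
    # "-a" asks smartctl to print all SMART information for the device.
    return ["smartctl", "-a", device]


def collect_smart_report(device: str, run: Callable = subprocess.run) -> str:
    """Run smartctl for one device and return its text output.

    smartctl encodes drive conditions in a bitmask exit status, so a
    non-zero status does not necessarily mean the command failed; for
    that reason check=False is used and the output is always returned.
    """
    completed = run(
        smartctl_command(device),
        capture_output=True,
        text=True,
        check=False,
    )
    return completed.stdout
```

A cron entry could call such a script once a day; the device path (for example `/dev/sda`) depends on the host.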

SMART-Data Processing

The Smartmontools daemon (smartd) can be configured to send out emails when a new error is detected in a logfile or when a SMART attribute is failing. However, error logs are only updated when self-tests are run and, hence, an error may only be detected after a test has been initiated. SMART-enabled devices usually can run two self-tests: short and long. A short test, which takes about two to three minutes, updates the SMART attributes, checks the electrical and mechanical parts, and performs spot checks on the disk surface. A long test--which scans the entire disk surface for errors--may take hours to complete, depending on the size of the disk. Regular testing of hard drives is therefore imperative in order to minimize the danger of data loss due to equipment failure.

WebFamulus uses the Quartz scheduler and an extensive background database to generate, store, and analyze SMART data. All steps are automated and run on a daily, weekly, or longer schedule but, through a web interface, they can also be executed on demand. Furthermore, some jobs are executed as sequences. For example, before a daily report is issued, a collector job gathers all SMART reports that have not been processed and initiates a parsing job; then graphs and summary reports are compiled, and a mailer job sends the report to the stakeholders. (That is, steps 2 to 6 are executed as a sequence before step 7 is initiated. Otherwise, step 2, for example, may be running independently all the time at regular intervals.)

The advantages of regular testing and reporting are obvious: they establish a baseline of normal operation against which errors and faulty behavior are more easily noticed. A rise in the average working temperature, for example, could be an indication of a fan not working properly. This may go unnoticed if only occasional spot checks are done. With automated SMART testing and long-term data storage, the chances for early detection--before a serious hardware failure--are much higher.
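As a rough illustration of comparing new readings against a stored baseline, the sketch below flags a rise in the average temperature. The function names and the default tolerance of 3 degrees are assumptions made for this example; they do not describe how WebFamulus evaluates its data.

```python
from statistics import mean
from typing import Sequence


def temperature_rise(baseline: Sequence[float], recent: Sequence[float]) -> float:
    """Difference between the recent average and the long-term average."""
    return mean(recent) - mean(baseline)


def exceeds_baseline(
    baseline: Sequence[float],
    recent: Sequence[float],
    max_rise: float = 3.0,  # hypothetical tolerance in degrees
) -> bool:
    """True when the recent average lies more than max_rise above the baseline."""
    return temperature_rise(baseline, recent) > max_rise
```

The longer the stored history, the more meaningful the baseline average becomes, which is the point of keeping the data long-term.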

Yet, even a small office may have tens of computers to monitor, and testing and evaluating can be a time-consuming task. Automating the process with a Quartz-based application like WebFamulus allows hundreds or thousands of hard drives to be monitored every day with minimal intervention beyond the initial set-up. And, having test results stored in a database makes it possible to document past performance and maintenance, which is very useful for system administrators when hard disk monitoring is part of a service agreement.

2013-12-05

3. SMART temperature surveys

Data Collection

Hard disks with SCT (SMART Command Transport) capabilities maintain a log of the working temperature of the drive. Most drives register entries at one-minute intervals. The header of the SCT section of the smartctl output indicates the registration interval.

SCT-enabled drives are capable of recording temperatures in the range of ±127°C. When the device is powered up, an illegal value is entered so that the starting point is registered. (In smartctl output, this value appears as a question mark '?'.) The log itself is a circular buffer. Hence, when the temperature log is read, only the values after the last question mark (if there is one) have a correct time reference (namely, N entries at the logging interval before the date and time the SMART device was queried).
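The time reference described above can be reconstructed as in the following sketch. It assumes that the log entries have already been parsed into a list ordered from oldest to newest, that the '?' marker is represented as `None`, and that the newest entry coincides with the time the device was queried. The function name and parameters are illustrative.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def timestamp_log(
    entries: List[Optional[int]],
    read_time: datetime,
    interval_minutes: int = 1,
) -> List[Tuple[datetime, int]]:
    """Attach timestamps to the entries recorded after the last power-up marker.

    entries: temperatures, oldest first; None stands for the '?' marker.
    read_time: date and time at which the drive was queried.
    """
    # Find the position just after the last power-up marker (if any).
    start = 0
    for index, value in enumerate(entries):
        if value is None:
            start = index + 1
    usable = entries[start:]

    # Count backwards from the query time, one interval per entry.
    stamped = []
    for position, temperature in enumerate(usable):
        age = (len(usable) - 1 - position) * interval_minutes
        stamped.append((read_time - timedelta(minutes=age), temperature))
    return stamped
```

Entries before the last marker are dropped because the length of the power-off gap is unknown.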

Reporting

I. Daily Reports

Temperature surveys reflect the usage patterns of computers. The following graph illustrates the typical use of a laptop; this one was put into sleep mode twice on the day sampled.


The same laptop running continuously as a low-volume server.




The following example shows an office workstation during a pattern of regular utilization. The machine is used for image processing.



The same machine showing a pattern of high utilization. This pattern, a high temperature maintained over a long period of time, occurs occasionally. When it occurs, it is always the same: an ascending curve to a high temperature that is maintained for several hours. Unfortunately, it has been impossible to ascertain what software was running at the time. The staff reported normal use. A mechanical problem, such as a malfunctioning fan, did not appear to be the cause and was ruled out.



II. Monthly Reports

Summary reports over longer time periods help to pinpoint irregularities. Of particular interest are, of course, unusually high and low temperatures. A rise in the average curve indicates prolonged operation at higher temperatures. (See 09/13 and 09/15, which reflect a brief and a prolonged period of high disk use, respectively.) Upward spikes in the low temperature values occur when the first SMART test of the day ran some time after the computer was turned on.
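The daily low, high, and average values that such a summary report plots could be derived from timestamped readings as in this sketch. The input format (a list of `(datetime, temperature)` pairs) and the function name are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def daily_summary(
    readings: Iterable[Tuple[datetime, float]],
) -> Dict[date, Tuple[float, float, float]]:
    """Group readings by calendar day and return (low, high, average) per day."""
    by_day: Dict[date, List[float]] = defaultdict(list)
    for timestamp, temperature in readings:
        by_day[timestamp.date()].append(temperature)
    return {
        day: (min(values), max(values), mean(values))
        for day, values in sorted(by_day.items())
    }
```

A day on which the first reading was taken late shows up as an unusually high "low" value, which matches the upward spikes mentioned above.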



III. Annual Reports

Temperature measurements from computers that do not operate continuously (desktop workstations and laptops, for example) reflect the ambient temperatures. The data in the following graph was collected from a laptop that is usually turned off overnight, so the first reading after start-up usually provided the lowest temperature of the day. (The high spikes in the low temperature indicate that a SMART test was not run within the first two hours after the computer was started.)
Since the power supply that recharges the battery is connected, the temperatures inside the casing are of course higher than the ambient room temperature. The low temperature range (in blue) before July and in September reflects seasonal temperatures in Italy: warm, summery temperatures before July and in September, fall temperatures in October and November, and finally the onset of winter in January. The periods of July, August, mid-December, and early January were measured in Germany: colder than Italy during the summer, but warmer in the winter. Interestingly, unlike the ambient temperatures in Italy, those in Germany remain more or less constant. Conclusion: building construction in Germany proves to be more weatherproof!

2013-12-04

2. Data Collection and Reporting


Server Side

Data collection, processing, and reporting are completely automated. The server is configured to accept data file uploads from registered clients and to parse the data files. It issues alerts when configured thresholds are passed and sends summary reports at pre-defined intervals.

However, the server may also initiate data collection or interaction with smartmontools on its own. To contact computers on a local or remote network, it uses SSH, runs smartmontools via the command line, and then captures and processes the output data. A prerequisite is, of course, that the server is configured with the correct IP address, user name, and password, and that the user has been granted access and execution permissions on the client side.
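A threshold check of the kind that triggers an alert might be sketched as follows for temperature values. The function name, the message text, and the default safety margin of 10 degrees are hypothetical; the actual alert rules are configurable and are not described in this text.

```python
from typing import List


def temperature_alerts(
    computer: str,
    current_max: int,
    recommended_max: int,
    min_margin: int = 10,  # hypothetical safety margin in degrees
) -> List[str]:
    """Return alert messages when the margin to the recommended limit is too small."""
    margin = recommended_max - current_max
    if margin < min_margin:
        return [
            f"{computer}: maximum temperature {current_max} leaves a margin of "
            f"only {margin} below the recommended limit of {recommended_max}"
        ]
    return []
```

For instance, a drive with a current maximum of 52 and a recommended maximum of 60 has a margin of 8 and would raise an alert under the default setting, whereas one at 40 with a limit of 55 would not.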

Client Side

In some environments, however, it is impossible to connect to a computer remotely, either because it is located behind a firewall or because the execution of smartmontools commands requires permissions that cannot be granted to a remote user. In such cases, client-side scripts can issue smartmontools commands and forward any output data to the server via email, file upload, or web service. The scripts can be scheduled for execution with a utility like cron or a locally installed Quartz application.



Sample Daily Report of a Single Hard Drive


Weekly Spreadsheet Report

Company: NetzwerkÜberwachung GmbH
Customer: Müller-Schulz
Reporting period: 2012-01-24 to 2012-01-31 (7 days)
Reports evaluated: 46

1. Hard Disk Overview

Computer  Manufacturer                                 Model                    Power-on (hrs)  Power cycles  Start/stop count  Age
DAK-210   Western Digital Caviar family                WDC WD400BB-60DGA0       17684           34945         42450             2.4
DAK-211   Seagate Barracuda 7200.11 family             ST3320813AS              12397           162           162               2.6
DAK-212   Seagate Barracuda 7200.11 family             ST3320813AS              6559            439           486               1.1
DAK-213   Western Digital Caviar family                WDC WD400BB-60DGA0       16869           2861          2862              5.9
DAK-214   Seagate U7 family                            ST340012A                14766           2116          1                 5.5
DAK-215   Hitachi Deskstar 7K1000.B                    Hitachi HDT721064SLA360  4465            457           461               2.4
DAK-216   Western Digital Caviar SE Serial ATA family  WDC WD800JD-60LSA5       15837           5786          5788              6.7
DAK-217   Western Digital Caviar SE Serial ATA family  WDC WD800JD-60LSA5       2654            780           782               3.8
DAK-218   Western Digital Caviar SE Serial ATA family  WDC WD800JD-60LSA5       9386            680           682               6.1

2. Test Overview

Computer  First test        Last test         Days  Operating hours  Tests run  Passed  Failed  New failing attr.  Prev. failing attr.  LBA errors
DAK-210   2012-01-24 14:00  2012-01-30 14:00  6     132              7          7       0       0                  0                    0
DAK-211   2012-01-24 14:01  2012-01-30 14:01  6     144              7          7       0       0                  0                    0
DAK-212   2012-01-24 14:02  2012-01-28 14:02  4     96               5          5       0       0                  0                    0
DAK-213   2012-01-24 14:03  2012-01-29 14:03  5     57               6          6       0       0                  0                    0
DAK-214   2012-01-24 14:04  2012-01-28 14:04  4     46               5          5       0       0                  0                    0
DAK-215   2012-01-24 14:05  2012-01-28 14:05  4     48               5          5       0       0                  0                    0
DAK-216   2012-01-24 14:06  2012-01-28 14:06  4     40               5          5       0       0                  0                    0
DAK-217   2012-01-25 14:07  -                 0     0                1          1       0       0                  1                    0
DAK-218   2012-01-24 14:08  2012-01-28 14:08  4     40               5          5       0       0                  1                    0

3. Failing Attributes

Computer  Registered  Name  Current value  Lowest value  Previous value  Threshold  Status

4. Self-Test Overview

Computer  Op. hours (first)  Op. hours (last)  Op. hours (count)  Self-tests  Complete  Incomplete  Short  Extended  Conveyance
DAK-210   17552              17684             132                0           0         0           0      0         0
DAK-211   12253              12397             144                10          0         0           5      5         0
DAK-212   6463               6559              96                 8           0         0           3      5         0
DAK-213   16812              16869             57                 0           0         0           0      0         0
DAK-214   14720              14766             46                 4           0         0           3      1         0
DAK-215   4417               4465              48                 4           0         0           3      1         0
DAK-216   15797              15837             40                 2           0         0           1      1         0
DAK-217   2654               2654              0                  0           0         0           0      0         0
DAK-218   9346               9386              40                 4           0         0           3      1         0

5. Temperature Overview

Computer  Max lifetime  Max current  Max recommended  Max margin  Min lifetime  Min current  Min recommended  Min margin
DAK-211   48            40           55               15          20            38           14               24
DAK-212   46            40           55               15          18            38           14               24
DAK-214   47            46           -                -           41            45           -                -
DAK-215   59            52           60               8           20            49           0                49
DAK-216   63            42           65               23          40            40           5                35
DAK-217   56            37           65               28          37            37           5                32
DAK-218   59            44           65               21          40            40           5                35

5a. Maximum Temperature Distribution

Computer  Temperature  Count
DAK-211   40           3
DAK-211   39           3
DAK-211   38           1
DAK-212   40           1
DAK-212   39           2
DAK-212   38           2
DAK-214   46           3
DAK-214   45           2
DAK-215   52           4
DAK-215   49           1
DAK-216   42           1
DAK-216   40           4
DAK-217   37           1
DAK-218   44           2
DAK-218   42           2
DAK-218   40           1