|  | $Id: README,v 1.2 2001/06/21 23:07:06 dwmw2 Exp $ | 
|  | $Log: README,v $ | 
|  | Revision 1.2  2001/06/21 23:07:06  dwmw2 | 
|  | Initial import to MTD CVS | 
|  |  | 
|  | Revision 1.1  2001/06/11 19:34:40  vipin | 
|  | Added README file to dir. | 
|  |  | 
|  |  | 
|  | This is the README file for the "checkfs" power fail test program. | 
|  | By: Vipin Malik | 
|  |  | 
|  | NOTE: This program requires an external "power cycling box" | 
|  | connected to one of the com ports of the system under test. | 
|  | This power cycling box should wait for a random amount of time | 
|  | after it receives an "ok to power me down" message over the | 
|  | serial port, and then yank power to the system under test. | 
|  | (The box that I rigged up and tested with waits anywhere from | 
|  | 0 to ~40 seconds). | 
|  |  | 
|  |  | 
|  | It should then restore power after a few seconds and wait for the | 
|  | message again. | 
|  |  | 
|  |  | 
|  | ABOUT: | 
|  |  | 
|  | This program's primary purpose is to test the reliability | 
|  | of various file systems under Linux. | 
|  |  | 
|  | SETUP: | 
|  |  | 
|  | You need to set up the file system you want to test and run the | 
|  | "makefiles" program ONCE. This creates a set of files that are | 
|  | required by the "checkfs" program. | 
|  |  | 
|  | Also copy the "checkfs" executable program to the same dir. | 
|  |  | 
|  | Then you need to make sure that the program "checkfs" is called | 
|  | automatically on startup. You can customise the operation of | 
|  | the "checkfs" program by passing it various cmd line arguments. | 
|  | Run "checkfs -?" for more details. | 
|  |  | 
|  | ****NOTE******* | 
|  | Make sure that you call the checkfs program only after you have | 
|  | mounted the file system you want to test (this is obvious), but | 
|  | also after you have run any "scan" utilities to check for and | 
|  | fix any file system errors. e2fsck is one such utility for the | 
|  | ext2 file system. For an automated setup you of course need to | 
|  | provide these scan programs to run in standalone mode (-f -y | 
|  | flags for e2fsck for example). | 
|  |  | 
|  | File systems like JFFS and JFFS2 do not have any such external | 
|  | utilities and you may call "checkfs" right after you have mounted | 
|  | the respective file system under test. | 
|  |  | 
|  | There are two ways you can mount the file system under test: | 
|  |  | 
|  | 1. Mount your root fs on a "standard" fs like ext2 and then | 
|  | mount the file system under test (which may be ext2 on another | 
|  | partition or device) and then run "checkfs" on this mounted | 
|  | partition OR | 
|  |  | 
|  | 2. Make the fs under test, AND the device that you have put it on, | 
|  | your root fs and run "checkfs" on the root device (i.e. "/"). | 
|  | You can of course still run checkfs under a separate dir | 
|  | under your "/" root dir. | 
|  |  | 
|  | I have found the second method to be a particularly stringent | 
|  | arrangement (and thus preferred when you are trying to break | 
|  | something). | 
|  |  | 
|  | Using this arrangement I was able to find that JFFS clobbered | 
|  | some "sister" files on the root fs even though "checkfs" would | 
|  | run fine through all its own check files. | 
|  |  | 
|  | (I found this out when one of the clobbered sister files happened | 
|  | to be /bin/bash. The system refused to run rc.local thus | 
|  | preventing my "checkfs" program from being launched :) | 
|  |  | 
|  | "checkfs": | 
|  |  | 
|  | The "formatting" reliability of the fs as well as the file data integrity | 
|  | of files on the fs can be checked using this program. | 
|  |  | 
|  | "formatting" reliability can only be checked via an indirect method. | 
|  | If there are severe formatting reliability issues with the file system, | 
|  | it will most likely cause other system failures that will prevent this | 
|  | program from running successfully on a power up. This will prevent | 
|  | an "ok to power me down" message from going out to the power cycling | 
|  | black box and prevent power being turned off again. | 
|  |  | 
|  | File data reliability is checked more directly. A fixed number of | 
|  | files are created in the current dir (using the program "makefiles"). | 
|  |  | 
|  | Each file has a random number of bytes in it (set by using the | 
|  | -s cmd line flag). The number of "ints" in the file is stored as the | 
|  | first "int" in it (note: 0 length files are not allowed). Each file | 
|  | is then filled with random data and a 16 bit CRC appended at the end. | 
|  |  | 
|  | When "checkfs" is run, it runs through all files (with predetermined | 
|  | file names)- one at a time- and checks for the number of "int's" | 
|  | in it as well as the ending CRC. | 
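The file layout and the check described above can be sketched as
follows. This is only an illustration and not the actual "makefiles"
or "checkfs" source. Assumptions: the leading count covers the random
"ints" only (not itself), the CRC covers everything before it, and the
CRC is CRC-16/CCITT-FALSE (the README only says "16 bit CRC"). The names
crc16(), build_check_image() and check_image() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* ASSUMPTION: CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF).
 * The real checkfs may use a different 16 bit CRC. */
static uint16_t crc16(const unsigned char *buf, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Build the in-memory image of one check file:
 *   [int count][count random ints][16 bit CRC over all preceding bytes]
 * Returns a malloc'd buffer (caller frees) and its size in *out_len,
 * or NULL on error. */
static unsigned char *build_check_image(int count, size_t *out_len)
{
    if (count <= 0)                 /* 0 length files are not allowed */
        return NULL;
    size_t data_len = sizeof(int) * ((size_t)count + 1);
    unsigned char *img = malloc(data_len + sizeof(uint16_t));
    if (img == NULL)
        return NULL;
    memcpy(img, &count, sizeof count);
    for (int i = 0; i < count; i++) {
        int v = rand();
        memcpy(img + sizeof(int) * ((size_t)i + 1), &v, sizeof v);
    }
    uint16_t crc = crc16(img, data_len);
    memcpy(img + data_len, &crc, sizeof crc);
    *out_len = data_len + sizeof crc;
    return img;
}

/* Return 1 if the image has a consistent count and a matching CRC,
 * 0 if it looks corrupt. */
static int check_image(const unsigned char *img, size_t len)
{
    int count;
    uint16_t stored;
    if (len < 2 * sizeof(int) + sizeof stored)
        return 0;
    size_t data_len = len - sizeof stored;
    memcpy(&count, img, sizeof count);
    if (count <= 0 || data_len % sizeof(int) != 0 ||
        data_len / sizeof(int) != (size_t)count + 1)
        return 0;
    memcpy(&stored, img + data_len, sizeof stored);
    return stored == crc16(img, data_len);
}
```

A file truncated by a power failure fails the count check, and a file
with a mix of old and new data fails the CRC check.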
|  |  | 
|  | The program exits if the number of corrupt files is greater | 
|  | than a user specified parameter (set by using the -e cmd line flag). | 
|  |  | 
|  | If the number of corrupt files does not exceed this parameter, the corrupt | 
|  | files are repaired and operation resumes as explained below. | 
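The decision made after the startup scan can be sketched like this.
It is an illustration only: judge_scan() and the enum are hypothetical
names, and the sketch assumes that a corrupt count equal to the -e value
is still acceptable (this README describes -e 1 as allowing at most 1
corrupt file, and -e 0 as allowing none).

```c
/* Outcome of the startup scan (hypothetical names). */
enum scan_result { SCAN_CONTINUE, SCAN_STOP };

/* corrupt_files: number of check files that failed the scan.
 * max_errors:    value given with the -e cmd line flag.
 * ASSUMPTION: equality is tolerated, i.e. "at most max_errors". */
static enum scan_result judge_scan(int corrupt_files, int max_errors)
{
    return (corrupt_files > max_errors) ? SCAN_STOP : SCAN_CONTINUE;
}
```

With -e 0 a single corrupt file stops the test, while with -e 1 one
corrupt file is repaired and the test carries on.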
|  |  | 
|  | The idea behind allowing a user specified number of corrupt files is as | 
|  | follows: | 
|  |  | 
|  | If you are testing for "formatting" reliability of a fs, and for | 
|  | the data reliability of "other" files present on the fs, use -e 1. | 
|  | "other" files are defined as sister files on the fs, not being written to | 
|  | by the "checkfs" test program. | 
|  |  | 
|  | As mentioned, in this case you would set -e 1, or allow at most 1 file | 
|  | to be corrupt each time after a power fail. This would be the file | 
|  | that was probably being written to when power failed (and CRC was not | 
|  | updated to reflect the new data being written). You would check file | 
|  | systems like ext2 etc. with such a configuration. | 
|  | (You cannot expect these file systems to guarantee that either your | 
|  | new data or your old data is present in full in the file if power fails | 
|  | during the write. Such a guarantee is called "roll back and recover".) | 
|  |  | 
|  | With JFFS2 I tested for such "roll back and recover" file data reliability | 
|  | by setting -e 0 and making sure that all writes to the file being | 
|  | updated are done in a *single* write(). | 
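Doing the whole update in a single write() means the new file contents
(count, data and CRC) are assembled in memory first and then handed to
the kernel in one call. A minimal sketch, assuming a POSIX system; the
function name update_in_single_write() is hypothetical and this is not
the actual checkfs code:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Overwrite the start of `path` with `len` bytes from `img` using exactly
 * one write() call. The file is deliberately NOT truncated first, since a
 * truncate would be a second, separate modification of the file.
 * Returns 0 on success, -1 on error or short write. */
static int update_in_single_write(const char *path, const void *img,
                                  size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, img, len);   /* the one and only write() */
    int rc = (n == (ssize_t)len) ? 0 : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A single write() call only removes the application as a source of
partial updates. Whether the file system itself applies that write all
at once is exactly what this test is probing.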
|  |  | 
|  | This is how I found that JFFS2 does NOT (yet) support this functionality. | 
|  | (There was a great debate if this was a bug or a feature that was lacking | 
|  | or even an issue at all. See the mtd archives for more details). | 
|  |  | 
|  | In other words, JFFS2 will partially update a file on FLASH even before | 
|  | the write() command has completed, thus leaving part old data part new | 
|  | data in your file if power failed in the middle of a write(). | 
|  |  | 
|  | This is bad functionality if you are updating a binary structure or a | 
|  | CRC protected file (as in our case). | 
|  |  | 
|  |  | 
|  | If All Files Check Out OK: | 
|  |  | 
|  | On the startup scan, if there are no more errors than specified by the "-e" flag, | 
|  | an "ok to power me down" message is sent via the specified com port. | 
|  |  | 
|  | The actual format of this message will depend on the format expected | 
|  | by the power cycling box that will receive this message. One may customise | 
|  | the actual message that goes out in the "do_pwr_dn()" routine in "comm.c". | 
|  |  | 
|  | This routine is called with an open file descriptor to the comm port that | 
|  | this message needs to go out over and the count of the current power | 
|  | cycle (in case your power cycling box can display/log this count). | 
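A routine of this shape could look like the sketch below. It is not the
real do_pwr_dn() from comm.c: the name send_power_down_msg(), its exact
signature and the "PWRDN_OK <count>" message text are all made up for
illustration. Use whatever format your own power cycling box expects.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Send a (hypothetical) "ok to power me down" message over an already
 * open comm port descriptor, including the current power cycle count.
 * Returns 0 on success, -1 on error. */
static int send_power_down_msg(int fd, int cycle)
{
    char msg[64];
    /* ASSUMPTION: message text and CR/LF terminator are invented here. */
    int len = snprintf(msg, sizeof msg, "PWRDN_OK %d\r\n", cycle);
    if (len < 0 || (size_t)len >= sizeof msg)
        return -1;
    ssize_t n = write(fd, msg, (size_t)len);
    return (n == (ssize_t)len) ? 0 : -1;
}
```

Because the routine only needs a file descriptor, it can be tried out
against a pipe or a regular file before the serial port is wired up.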
|  |  | 
|  | After this message has been sent out, the checkfs program goes into | 
|  | a while(1) loop of writing new data (with CRC), one at a time, into | 
|  | all the "check files" in the dir. | 
|  |  | 
|  | Its life comes to a sudden end when power is asynchronously pulled from | 
|  | under its feet (by your external power cycling box). | 
|  |  | 
|  | It comes back to life when power is restored and the system boots and | 
|  | checkfs is called from the rc.local script file. | 
|  |  | 
|  | The cycle then repeats till a problem is detected, at which point | 
|  | the "ok to power me down" message is not sent and the cycle stops | 
|  | waiting for the user to examine the system. | 
|  |  | 
|  |  | 
|  |  | 
|  |  |