[ale] Remounting R/W After Aborted Journal

Drew Wade andrewiwade at gmail.com
Tue Jan 11 14:46:56 EST 2011


Ok so I think I understand the situation now.

I've run into similar situations where the filesystem went read-only
because its I/O could not complete in time (the LUN was shared with other
servers that were hitting it with intense I/O).

There is a kernel setting that adjusts the command timeout for a SCSI disk
device (the value is in seconds; installing VMware Tools raises it from
the default to something like 180).

http://communities.vmware.com/thread/257251
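
To see what the timeout currently is, something like the sketch below
should do. It assumes the usual sysfs layout, where each SCSI disk exposes
its timeout in seconds at /sys/block/sdX/device/timeout. The function name
and the optional root argument are made up for illustration.

```shell
# List the SCSI command timeout (in seconds) for every sd* disk.
# The optional first argument overrides the sysfs root (assumed to be /sys);
# it is only there so the function can be tried against a fake tree.
show_scsi_timeouts() {
    root="${1:-/sys}"
    for f in "$root"/block/sd*/device/timeout; do
        [ -r "$f" ] || continue      # glob matched nothing, or file unreadable
        dev=${f#"$root"/block/}      # strip the leading path
        dev=${dev%%/*}               # keep only the device name
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}
```

Calling show_scsi_timeouts with no argument prints one "device seconds"
pair per disk found.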


So if you often hit this problem in your iSCSI SAN setup, you can look at
raising the timeout value for that particular iSCSI disk (I know you
aren't using ESX in this case).
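
A rough sketch of changing the value for one disk, again assuming the
sysfs path /sys/block/sdX/device/timeout. The function name, the device
name in the usage line, and the 180-second figure are only examples, and
a change made this way does not survive a reboot.

```shell
# Set the SCSI command timeout (in seconds) for a single disk.
# Usage: set_scsi_timeout DEVICE SECONDS [SYSFS_ROOT]
# Not persistent: the value is lost on reboot or when the device is re-added.
set_scsi_timeout() {
    dev="$1"; secs="$2"; root="${3:-/sys}"
    case "$secs" in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 2 ;;
    esac
    f="$root/block/$dev/device/timeout"
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }
    echo "$secs" > "$f"
}
```

For example, run as root: set_scsi_timeout sdc 180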


As for the journal abort error, I'd have to stick with fsck.  fsck is
going to read the journal and replay what hasn't been committed.  The
abort makes me think that some bad data was written to the journal, and
then maybe the iSCSI disk was unavailable to complete its I/O request,
leaving an incomplete journal write.

fsck will ask whether you want to ignore or remove this bad, half-written
journal entry.
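
The unmount / fsck / remount sequence could be sketched roughly as
follows. The function name, the DRY_RUN switch, and the device and mount
point in the usage line are placeholders of mine, not anything from your
setup. It relies on e2fsck returning 0 for "no errors" and 1 for "errors
corrected", and refuses to remount on anything higher.

```shell
# Unmount, force a check, and remount only if e2fsck reports success.
# Usage: fsck_and_remount DEVICE MOUNTPOINT
# Set DRY_RUN=1 to print the commands instead of running them.
fsck_and_remount() {
    dev="$1"; mnt="$2"
    if [ -n "${DRY_RUN:-}" ]; then run=echo; else run=; fi
    $run umount "$mnt" || return 1
    rc=0
    $run e2fsck -f "$dev" || rc=$?
    # e2fsck exit status: 0 = no errors, 1 = errors corrected;
    # anything higher needs a closer look before mounting read/write.
    [ "$rc" -le 1 ] || return "$rc"
    $run mount "$dev" "$mnt"
}
```

For example: DRY_RUN=1; fsck_and_remount /dev/vg0/data /srv/data
prints the three commands without touching anything.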

As for your other volumes, they may have just timed out against their iSCSI
backend device without having any write I/O requested during the timeout.

In that case, when you remounted the logical volume, ext3 would not have
seen any incomplete journal writes and would have treated the filesystem as
fine.
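
If you would rather check than assume, the superblock records both the
filesystem state and whether the journal still needs replaying. The
sketch below parses the header that "dumpe2fs -h" (or "tune2fs -l")
prints; the function name is made up and the device in the usage line is
a placeholder.

```shell
# Summarise ext3 superblock health from `dumpe2fs -h` output on stdin.
# Prints the "Filesystem state" value and whether the needs_recovery
# feature flag (journal not yet replayed) is present.
fs_health() {
    awk -F': *' '
        BEGIN { state = "unknown"; rec = "no" }
        /^Filesystem state:/    { state = $2 }
        /^Filesystem features:/ { if ($2 ~ /needs_recovery/) rec = "yes" }
        END { printf "state=%s needs_recovery=%s\n", state, rec }
    '
}
```

For example: dumpe2fs -h /dev/vg0/data 2>/dev/null | fs_health
It is best run against an unmounted volume, since as far as I know the
needs_recovery flag is normally set while an ext3 filesystem is mounted
read/write.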


On Tue, Jan 11, 2011 at 2:16 PM, Brian Pitts <brian at polibyte.com> wrote:

> On 01/11/2011 01:29 PM, Drew Wade wrote:
> > Jim,
> >
> > You really need to fsck that volume to correct the problem.
> >
> > Since it is in read only mode, you need to umount it or umount -l it if
> > it doesn't respond to umount.
> >
> > Then fsck the logical volume.  Then once that completes you need to
> > remount it.  Even if the application is happily running, if it tries to
> > retrieve a piece of data on an errored filesystem you risk feeding it bad
> > data which it could send downstream (to a DB or another post processing
> > application) and run into even more problems.
> >
> > I'd suggest telling the customer that you need to fix the filesystem and
> > get some outage time from the app owners, then umount and fsck it.  If
> > that fails, you'll have to go to single user mode and fsck it (just make
> > sure you unmount it even in single user mode before the fsck).
>
> FYI it was me asking the question, not Jim.
>
> On three other volumes that suffered the same problem, I already
> unmounted and remounted without an fsck. These are ext3 filesystems with
> the journal in ordered mode.
>
> I am not sure if I should be treating this situation (someone unplugging
> the connection to the storage array) any differently than I would treat
> someone unplugging the power cables from the server itself. In that
> case, I would expect the journal to keep filesystem metadata consistent.
> There might be corrupt data if a program was writing inside an existing
> file (instead of extending), but this isn't something fsck can fix.
>
> http://www.ibm.com/developerworks/library/l-fs7.html
> http://www.redhat.com/support/wpapers/redhat/ext3/tuning.html
>
> --
> All the best,
> Brian Pitts