[ale] ZFS on Linux

Michael B. Trausch mbt at naunetcorp.com
Mon Apr 1 16:26:32 EDT 2013


On 04/01/2013 03:30 PM, Jim Kinney wrote:
> yeah. that looks fun. So snapshots are a double-edged sword.

This is true for LVM, too, though LVM operates on a block basis rather
than a filesystem basis.  If you have a snapshot allocated and it fills
up (all of the storage within it is in use), it becomes inactive.  This
is even worse:  the source volume continues to function, but the snapshot
can no longer even be read, so its contents are simply lost.

Here is an example of what happens:

[mbt at aloe ~]$ sudo lvcreate -s --name lv_swap_snap01 --size 100M /dev/vg_aloe/lv_swap
  Rounding up size to full physical extent 128.00 MiB
  Logical volume "lv_swap_snap01" created

[mbt at aloe ~]$ sudo lvdisplay /dev/vg_aloe/lv_swap{,_snap01}
  --- Logical volume ---
  LV Path                /dev/vg_aloe/lv_swap
  LV Name                lv_swap
  VG Name                vg_aloe
  LV UUID                DuqvpX-s6zJ-3QW1-hbIp-Dmvd-PW0z-1faz0v
  LV Write Access        read/write
  LV Creation host, time aloe.naunetcorp.net, 2013-04-01 16:01:05 -0400
  LV snapshot status     source of
                         lv_swap_snap01 [active]
  LV Status              available
  # open                 0
  LV Size                4.00 GiB
  Current LE             128
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

  --- Logical volume ---
  LV Path                /dev/vg_aloe/lv_swap_snap01
  LV Name                lv_swap_snap01
  VG Name                vg_aloe
  LV UUID                Kn7xJn-2vIh-moWm-DkVF-kYgr-m72U-TRlXdf
  LV Write Access        read/write
  LV Creation host, time aloe.naunetcorp.net, 2013-04-01 16:03:09 -0400
  LV snapshot status     active destination for lv_swap
  LV Status              available
  # open                 0
  LV Size                4.00 GiB
  Current LE             128
  COW-table size         128.00 MiB
  COW-table LE           4
  Allocated to snapshot  0.00%
  Snapshot chunk size    4.00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3

Then I wrote zeros:

[mbt at aloe ~]$ sudo dd if=/dev/zero of=/dev/vg_aloe/lv_swap
dd: writing to ‘/dev/vg_aloe/lv_swap’: No space left on device
8388609+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 186.521 s, 23.0 MB/s


Now, after writing a whole bunch of zeros to the swap volume:

[mbt at aloe ~]$ sudo lvdisplay /dev/vg_aloe/lv_swap{,_snap01}
  /dev/vg_aloe/lv_swap_snap01: read failed after 0 of 4096 at 4294901760: Input/output error
  /dev/vg_aloe/lv_swap_snap01: read failed after 0 of 4096 at 4294959104: Input/output error
  /dev/vg_aloe/lv_swap_snap01: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg_aloe/lv_swap_snap01: read failed after 0 of 4096 at 4096: Input/output error
  --- Logical volume ---
  LV Path                /dev/vg_aloe/lv_swap
  LV Name                lv_swap
  VG Name                vg_aloe
  LV UUID                DuqvpX-s6zJ-3QW1-hbIp-Dmvd-PW0z-1faz0v
  LV Write Access        read/write
  LV Creation host, time aloe.naunetcorp.net, 2013-04-01 16:01:05 -0400
  LV snapshot status     source of
                         lv_swap_snap01 [INACTIVE]
  LV Status              available
  # open                 1
  LV Size                4.00 GiB
  Current LE             128
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

  --- Logical volume ---
  LV Path                /dev/vg_aloe/lv_swap_snap01
  LV Name                lv_swap_snap01
  VG Name                vg_aloe
  LV UUID                Kn7xJn-2vIh-moWm-DkVF-kYgr-m72U-TRlXdf
  LV Write Access        read/write
  LV Creation host, time aloe.naunetcorp.net, 2013-04-01 16:03:09 -0400
  LV snapshot status     INACTIVE destination for lv_swap
  LV Status              available
  # open                 0
  LV Size                4.00 GiB
  Current LE             128
  COW-table size         128.00 MiB
  COW-table LE           4
  Snapshot chunk size    4.00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3

The dd process finished, but the snapshot is useless:

[mbt at aloe ~]$ sudo xxd /dev/vg_aloe/lv_swap|head -n 5
0000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
[mbt at aloe ~]$ sudo xxd /dev/vg_aloe/lv_swap_snap01|head -n 5
xxd: Input/output error
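
To put numbers on the example above, here is a small sketch of the
arithmetic.  The helper name cow_chunks is made up for illustration; the
sizes are taken from the lvdisplay and dd output above.  It ignores the
small amount of space LVM itself uses for snapshot metadata, so the real
capacity is slightly lower than what it prints.

```shell
# cow_chunks BYTES CHUNK_BYTES
# Number of whole copy-on-write chunks that fit in BYTES.
# (Hypothetical helper; ignores snapshot metadata overhead.)
cow_chunks() {
    echo $(( $1 / $2 ))
}

cow_bytes=$(( 128 * 1024 * 1024 ))   # COW-table size: 128.00 MiB
chunk_bytes=$(( 4 * 1024 ))          # Snapshot chunk size: 4.00 KiB
written_bytes=4294967296             # bytes dd wrote to the origin

echo "chunks the snapshot can hold: $(cow_chunks "$cow_bytes" "$chunk_bytes")"
echo "chunks dd overwrote:          $(cow_chunks "$written_bytes" "$chunk_bytes")"
```

That prints 32768 and 1048576:  dd overwrote 32 times as many chunks as
the snapshot had room for, so the snapshot would have been invalidated
after roughly the first 128 MiB of writes.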

The behavior of btrfs and ZFS is therefore, at least in my opinion, far
more sane.  LVM's behavior means that the system won't grind to a halt,
but the behavior of btrfs and ZFS means that the system won't lose data.
I'll take data storage robustness over uptime for most applications any
day of the week.

On any system where I use snapshots, I ensure that the snapshots are
very short-lived, or that they have enough space to never run out.  With
LVM, the only assurance you have of that is allocating at least 100% of
the size of the original volume, which is somewhat unfortunate, as it is
very wasteful.
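
A middle ground between "short-lived" and "100% of the origin" is to
watch how full a snapshot is and grow it before it hits 100%.  What
follows is only a rough sketch under stated assumptions:  the function
name, the 80% threshold and the 128M increment are invented for
illustration, and the lvs/lvextend lines are left as comments because
they need root and a real snapshot to run against.

```shell
# snapshot_needs_space PERCENT_USED THRESHOLD
# Succeeds (exit 0) when usage is at or above THRESHOLD percent.
# PERCENT_USED may carry a fractional part, as lvs prints it.
snapshot_needs_space() {
    used=${1%%.*}                  # "73.41" -> "73"
    [ "${used:-0}" -ge "$2" ]      # empty input is treated as 0
}

# Hypothetical use, e.g. from a cron job (threshold and size are assumptions):
#   pct=$(lvs --noheadings -o snap_percent vg_aloe/lv_swap_snap01 | tr -d ' ')
#   if snapshot_needs_space "$pct" 80; then
#       lvextend -L +128M /dev/vg_aloe/lv_swap_snap01
#   fi
```

LVM can also do this on its own through dmeventd when
snapshot_autoextend_threshold and snapshot_autoextend_percent are set in
lvm.conf, though either approach only helps while the volume group still
has free extents to hand out.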

It would be nice if the behavior were something that could be configured.
I'm sure that there are people out there who would prefer to see LVM's
behavior on filesystems like btrfs and ZFS.

> The data deduplication is very useful until it's backup time. It looks
> like the backup will un-deduplicate and use full-size storage less
> backup compression abilities.

Yes, in order to efficiently back those filesystems up, you pretty much
need utilities that understand the filesystem very well.  I don't know
what utilities support things like that in zfs-land, but supposedly
btrfs exposes enough functionality that it is possible to do there.
I've not actually looked much further into that myself, though, as I'm
waiting for btrfs to become, well, usable.  :-)

	--- Mike

-- 
Michael B. Trausch, President
Naunet Corporation

Telephone: (678) 287-0693 x130
Toll-free: (888) 494-5810 x130
FAX: (678) 287-0693
