btrfs   723


Battle testing data integrity verification with ZFS, Btrfs and mdadm+dm-integrity
In this article I share the results of a home-lab experiment in which I threw some different problems at ZFS, Btrfs and mdadm+dm-integrity in a RAID-5 setup.
zfs  btrfs  linux  storage  myths  bsdnow  unix  bsd  raid  filesystem  choices  research  thanks 
6 days ago by xer0x
bertbaron/btrdedup: BTRFS Deduplication tool
BTRFS Deduplication tool
========================

A deduplication tool similar to [bedup](https://github.com/g2p/bedup).
I wrote it quite some time ago because bedup had problems with my
volume and its large number of snapshots (crashes, database
corruption, etc.).

Btrdedup uses far fewer resources, especially when there are many
snapshots. Its limitation is that it only deduplicates files that
start with the same content. Because it inspects the fragmentation
of each file before offering it to the kernel for deduplication
(using the btrfs deduplication ioctl), data that is already shared
is not deduplicated again.

Btrdedup does not maintain state between runs, which makes it less
suitable for incremental deduplication. On the other hand, this
makes the tool very robust, and because it is efficient at
detecting already-deduplicated files it can easily be scheduled to
run, for example, once a month.

Installation
============

Download the latest release:

[![release](http://github-release-version.herokuapp.com/github/bertbaron/btrdedup/release.svg)](https://github.com/bertbaron/btrdedup/releases/latest)

Make executable using: `chmod +x btrdedup`

Usage
=====

Typically you want to run the program as root on the complete
mounted btrfs pool with a command like this:

``` {.shell}
nice -n 10 ./btrdedup /mnt 2>dedup.log
```

or

``` {.shell}
nice -n 10 ./btrdedup /mnt >dedup.out 2>dedup.log &
```

The scanning phase may still take a long time, depending on the
number of files. However, the most expensive part, the
deduplication itself, is only performed when necessary.

Btrdedup is very memory efficient and doesn't require a
database. It can be instructed to use even less memory by
providing the `-lowmem` option. This may add a few more
minutes, but it may also be faster because of reduced memory
management. Future versions might default to this option.

Use `btrdedup -h` for the full list of options.

Under the hood
==============

Btrdedup works by first reading the file tree(s) in memory in an
efficient data structure. It then processes these files in three
passes:

- Pass 1: Read the fragmentation table for each file.
  Sort the result on the offset of the first block.

- Pass 2: Calculate the hash of the first block of each file.
  Because the files are sorted on the first-block offset, any
  block is only loaded and hashed once.
  Sort the result on the hash of the first block.

- Pass 3: Files that have the first block in common are offered
  for deduplication. The deduplication phase first checks
  whether blocks are already shared, so that data is only offered
  for actual deduplication when necessary.

In lowmem mode, the output of each pass is written to an encoded
temporary text file, which is then sorted using the system's
`sort` tool.
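
The same flow can be sketched in a few lines of Python. This is
only an illustration of the idea under assumed names and defaults
(a 128 KiB first block, SHA-256 hashes, printing candidate groups
instead of calling the kernel); the real tool is written in Go and
drives the btrfs deduplication ioctl (FIDEDUPERANGE) directly.

``` {.python}
#!/usr/bin/env python3
"""Minimal sketch of btrdedup's three passes; not the real (Go) implementation."""
import hashlib
import os
import sys

BLOCK = 128 * 1024  # assumed size of the "first block" used for grouping

def first_block_offset(path):
    # Pass 1 stand-in: the real tool reads the fragmentation (extent) table;
    # the inode number is used here only as a cheap proxy for a disk-friendly
    # visiting order.
    return os.stat(path).st_ino

def first_block_hash(path):
    # Pass 2: hash only the first block of each file.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(BLOCK)).digest()

def regular_files(root):
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield path

def main(root):
    files = sorted(regular_files(root), key=first_block_offset)   # pass 1
    hashed = sorted((first_block_hash(p), p) for p in files)      # pass 2
    # Pass 3: files sharing the same first-block hash form a candidate group.
    # The real tool hands each group to the kernel dedup ioctl; printing the
    # groups stands in for that step here.
    i = 0
    while i < len(hashed):
        j = i
        while j < len(hashed) and hashed[j][0] == hashed[i][0]:
            j += 1
        if j - i > 1:
            print("dedup candidates:", [p for _, p in hashed[i:j]])
        i = j

if __name__ == "__main__":
    main(sys.argv[1])
```

In lowmem mode the intermediate lists would be spilled to temporary
files and sorted with the external `sort` tool instead of being
kept in memory.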

Future improvements
===================

The last pass still needs some improvements. Currently, files with
the same hash code for the first block are assumed to be equal up
to the size of the smallest file. In the future the blocks should
be checked more thoroughly for duplicates, by comparing the hash
codes of all blocks.
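
One way the check could be tightened is to walk both files and
compare per-block hashes, deduplicating only the prefix that
actually matches. A hypothetical helper (not part of btrdedup)
sketching that in Python:

``` {.python}
import hashlib

def matching_prefix_blocks(path_a, path_b, block=128 * 1024):
    """Count how many leading blocks of two files have identical hashes."""
    count = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a, chunk_b = a.read(block), b.read(block)
            if not chunk_a or not chunk_b:
                return count  # one file ended; only the common prefix matches
            if hashlib.sha256(chunk_a).digest() != hashlib.sha256(chunk_b).digest():
                return count
            count += 1
```
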
btrfs  deduplication  language:go  linux  tools 
17 days ago by thedward
markfasheh/duperemove: Tools for deduping file systems
This README is for duperemove v0.11.

Duperemove
==========

Duperemove is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it
will hash their contents on a block by block basis and compare
those hashes to each other, finding and categorizing blocks that
match each other. When given the -d option, duperemove will
submit those extents for deduplication using the Linux kernel
extent-same ioctl.

Duperemove can store the hashes it computes in a 'hashfile'. If
given an existing hashfile, duperemove will only compute hashes
for those files which have changed since the last run. Thus you
can run duperemove repeatedly on your data as it changes, without
having to re-checksum unchanged data.
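
As a rough illustration of how such an incremental scan can skip
unchanged files, here is a hypothetical Python sketch. It is not
duperemove's code: the real hashfile is a sqlite3 database with
its own internal layout and block-by-block hashes, while this
sketch stores one whole-file hash per path just to show the
skip-if-unchanged logic.

``` {.python}
#!/usr/bin/env python3
"""Hypothetical sketch of hashfile-style incremental re-hashing."""
import hashlib
import os
import sqlite3
import sys

def file_hash(path, block=128 * 1024):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(block):
            h.update(chunk)
    return h.hexdigest()

def scan(root, hashfile):
    db = sqlite3.connect(hashfile)
    db.execute("CREATE TABLE IF NOT EXISTS files "
               "(path TEXT PRIMARY KEY, mtime REAL, hash TEXT)")
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path) or os.path.islink(path):
                continue
            mtime = os.stat(path).st_mtime
            row = db.execute("SELECT mtime FROM files WHERE path = ?",
                             (path,)).fetchone()
            if row and row[0] == mtime:
                continue  # unchanged since the last run: skip re-checksumming
            db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                       (path, mtime, file_hash(path)))
    db.commit()
    db.close()

if __name__ == "__main__":
    scan(sys.argv[1], sys.argv[2])
```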

Duperemove can also take input from the
[fdupes](https://github.com/adrianlopezroche/fdupes) program.

See [the duperemove man
page](http://markfasheh.github.io/duperemove/duperemove.html) for
further details about running duperemove.

Requirements
============

The latest stable code (v0.11) can be found in [the v0.11 branch
on
github](https://github.com/markfasheh/duperemove/tree/v0.11-branch).

Kernel: Duperemove needs a kernel version equal to or greater
than 3.13

Libraries: Duperemove uses glib2 and sqlite3.

FAQ
===

Please see the FAQ section in [the duperemove man
page](http://markfasheh.github.io/duperemove/duperemove.html#10)

For bug reports and feature requests please use [the github issue
tracker](https://github.com/markfasheh/duperemove/issues)

Examples
========

Please see the examples section of [the duperemove man
page](http://markfasheh.github.io/duperemove/duperemove.html#7)
for a complete set of usage examples, including hashfile usage.

A simple example, with program output
-------------------------------------

Duperemove takes a list of files and directories to scan for
dedupe. If a directory is specified, all regular files within it
will be scanned. Duperemove can also be told to recursively scan
directories with the '-r' switch. If '-h' is provided, duperemove
will print numbers in powers of 1024 (e.g., "128K").

Assume this arbitrary layout for the following examples.

.
├── dir1
│   ├── file3
│   ├── file4
│   └── subdir1
│       └── file5
├── file1
└── file2

This will dedupe files 'file1' and 'file2':

duperemove -dh file1 file2

This does the same but adds any files in dir1 (file3 and file4):

duperemove -dh file1 file2 dir1

This will dedupe exactly the same as above but will recursively
walk dir1, thus adding file5.

duperemove -dhr file1 file2 dir1/

An actual run; the output will differ according to the duperemove
version.

Using 128K blocks
Using hash: murmur3
Using 4 threads for file hashing phase
csum: /btrfs/file1 [1/5] (20.00%)
csum: /btrfs/file2 [2/5] (40.00%)
csum: /btrfs/dir1/subdir1/file5 [3/5] (60.00%)
csum: /btrfs/dir1/file3 [4/5] (80.00%)
csum: /btrfs/dir1/file4 [5/5] (100.00%)
Total files: 5
Total hashes: 80
Loading only duplicated hashes from hashfile.
Hashing completed. Calculating duplicate extents - this may take some time.
Simple read and compare of file data found 3 instances of extents that might benefit from deduplication.
Showing 2 identical extents of length 512.0K with id 0971ffa6
Start Filename
512.0K "/btrfs/file1"
1.5M "/btrfs/dir1/file4"
Showing 2 identical extents of length 1.0M with id b34ffe8f
Start Filename
0.0 "/btrfs/dir1/file4"
0.0 "/btrfs/dir1/file3"
Showing 3 identical extents of length 1.5M with id f913dceb
Start Filename
0.0 "/btrfs/file2"
0.0 "/btrfs/dir1/file3"
0.0 "/btrfs/dir1/subdir1/file5"
Using 4 threads for dedupe phase
[0x147f4a0] Try to dedupe extents with id 0971ffa6
[0x147f770] Try to dedupe extents with id b34ffe8f
[0x147f680] Try to dedupe extents with id f913dceb
[0x147f4a0] Dedupe 1 extents (id: 0971ffa6) with target: (512.0K, 512.0K), "/btrfs/file1"
[0x147f770] Dedupe 1 extents (id: b34ffe8f) with target: (0.0, 1.0M), "/btrfs/dir1/file4"
[0x147f680] Dedupe 2 extents (id: f913dceb) with target: (0.0, 1.5M), "/btrfs/file2"
Kernel processed data (excludes target files): 4.5M
Comparison of extent info shows a net change in shared extents of: 5.5M

Links of interest
=================

[The duperemove
wiki](https://github.com/markfasheh/duperemove/wiki) has both
design and performance documentation.

[duperemove-tests](https://github.com/markfasheh/duperemove-tests)
has a growing assortment of regression tests.

[Duperemove web page](http://markfasheh.github.io/duperemove/)
deduplication  btrfs  extent-same  xfs 
17 days ago by thedward
g2p/bedup: Btrfs deduplication
Deduplication for Btrfs.

bedup looks for new and changed files, making sure that multiple
copies of identical files share space on disk. It integrates
deeply with btrfs so that scans are incremental and low-impact.

Requirements
============

You need Python 3.3 or newer, and Linux 3.3 or newer. Linux 3.9.4
or newer is recommended, because it fixes a scanning bug and is
compatible with cross-volume deduplication.

This should get you started on Ubuntu 16.04:

sudo aptitude install python3-pip python3-dev python3-cffi libffi-dev build-essential git

This should get you started on earlier versions of Debian/Ubuntu:

sudo aptitude install python3-pip python3-dev libffi-dev build-essential git

This should get you started on Fedora:

yum install python3-pip python3-devel libffi-devel gcc git

Installation
============

On systems other than Ubuntu 16.04 you need to install CFFI:

pip3 install --user cffi

Option 1 (recommended): from a git clone
----------------------------------------

Enable submodules (this will pull headers from btrfs-progs)

git submodule update --init

Complete the installation. This will compile some code with CFFI
and pull the rest of our Python dependencies:

python3 setup.py install --user
cp -lt ~/bin ~/.local/bin/bedup

Option 2: from a PyPI release
-----------------------------

pip3 install --user bedup
cp -lt ~/bin ~/.local/bin/bedup

Running
=======

bedup --help
bedup <command> --help

On Debian and Fedora, you may need to use `sudo -E ~/bin/bedup`
or install cffi and bedup as root (bedup and its dependencies will
get installed to /usr/local).

You'll see a list of supported commands.

- **scan** scans volumes to keep track of potentially
duplicated files.
- **dedup** runs scan, then deduplicates identical files.
- **show** shows btrfs filesystems and their tracking status.
- **dedup-files** takes a list of identical files and
deduplicates them.
- **find-new** reimplements the `btrfs subvolume find-new`
command with a few extra options.

To deduplicate all filesystems:

sudo bedup dedup

Unmounted or read-only filesystems are excluded if they aren't
listed on the command line. Filesystems can be referenced by uuid
or by a path in /dev:

sudo bedup dedup /dev/disks/by-label/Btrfs

Giving a subvolume path also works, and will include subvolumes
by default.

Since cross-subvolume deduplication requires Linux 3.6, users of
older kernels should use the `--no-crossvol` flag.

Hacking
=======

pip3 install --user pytest tox ipdb https://github.com/jbalogh/check

To run the tests:

sudo python3 -m pytest -s bedup

To test compatibility and packaging as well:

GETROOT=/usr/bin/sudo tox

Run a style check on edited files:

check.py

Implementation
==============

Deduplication is implemented using a Btrfs feature that allows
for cloning data from one file to the other. The cloned ranges
become shared on disk, saving space.

File metadata isn't affected, and later changes to one file
won't affect the other (this is unlike hard-linking).

This approach doesn't require special kernel support, but it has
two downsides: locking has to be done in userspace, and there is
no way to free space within read-only (frozen) snapshots.
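
The Btrfs feature in question is the clone ioctl family
(BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE; newer kernels expose
the whole-file variant under the generic name FICLONE). As a
minimal illustration of a whole-file clone, independent of bedup's
own code and assuming the ioctl value used on common Linux ABIs:

``` {.python}
#!/usr/bin/env python3
"""Whole-file btrfs clone (reflink) illustration; not bedup's own code."""
import fcntl
import sys

# FICLONE == BTRFS_IOC_CLONE, i.e. _IOW(0x94, 9, int), on common Linux ABIs.
FICLONE = 0x40049409

def clone(src_path, dst_path):
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # The destination ends up sharing src's extents on disk instead of
        # holding a copy; later writes to either file are copied-on-write
        # and do not affect the other file.
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())

if __name__ == "__main__":
    clone(sys.argv[1], sys.argv[2])
```

From the shell, `cp --reflink=always` performs the same kind of
whole-file clone.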

Scanning
--------

Scanning is done incrementally; the technique is similar to
`btrfs subvolume find-new`. You need an up-to-date kernel (3.10,
3.9.4, 3.8.13.1, 3.6.11.5, 3.5.7.14, 3.4.47) to index all files;
earlier releases have a bug that causes find-new to end
prematurely. The fix can also be cherry-picked from [this
commit](https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/patch/?id=514b17caf165ec31d1f6b9d40c645aed55a0b721).

Locking
-------

Before cloning, we need to lock the files so that their contents
don't change from the time the data is compared to the time it
is cloned. Implementation note: this is done by setting the
immutable attribute on the file, scanning /proc to see whether any
processes still have write access to the file (via preexisting
file descriptors or memory mappings), and bailing out if the file
is in write use. If all is well, the comparison and cloning steps
can proceed. The immutable attribute is then reverted.
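
The immutable-attribute step can be illustrated with a short
Python sketch. This is not bedup's code, it omits the /proc scan
for writers, and the ioctl numbers are the values used on common
Linux ABIs (some architectures differ):

``` {.python}
#!/usr/bin/env python3
"""Sketch of locking a file via the immutable attribute; not bedup's code."""
import contextlib
import fcntl
import struct

# <linux/fs.h> values on common Linux ABIs; requires CAP_LINUX_IMMUTABLE.
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_IMMUTABLE_FL = 0x00000010

def _get_flags(fd):
    buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
    return struct.unpack("l", buf)[0]

def _set_flags(fd, flags):
    fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("l", flags))

@contextlib.contextmanager
def immutable(path):
    # Mark the file immutable while it is compared and cloned, then revert.
    # A full implementation would also check /proc/*/fd and /proc/*/maps for
    # existing writers, and must revert the flag even after a crash
    # (hence the `chattr -i` advice below).
    with open(path, "rb") as f:
        flags = _get_flags(f.fileno())
        _set_flags(f.fileno(), flags | FS_IMMUTABLE_FL)
        try:
            yield
        finally:
            _set_flags(f.fileno(), flags)
```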

This locking process might not be fool-proof in all cases; for
example, a malicious application might manage to bypass it, which
would allow it to change the contents of files it doesn't have
access to.

There is also a small time window when an application will get
permission errors, if it tries to get write access to a file we
have already started to deduplicate.

Finally, a system crash at the wrong time could leave some files
immutable. They will be reported at the next run; fix them using
the `chattr -i` command.

Subvolumes
----------

The clone call is considered a write operation and won't work on
read-only snapshots.

Before Linux 3.6, the clone call didn't work across subvolumes.

Defragmentation
---------------

Before Linux 3.9, defragmentation could break copy-on-write
sharing, which made it inadvisable when snapshots or
deduplication were used. Btrfs defragmentation has to be
explicitly requested (or background defragmentation enabled), so
this generally shouldn't be a problem for users who were unaware
of the feature.

Users of Linux 3.9 or newer can safely pass the `--defrag` option
to `bedup dedup`, which will defragment files before deduplicating
them.

Reporting bugs
==============

Be sure to mention the following:

- Linux kernel version: uname -rv
- Python version
- Distribution

And give some of the program output.

Build status
============

[![image](https://travis-ci.org/g2p/bedup.png)](https://travis-ci.org/g2p/bedup)
btrfs  deduplication  language:python  linux  incremental 
17 days ago by thedward
Linux RAM and Disk Hacking with ZRAM and BTRFS · naftuli.wtf
At a recent job, I faced a pretty bleak situation: my MacBook Pro had only 8 gigabytes of RAM and...
btrfs 
march 2019 by ianweatherhogg
How to Create and Manage Btrfs Snapshots and Rollbacks on Linux (part 2) | Linux.com | The source for Linux information
In "How to Manage Btrfs Storage Pools, Subvolumes And Snapshots on Linux (part 1)" we learned how to create a nice little Btrfs test lab, and how to create a Btrfs storage volume. Now we're going to learn how to make live snapshots whenever we want, and how to roll the filesystem back to any point to any arbitrary point in time. This does not replace backups.
btrfs  linux 
march 2019 by frailty
Install Arch Linux With Btrfs Snapshotting - Vultr.com
Deploy high performance SSD VPS on the worldwide Vultr network in 60 seconds. Sign up for free and start hosting virtual servers today!
btrfs  archlinux 
february 2019 by ianweatherhogg

