
Notes on Filesystem Layout
--------------------------

These notes describe what mkcramfs generates.  Kernel requirements are
a bit looser, e.g. it doesn't care if the <file_data> items are
swapped around (though it does care that directory entries (inodes) in
a given directory are contiguous, as this is used by readdir).

All data is currently in host-endian format; neither mkcramfs nor the
kernel ever does any swabbing.  (See the endianness discussion under
`Block Size' below.)

<filesystem>:
	<superblock>
	<directory_structure>
	<data>

<superblock>: struct cramfs_super (see cramfs_fs.h).

<directory_structure>:
	For each file:
		struct cramfs_inode (see cramfs_fs.h).
		Filename.  Not generally null-terminated, but it is
		 null-padded to a multiple of 4 bytes.
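
As an illustration of the padding rule above, the on-disk space taken
by a filename can be computed as below.  This is only a sketch; the
helper name cramfs_padded_namelen is made up for this example and does
not exist in the kernel or in mkcramfs.

```c
#include <stddef.h>

/*
 * Hypothetical helper: round a filename length up to the next
 * multiple of 4 bytes, as the null-padding rule requires.
 */
size_t cramfs_padded_namelen(size_t len)
{
	return (len + 3) & ~(size_t)3;
}
```

For example, a 5-byte name occupies 8 bytes on disk, while a 4-byte
name needs no padding at all (and so has no terminating NUL).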

The order of inode traversal is described as "width-first" (not to be
confused with breadth-first); i.e. like depth-first but listing all of
a directory's entries before recursing down its subdirectories: the
same order as `ls -AUR' (but without the /^\..*:$/ directory header
lines); put another way, the same order as `find -type d -exec
ls -AU1 {} \;'.
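
The "width-first" order can be sketched on a small in-memory tree as
follows.  The struct node type and the width_first function are
invented for this example (they are not mkcramfs code); a node with a
NULL children pointer is treated as a non-directory.

```c
#include <stddef.h>

/* Hypothetical in-memory tree node, for illustration only. */
struct node {
	const char *name;
	const struct node *children;	/* NULL for non-directories */
	size_t nchildren;
};

/*
 * Append the names below 'dir' to 'out' in width-first order: first
 * every entry of 'dir', then the contents of each subdirectory in
 * turn.  'n' is the number of names already in 'out'; the updated
 * count is returned.  The caller must make 'out' large enough.
 */
size_t width_first(const struct node *dir, const char **out, size_t n)
{
	size_t i;

	for (i = 0; i < dir->nchildren; i++)
		out[n++] = dir->children[i].name;
	for (i = 0; i < dir->nchildren; i++)
		if (dir->children[i].children)
			n = width_first(&dir->children[i], out, n);
	return n;
}
```

For a root containing directory `a' (which holds `c') and file `b',
this lists a, b, c, whereas a plain depth-first walk would list
a, c, b.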

Beginning in 2.4.7, directory entries are sorted.  This optimization
allows cramfs_lookup to return more quickly when a filename does not
exist, speeds up user-space directory sorts, etc.
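
Why sorting helps a failed lookup can be seen in the sketch below,
which scans a sorted list of names and stops as soon as it passes the
place where the wanted name would have been.  This is not the actual
cramfs_lookup code; sorted_dir_lookup and its use of strcmp ordering
are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical linear search over directory names sorted in
 * ascending strcmp() order.  Returns the index of 'want', or -1 if
 * it is absent.
 */
int sorted_dir_lookup(const char *const *names, size_t n, const char *want)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int cmp = strcmp(names[i], want);

		if (cmp == 0)
			return (int)i;
		if (cmp > 0)
			break;	/* every later name sorts after 'want' */
	}
	return -1;
}
```

Without the sort guarantee, the loop would have to visit every entry
before reporting that the name does not exist.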

<data>:
	One <file_data> for each file that's either a symlink or a
	 regular file of non-zero st_size.

<file_data>:
	nblocks * <block_pointer>
	 (where nblocks = (st_size - 1) / blksize + 1)
	nblocks * <block>
	padding to multiple of 4 bytes
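
The nblocks formula above is a ceiling division.  A minimal sketch,
with a made-up function name and an explicit guard for the zero-size
case (such files have no <file_data> at all):

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of <block>s (and <block_pointer>s)
 * needed for a file of st_size bytes.  Assumes blksize is non-zero.
 */
uint32_t cramfs_nblocks(uint32_t st_size, uint32_t blksize)
{
	/* Files of size zero have no <file_data>, hence no blocks. */
	return st_size ? (st_size - 1) / blksize + 1 : 0;
}
```

With blksize 4096, a 4096-byte file needs one block and a 4097-byte
file needs two.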

The i'th <block_pointer> for a file stores the byte offset of the
*end* of the i'th <block> (i.e. one past the last byte, which is the
same as the start of the (i+1)'th <block> if there is one).  The first
<block> immediately follows the last <block_pointer> for the file.
<block_pointer>s are each 32 bits long.
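
Because each pointer records an end offset, the start and length of
the i'th <block> follow from the previous pointer (or, for the first
block, from the end of the pointer table).  The sketch below assumes
plain pointers without any of the flag bits described further down;
struct extent and cramfs_block_extent are names invented here.

```c
#include <stdint.h>

/* Hypothetical result type: where a block lives in the image. */
struct extent {
	uint32_t start;	/* byte offset of the first byte of the block */
	uint32_t len;	/* number of bytes in the block */
};

/*
 * data_offset: byte offset of the file's first <block_pointer>.
 * ptrs:        the file's nblocks block pointers, in host order.
 * i:           index of the wanted block, 0 <= i < nblocks.
 */
struct extent cramfs_block_extent(uint32_t data_offset, const uint32_t *ptrs,
				  uint32_t nblocks, uint32_t i)
{
	struct extent e;

	/* Block 0 starts right after the table of 32-bit pointers. */
	e.start = i ? ptrs[i - 1] : data_offset + 4 * nblocks;
	e.len = ptrs[i] - e.start;
	return e;
}
```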

When the CRAMFS_FLAG_EXT_BLOCK_POINTERS capability bit is set, each
<block_pointer>'s top bits may contain special flags as follows:

CRAMFS_BLK_FLAG_UNCOMPRESSED (bit 31):
	The block data is not compressed and should be copied verbatim.

CRAMFS_BLK_FLAG_DIRECT_PTR (bit 30):
	The <block_pointer> stores the actual block start offset and not
	its end, shifted right by 2 bits. The block must therefore be
	aligned to a 4-byte boundary. The block size is blksize if
	CRAMFS_BLK_FLAG_UNCOMPRESSED is also specified; otherwise the
	compressed data length is stored in the first 2 bytes of the
	block data. This is used to allow discontiguous data layout
	and specific data block alignments e.g. for XIP applications.
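
Decoding an extended <block_pointer> might look like the sketch below.
The two flag values follow the bit numbers given above; the
CRAMFS_BLK_FLAGS mask, struct blkptr_info and cramfs_decode_blkptr are
defined locally for this example and should be checked against
cramfs_fs.h before being relied upon.

```c
#include <stdint.h>

#define CRAMFS_BLK_FLAG_UNCOMPRESSED	(UINT32_C(1) << 31)
#define CRAMFS_BLK_FLAG_DIRECT_PTR	(UINT32_C(1) << 30)
#define CRAMFS_BLK_FLAGS	(CRAMFS_BLK_FLAG_UNCOMPRESSED | \
				 CRAMFS_BLK_FLAG_DIRECT_PTR)

/* Hypothetical decoded form of one block pointer. */
struct blkptr_info {
	int uncompressed;	/* block data is stored verbatim */
	int direct;		/* offset is the block start, not its end */
	uint32_t offset;	/* byte offset with the flag bits removed */
};

struct blkptr_info cramfs_decode_blkptr(uint32_t ptr)
{
	struct blkptr_info info;

	info.uncompressed = (ptr & CRAMFS_BLK_FLAG_UNCOMPRESSED) != 0;
	info.direct = (ptr & CRAMFS_BLK_FLAG_DIRECT_PTR) != 0;
	info.offset = ptr & ~CRAMFS_BLK_FLAGS;
	if (info.direct)
		info.offset <<= 2;	/* stored shifted right by 2 bits */
	return info;
}
```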


The order of <file_data>'s is a depth-first descent of the directory
tree, i.e. the same order as `find -size +0 \( -type f -o -type l \)
-print'.


<block>: The i'th <block> is the output of zlib's compress function
applied to the i'th blksize-sized chunk of the input data if the
corresponding CRAMFS_BLK_FLAG_UNCOMPRESSED <block_pointer> bit is not
set, otherwise it is the input data directly.
(For the last <block> of the file, the input may of course be smaller.)
Each <block> may be a different size.  (See <block_pointer> above.)

<block>s are merely byte-aligned, not generally u32-aligned.

When CRAMFS_BLK_FLAG_DIRECT_PTR is specified then the corresponding
<block> may be located anywhere and not necessarily contiguous with
the previous/next blocks. In that case it is minimally u32-aligned.
If CRAMFS_BLK_FLAG_UNCOMPRESSED is also specified then the size is always
blksize except for the last block which is limited by the file length.
If CRAMFS_BLK_FLAG_DIRECT_PTR is set and CRAMFS_BLK_FLAG_UNCOMPRESSED
is not set then the first 2 bytes of the block contain the size of the
remaining block data as this cannot be determined from the placement of
logically adjacent blocks.
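
The two size rules for direct-pointer blocks can be put side by side
as in the sketch below.  direct_block_len is a made-up name, and the
2-byte length is read in host byte order on the assumption that it
follows the same host-endian convention as the rest of the image.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: length of the data belonging to block 'i' of
 * a file when the block is addressed through a direct pointer.
 * 'block' points at the first byte of the block in the image.  For
 * a compressed block the returned length excludes the 2-byte prefix,
 * so the compressed data itself starts at block + 2.
 */
uint32_t direct_block_len(const unsigned char *block, int uncompressed,
			  uint32_t st_size, uint32_t blksize, uint32_t i)
{
	if (uncompressed) {
		/* Always blksize, except for a short last block. */
		uint32_t left = st_size - i * blksize;

		return left < blksize ? left : blksize;
	} else {
		uint16_t len;

		/* memcpy avoids an unaligned 16-bit load. */
		memcpy(&len, block, sizeof(len));
		return len;
	}
}
```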


Holes
-----

This kernel supports cramfs holes (i.e. [efficient representation of]
blocks in uncompressed data consisting entirely of NUL bytes), but by
default mkcramfs doesn't test for & create holes, since cramfs in
kernels up to at least 2.3.39 didn't support holes.  Run mkcramfs
with -z if you want it to create files that can have holes in them.
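
The test a writer has to make for a hole is simply whether a block of
input is all NUL bytes.  A minimal sketch follows; block_is_hole is a
name invented here.  As far as I understand the format, such a block
is then stored with zero length, i.e. its <block_pointer> equals the
previous end offset, but treat that as an assumption rather than a
statement of what mkcramfs does.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the 'len' bytes at 'buf' are all
 * NUL (so the block could be represented as a hole), else 0.
 */
int block_is_hole(const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}
```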


Tools
-----

The cramfs user-space tools, including mkcramfs and cramfsck, are
located at <http://sourceforge.net/projects/cramfs/>.


Future Development
==================

Block Size
----------

(Block size in cramfs refers to the size of input data that is
compressed at a time.  It's intended to be somewhere around
PAGE_SIZE for cramfs_read_folio's convenience.)

The superblock ought to indicate the block size that the fs was
written for, since comments in <linux/pagemap.h> indicate that
PAGE_SIZE may grow in future (if I interpret the comment
correctly).

Currently, mkcramfs #define's PAGE_SIZE as 4096 and uses that
for blksize, whereas Linux-2.3.39 uses the kernel's own PAGE_SIZE
(which can be as large as 32KB on arm).
This discrepancy is a bug, though it's not clear which should be
changed.

One option is to change mkcramfs to take its PAGE_SIZE from
<asm/page.h>.  Personally I don't like this option, but it does
require the least amount of change: just change `#define
PAGE_SIZE (4096)' to `#include <asm/page.h>'.  The disadvantage
is that the generated cramfs cannot always be shared between different
kernels, not even necessarily kernels of the same architecture if
PAGE_SIZE is subject to change between kernel versions
(currently possible with arm and ia64).

The remaining options try to make cramfs more sharable.

One part of that is addressing endianness.  The two options here are
`always use little-endian' (like ext2fs) or `writer chooses
endianness; kernel adapts at runtime'.  Little-endian wins because of
code simplicity and little CPU overhead even on big-endian machines.

The cost of swabbing is changing the code to use the le32_to_cpu
etc. macros as used by ext2fs.  We don't need to swab the compressed
data, only the superblock, inodes and block pointers.
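
To show how little work the little-endian option costs, here is a
byte-wise 32-bit load that gives the same result on hosts of either
endianness.  It is a user-space sketch with a made-up name (get_le32),
not the kernel's le32_to_cpu macro.

```c
#include <stdint.h>

/*
 * Hypothetical helper: assemble a 32-bit value from 4 bytes stored
 * least-significant byte first, regardless of host byte order.
 */
uint32_t get_le32(const unsigned char *p)
{
	return (uint32_t)p[0] |
	       (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 |
	       (uint32_t)p[3] << 24;
}
```

On a little-endian host a compiler can typically reduce this to a
plain load, which is where the "little CPU overhead" comes from.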


The other part of making cramfs more sharable is choosing a block
size.  The options are:

  1. Always 4096 bytes.

  2. Writer chooses blocksize; kernel adapts but rejects blocksize >
     PAGE_SIZE.

  3. Writer chooses blocksize; kernel adapts even to blocksize >
     PAGE_SIZE.

It's easy enough to change the kernel to use a smaller value than
PAGE_SIZE: just make cramfs_read_folio read multiple blocks.
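
"Read multiple blocks" amounts to the index arithmetic sketched below
for a blksize smaller than PAGE_SIZE.  The names are invented for the
example, and it assumes blksize divides the page size exactly.

```c
#include <stdint.h>

/* Hypothetical result type: the blocks that back one page. */
struct blk_range {
	uint32_t first;	/* index of the first block backing the page */
	uint32_t count;	/* number of blocks to read and decompress */
};

/* Assumes 0 < blksize <= page_size and blksize divides page_size. */
struct blk_range blocks_for_page(uint32_t page_index, uint32_t page_size,
				 uint32_t blksize, uint32_t nblocks)
{
	uint32_t per_page = page_size / blksize;
	struct blk_range r;

	r.first = page_index * per_page;
	if (r.first >= nblocks)
		r.count = 0;			/* page lies past end of file */
	else if (nblocks - r.first < per_page)
		r.count = nblocks - r.first;	/* short final page */
	else
		r.count = per_page;
	return r;
}
```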

The cost of option 1 is that kernels with a larger PAGE_SIZE
value don't get as good compression as they can.

The cost of option 2 relative to option 1 is that the code uses
variables instead of #define'd constants.  The gain is that people
with kernels having larger PAGE_SIZE can make use of that if
they don't mind their cramfs being inaccessible to kernels with
smaller PAGE_SIZE values.

Option 3 is easy to implement if we don't mind being CPU-inefficient:
e.g. get read_folio to decompress to a buffer of size MAX_BLKSIZE (which
must be no larger than 32KB) and discard what it doesn't need.
Getting read_folio to read into all the covered pages is harder.
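
The CPU-inefficient variant of option 3 only needs to know which block
holds a page and where in the decompressed block the page starts; the
rest of the buffer is thrown away.  A sketch of that arithmetic, with
invented names and assuming blksize is a multiple of the page size:

```c
#include <stdint.h>

/* Hypothetical result type: where a page sits inside its block. */
struct page_slice {
	uint32_t block;		/* index of the block to decompress */
	uint32_t offset;	/* byte offset of the page in that block */
};

/* Assumes blksize is a non-zero multiple of page_size. */
struct page_slice slice_for_page(uint32_t page_index, uint32_t page_size,
				 uint32_t blksize)
{
	uint64_t pos = (uint64_t)page_index * page_size;
	struct page_slice s;

	s.block = (uint32_t)(pos / blksize);
	s.offset = (uint32_t)(pos % blksize);
	return s;
}
```

With 4096-byte pages and a 16384-byte blksize, each block is
decompressed up to four times (once per page it covers), which is the
inefficiency referred to above.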

The main advantage of option 3 over options 1 and 2 is better
compression.  The cost is greater complexity.  Probably not worth it,
but I hope someone will disagree.  (If it is implemented, then I'll
re-use that code in e2compr.)


Another cost of 2 and 3 over 1 is making mkcramfs use a different
block size, but that just means adding and parsing a -b option.


Inode Size
----------

Given that cramfs will probably be used for CDs etc. as well as just
silicon ROMs, it might make sense to expand the inode a little from
its current 12 bytes.  Inodes other than the root inode are followed
by filename, so the expansion doesn't even have to be a multiple of 4
bytes.
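
For reference, the 12 bytes can be pictured as three 32-bit words of
bit-fields, as in the sketch below.  The field names and widths are
written from memory of cramfs_fs.h and should be verified against the
header; the struct and function names are made up here, and bit-field
layout is compiler-dependent, so this is an illustration rather than a
portable on-disk definition.

```c
#include <stddef.h>

/*
 * Illustrative copy of the inode layout (assumed widths, see above).
 * Each line packs into one 32-bit word on common compilers.
 */
struct cramfs_inode_sketch {
	unsigned int mode:16, uid:16;
	unsigned int size:24, gid:8;
	unsigned int namelen:6, offset:26;
};

/* Hypothetical helper: size in bytes of the sketched inode. */
size_t cramfs_inode_sketch_size(void)
{
	return sizeof(struct cramfs_inode_sketch);
}
```

Widening any one field means spilling into a fourth word, which is why
the text above talks about expanding the inode rather than reshuffling
it.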