Linux (like most other operating systems) provides a transparent layer of caching for auxiliary storage (hard drives, etc.). This layer allows fast access to frequently used files by keeping their contents in memory and reading from there whenever possible.

The kernel can use any free space in RAM as page cache. If the system requires more memory, the kernel might free space used by this cache and provide it to the application that needs it.

Inspecting page cache

The page cache lives in RAM, and the kernel can reclaim that space for applications whenever necessary. On Linux we can see how much RAM is currently being used for page cache:

~ $ free -h
              total        used        free      shared  buff/cache   available
Mem:            19G        5.4G        7.8G        1.5G        6.2G         12G
Swap:           19G          0B         19G
  • Total: The total amount of RAM in the system
  • Used: Memory in use, calculated as total - free - buffers - cache
  • Free: Memory not being used for anything
  • Shared: Memory used (mostly) by tmpfs
  • Buff/cache: Memory used by kernel buffers and the page cache
  • Available: An estimate of how much memory is available for starting new applications: all the free space, plus the portion of buff/cache that can be reclaimed
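If we need these numbers programmatically, we can read them straight from /proc/meminfo, which is where free gets its data. A small C sketch (not part of the original session; it simply prints the Buffers and Cached lines that free adds together into its buff/cache column):

#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        /* "Buffers" is block-device metadata; "Cached" is the page cache */
        if (strncmp(line, "Buffers:", 8) == 0 || strncmp(line, "Cached:", 7) == 0)
            fputs(line, stdout);
    }

    fclose(f);
    return 0;
}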

Reading

When we read a file from auxiliary storage, its contents are stored in the page cache, so if we read the same file again, the data comes from memory and the read is much faster. We can see that after reading a file from disk our buff/cache value increases, but subsequent reads of the same file don't increase it any further:

~ $ free -h
              total        used        free      shared  buff/cache   available
Mem:            19G        5.3G         11G        1.5G        2.6G         12G
Swap:           19G          0B         19G
~ $ cat big.txt > /dev/null
~ $ free -h
              total        used        free      shared  buff/cache   available
Mem:            19G        5.4G         11G        1.5G        2.8G         12G
Swap:           19G          0B         19G
~ $ cat big.txt > /dev/null
~ $ free -h
              total        used        free      shared  buff/cache   available
Mem:            19G        5.4G         11G        1.5G        2.8G         12G
Swap:           19G          0B         19G
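We can also measure the effect directly with a small C program that reads the same file twice and times each pass (a rough sketch; big.txt is the file from the session above, and the first read is only a cold read if the file isn't already cached):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Read the whole file, discarding the data, and return the elapsed seconds */
static double read_all(const char *path) {
    char buf[1 << 16];
    struct timespec t0, t1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (read(fd, buf, sizeof(buf)) > 0)
        ; /* we only care about the timing, not the contents */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("first read:  %.3f s\n", read_all("big.txt"));
    printf("second read: %.3f s\n", read_all("big.txt")); /* served from cache */
    return 0;
}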

Writing

The page cache is really useful for making reads faster, but it also changes the way writes work. When we perform a write, the data doesn't go directly to the auxiliary device. Instead, the kernel first modifies the affected pages in the cache and marks them as dirty. We can see how much memory the system currently has marked as dirty:

~ $ cat /proc/meminfo | grep Dirty
Dirty:               248 kB

This memory is flushed to the disk periodically by the kernel. Parts of the page cache that are marked as dirty can’t be reclaimed by other applications until they are flushed to disk.

There are a few settings that control when the dirty cache is written to disk:

~ $ sysctl -a | grep dirty
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
  • vm.dirty_background_ratio – The percentage of system memory that can be filled with dirty pages before the kernel starts writing them to disk in the background
  • vm.dirty_ratio – The maximum percentage of system memory that can be filled with dirty pages. When this threshold is reached, writes block until dirty pages have been committed to disk
  • vm.dirty_background_bytes – Same as vm.dirty_background_ratio, but expressed in bytes. Only one of the two can be set
  • vm.dirty_bytes – Same as vm.dirty_ratio, but expressed in bytes. Only one of the two can be set
  • vm.dirty_expire_centisecs – Pages that have been dirty for longer than this time (in centiseconds) will be written to disk on the next writeback pass
  • vm.dirty_writeback_centisecs – How often (in centiseconds) the kernel wakes up to check if there is data to be flushed

On my system, a thread wakes up every 5 seconds (500 centiseconds) and writes to disk any pages that have been dirty for more than 30 seconds (3,000 centiseconds).

Having a page marked as dirty means it hasn't been written to disk yet, so in the event of a system crash that data is lost. This behavior is scary and unacceptable for some applications (e.g. databases). For scenarios where data loss is not an option, there are system calls that force the kernel to write to disk.

If you were writing a database, you would want to guarantee your users that once a record has been inserted successfully, it will be there forever (even if the system crashes). For this we can use fsync. Given a file descriptor, fsync flushes any dirty pages associated with that file. If the call returns 0, we can be sure that the data has been persisted to disk.
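A minimal C sketch of that write-then-fsync pattern (the file name records.db and the record contents are made up for this example):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *record = "id=42,name=example\n"; /* hypothetical record */

    int fd = open("records.db", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* The write lands in the page cache and the pages are marked dirty */
    if (write(fd, record, strlen(record)) < 0) { perror("write"); return 1; }

    /* fsync blocks until the dirty pages for this file reach the disk;
     * only when it returns 0 do we know the record has been persisted */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }

    close(fd);
    puts("record persisted");
    return 0;
}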

Because calling fsync after every write is not very elegant, there is also the option of opening the file with the O_SYNC flag. This guarantees that every write on that file is performed synchronously: the call doesn't return until the data has reached the disk.
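The same guarantee using O_SYNC instead (again a sketch, reusing the hypothetical records.db):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* With O_SYNC every write() blocks until the data is on disk,
     * so no separate fsync call is needed */
    int fd = open("records.db", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *record = "id=43,name=example\n"; /* hypothetical record */
    if (write(fd, record, strlen(record)) < 0) { perror("write"); return 1; }

    close(fd);
    return 0;
}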
