Let’s say that you have a program which relies on huge pages for performance. I couldn’t find a resource fully explaining how to allocate huge pages at runtime and verify that the allocation actually succeeded, so here it is.
High level steps (or skip to the code):
Make sure that transparent huge pages are enabled [1]:
% cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
Either madvise or always is what we want.
Run the program where you want to perform this check as root [2].
Allocate memory using aligned_alloc or posix_memalign, with a 2MiB alignment, which is the huge page size. Linux also supports 1GiB huge pages on some systems, but here we’ll be working with 2MiB pages [3][4]:
void* buf = aligned_alloc(1 << 21, size);
Instruct the kernel to back the buffer with huge pages using madvise:
madvise(buf, size, MADV_HUGEPAGE);
It is important to issue this call before the pages are allocated (next step). Also, this step is not needed if transparent huge pages are set to always.
For each 2MiB chunk in your buffer:
Allocate the page backing your buffer; setting the first byte of each page is enough:
memset(buf, 0, 1);
Get the page frame number (PFN) by reading /proc/self/pagemap.
See if the KPF_THP flag is set for the PFN retrieved above in /proc/kpageflags.
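The first step can also be checked from inside the program instead of by hand. Below is a minimal sketch, assuming the sysfs file has the `always [madvise] never` layout shown above, with the active mode in square brackets. The helper names `thp_active_mode` and `thp_usable` are made up for this example.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: copies the bracketed (active) mode out of a line such
// as "always [madvise] never" into `out`. Returns 0 on success, -1 otherwise.
int thp_active_mode(const char* line, char* out, size_t out_len) {
    const char* lb = strchr(line, '[');
    if (!lb) return -1;
    const char* rb = strchr(lb, ']');
    if (!rb) return -1;
    size_t len = (size_t)(rb - lb - 1);
    if (len + 1 > out_len) return -1;  // not enough room for mode + NUL
    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return 0;
}

// Hypothetical helper: returns 1 if the active mode is "madvise" or "always",
// 0 if it is anything else, and -1 if the file could not be read or parsed.
int thp_usable(void) {
    char line[128], mode[32];
    FILE* f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (!f) return -1;
    char* ok = fgets(line, sizeof(line), f);
    fclose(f);
    if (!ok || thp_active_mode(line, mode, sizeof(mode)) < 0) return -1;
    return strcmp(mode, "madvise") == 0 || strcmp(mode, "always") == 0;
}
```

Calling `thp_usable()` at startup and bailing out unless it returns 1 gives an early, readable error instead of a failed check later on.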
The gory details:
#include <errno.h>
#include <fcntl.h>
#include <linux/kernel-page-flags.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#define fail(...) do { fprintf(stderr, __VA_ARGS__); exit(EXIT_FAILURE); } while (0)
// normal page, 4KiB
#define PAGE_SIZE (1 << 12)
// huge page, 2MiB
#define HPAGE_SIZE (1 << 21)
// See <https://www.kernel.org/doc/Documentation/vm/pagemap.txt> for
// format which these bitmasks refer to
#define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0)
#define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1))
static void check_huge_page(void* ptr);
int main(void) {
    // allocate 10 huge pages
    size_t size = HPAGE_SIZE * 10;
    void* buf = aligned_alloc(HPAGE_SIZE, size);
    if (!buf) {
        fail("could not allocate buffer: %s\n", strerror(errno));
    }
    // ask for huge pages before the memory is first touched
    if (madvise(buf, size, MADV_HUGEPAGE) < 0) {
        fail("madvise failed: %s\n", strerror(errno));
    }
    // allocate and check each page; a char* cursor is used because
    // arithmetic on void* is a GNU extension
    char* end = (char*) buf + size;
    for (char* page = buf; page < end; page += HPAGE_SIZE) {
        // allocate page
        memset(page, 0, 1);
        // check the page is indeed huge
        check_huge_page(page);
    }
    printf("all good, exiting\n");
    return 0;
}
// Checks if the page pointed at by `ptr` is huge. Assumes that `ptr` has already
// been allocated.
static void check_huge_page(void* ptr) {
    int pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
    if (pagemap_fd < 0) {
        fail("could not open /proc/self/pagemap: %s\n", strerror(errno));
    }
    int kpageflags_fd = open("/proc/kpageflags", O_RDONLY);
    if (kpageflags_fd < 0) {
        fail("could not open /proc/kpageflags: %s\n", strerror(errno));
    }
    // each entry is 8 bytes long
    uint64_t ent;
    if (pread(pagemap_fd, &ent, sizeof(ent), ((uintptr_t) ptr) / PAGE_SIZE * 8) != sizeof(ent)) {
        fail("could not read from pagemap\n");
    }
    if (!PAGEMAP_PRESENT(ent)) {
        fail("page not present in /proc/self/pagemap, did you allocate it?\n");
    }
    if (!PAGEMAP_PFN(ent)) {
        fail("page frame number not present, run this program as root\n");
    }
    uint64_t flags;
    if (pread(kpageflags_fd, &flags, sizeof(flags), PAGEMAP_PFN(ent) << 3) != sizeof(flags)) {
        fail("could not read from kpageflags\n");
    }
    if (!(flags & (1ull << KPF_THP))) {
        fail("could not allocate huge page\n");
    }
    if (close(pagemap_fd) < 0) {
        fail("could not close /proc/self/pagemap: %s\n", strerror(errno));
    }
    if (close(kpageflags_fd) < 0) {
        fail("could not close /proc/kpageflags: %s\n", strerror(errno));
    }
}
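The offsets and bit masks used by check_huge_page can be exercised on their own, without root and without touching /proc. The sketch below spells out the same arithmetic as small pure functions. The struct and function names are made up for illustration, 4KiB base pages are assumed, and `SKETCH_KPF_THP` is a local copy of the value 22 that `<linux/kernel-page-flags.h>` gives to `KPF_THP`.

```c
#include <stdbool.h>
#include <stdint.h>

// Local stand-in for KPF_THP from <linux/kernel-page-flags.h> (assumed 22).
#define SKETCH_KPF_THP 22

// Decoded view of one 64-bit pagemap entry (hypothetical type).
struct pagemap_entry {
    bool present;  // bit 63: page is present in RAM
    bool swapped;  // bit 62: page is in swap
    uint64_t pfn;  // bits 0-54: page frame number, meaningful if present
};

struct pagemap_entry decode_pagemap_entry(uint64_t ent) {
    struct pagemap_entry e;
    e.present = (ent >> 63) & 1;
    e.swapped = (ent >> 62) & 1;
    e.pfn = ent & ((1ull << 55) - 1);
    return e;
}

// Byte offset in /proc/self/pagemap of the entry describing `vaddr`:
// one 8-byte entry per 4KiB virtual page.
uint64_t pagemap_offset(uintptr_t vaddr) {
    return (uint64_t)(vaddr >> 12) * 8;
}

// Byte offset in /proc/kpageflags of the flags word for `pfn`:
// one 8-byte word per physical page frame.
uint64_t kpageflags_offset(uint64_t pfn) {
    return pfn * 8;
}

// True if the flags word marks the frame as part of a transparent huge page.
bool kpageflags_is_thp(uint64_t flags) {
    return (flags >> SKETCH_KPF_THP) & 1;
}
```

For example, the virtual address 0x200000 is page number 512, so its pagemap entry starts at byte 4096 of the file.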
Some useful resources apart from what was already linked:
page-info, a small library by Travis Downs to get most of the information out of /proc/[pid]/pagemap and /proc/kpageflags.
transhuge-stress.c, a useful stress test for page tables found in the kernel tree.
[1] CONFIG_TRANSPARENT_HUGEPAGE needs to be enabled in the kernel config for things to work, but this has been the case for all the systems I’ve tried, and I didn’t bother checking what happens to /sys/kernel/mm/transparent_hugepage/enabled if it’s not enabled.
[2] Getting the page frame number (PFN) from /proc/self/pagemap requires the CAP_SYS_ADMIN capability, so it would be possible to read it as a normal user by issuing
% sudo setcap cap_sys_admin+ep <executable>
And then enabling dumping explicitly with
prctl(PR_SET_DUMPABLE, 1, 0, 0);
The “dumpable” flag regulates whether the /proc/[pid] files are owned by the user or by root, as described in the man page for /proc/[pid]:
The files inside each /proc/[pid] directory are normally owned by the effective user and effective group ID of the process. However, as a security measure, the ownership is made root:root if the process’s “dumpable” attribute is set to a value other than 1.
The dumpable flag is normally set, but if we set the capability as described above it is not, as described in this StackOverflow answer.
However, even after doing all this work, we still won’t be able to read from /proc/kpageflags, which is only readable by root 🙃.
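As a small illustration of the dumpable step, here is a sketch that reads the attribute with PR_GET_DUMPABLE and sets it back to 1 only when needed. The function name `ensure_dumpable` is made up for this example, and it does nothing about the /proc/kpageflags restriction mentioned above.

```c
#include <sys/prctl.h>

// Hypothetical helper: makes sure the "dumpable" attribute of the calling
// process is 1. Returns 0 on success, -1 if prctl fails.
int ensure_dumpable(void) {
    int cur = prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
    if (cur < 0) return -1;
    if (cur == 1) return 0;  // already dumpable, nothing to do
    return prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) == 0 ? 0 : -1;
}
```

This would be called once at startup, before opening /proc/self/pagemap.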
[3] The man page for madvise states (emphasis mine):
Enable Transparent Huge Pages (THP) for pages in the range specified by addr and length. Currently, Transparent Huge Pages work only with private anonymous pages (see mmap(2)). The kernel will regularly scan the areas marked as huge page candidates to replace them with huge pages. The kernel will also allocate huge pages directly when the region is naturally aligned to the huge page size (see posix_memalign(2)).
[4] Travis Downs pointed out that mmap might be a safer option, since aligned_alloc and friends might preemptively allocate pages.
Moreover, Paul Khuong provided a way to easily get a huge page aligned area using mmap.
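For reference, one common way to get a 2MiB-aligned region out of mmap is to map one extra huge page worth of address space, round the start up to the next 2MiB boundary, and unmap the slack on both sides. The sketch below shows that technique; it is not necessarily the approach referred to above, and the name `mmap_huge_aligned` is made up for this example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE  // for MAP_ANONYMOUS under strict -std= modes
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_SIZE ((size_t)1 << 21)

// Hypothetical helper: returns `size` bytes of private anonymous memory whose
// start is aligned to 2MiB, or NULL on failure. `size` is assumed to be a
// multiple of HPAGE_SIZE. Free the result with munmap(ptr, size).
void* mmap_huge_aligned(size_t size) {
    size_t padded = size + HPAGE_SIZE;
    char* raw = mmap(NULL, padded, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;

    // round the start address up to the next 2MiB boundary
    uintptr_t addr = (uintptr_t)raw;
    uintptr_t aligned = (addr + HPAGE_SIZE - 1) & ~((uintptr_t)HPAGE_SIZE - 1);

    // give back the unaligned head and the unused tail
    size_t head = (size_t)(aligned - addr);
    size_t tail = padded - head - size;
    if (head) munmap(raw, head);
    if (tail) munmap((char*)aligned + size, tail);

    return (void*)aligned;
}
```

Since mmap does not touch the pages, the returned region can then be passed to madvise with MADV_HUGEPAGE before the first write, exactly as in the steps above.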