Hexbyte  Hacker News  Computers Using /proc to get a process’ current stack trace

Hexbyte Hacker News Computers Using /proc to get a process’ current stack trace

Hexbyte Hacker News Computers

Hey,

The file covered in this article, /proc//stack, is the one that motivated me to learn more about /proc and get The Month of Proc.

It’s such a useful thing when you’re unaware of what is the state of a given process. Meanwhile, I’ve noticed that it’s not very well known by people getting started with Linux.

Hexbyte  Hacker News  Computers Illustration of how the process stack can be inspected

Here in this post, you’ll get to know more about how procfs can gather a process’ stack trace, as well as get an idea of its usefulness.

A process that blocks

To kick things off, let’s start with the tailoring of a process that blocks – a TCP server.

In its most simplistic form, we can have a single-threaded TCP server that just receives traffic in a given thread and then processes its results.

Hexbyte  Hacker News  Computers Illustration of TCP server blocking the main thread

Naturally, there are two places where we can see the server blocking: at the accept(2) phase, of the read(2) phase – the first blocks until a client finishes the TCP handshake; the second, until data is available for read.

Here’s how a simplistic implementation in C would look like considering just the first blocking part (accepting):

/**
 * server_listen takes care of setting up the
 * passive socket that receives incoming TCP
 * connections.
 *
 * Returns the file descriptor corresponding to
 * the passive socket, of -1 in the case of
 * error.
 *
 * See https://ops.tips/blog/a-tcp-server-in-c/#creating-a-socket
 * for a reference implementation.
 */
int
server_listen();

/**
 * server_accept_and_close accepts connections that
 * finish their TCP handshake through the provided
 * @listen_fd argument.
 *
 * Once the connection is `accept`ed, it gets closed
 * immediately.
 */
int
server_accept_and_close(int listen_fd)
{
	int                conn_fd;
	int                err;
	socklen_t          client_len;
	struct sockaddr_in client_addr;

	client_len = sizeof(client_addr);

	printf("Accepting connections ...n");

	// Accept connections from the completed
	// connection queue (pops the latest conn
	// that succesfully finished the 3-way
	// handshake).
	//
	// Given that the queue might be empty, it
	// waits (i.e., blocks) until there's a connection
	// there.
	err = (conn_fd = accept(
	         listen_fd, (struct sockaddr*)&client_addr, &client_len));
	if (err == -1) {
		perror("accept");
		return err;
	}

	printf("Client connected! Going to close now.n");

	// Mark the connection as closed so that we
	// can get back to accepting new connections.
	err = close(conn_fd);
	if (err == -1) {
		perror("close");
		return err;
	}

	return 0;
}

/**
 * main execution - creates a passive socket
 * bound to a specific port that keeps receiving
 * incoming connections, accepting them and then
 * marking them as closed immediately.
 */
int
main(int argc, char** argv)
{
	int err;
	int listen_fd = server_listen();

	if (listen_fd == -1)
		return 1;

	for (;;) {
		err = server_accept_and_close(listen_fd);
		if (err == -1)
			return 1;
	}

	return 0;
}

If you’re not into how a TCP server can be implemented in C, make sure you check A TCP Server in C for a complete description of it.

Run the server, and then see it blocking!

Viewing the process kernel stack trace

Once the server is blocked, we can jump into /proc and check out what’s going on in the Kernel and figure out that it’s blocked on the accept syscall:

# Gather the in-kernel stack trace of
# the `accept.out` process (the simplistic
# TCP server that we're running).
cat /proc/$(pidof accept.out)/stack
[<0>] inet_csk_accept+0x246/0x380
[<0>] inet_accept+0x45/0x170
[<0>] SYSC_accept4+0xff/0x210
[<0>] SyS_accept+0x10/0x20
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff

Although it might look like a weird stack trace at first, the structure is very straightforward to reason about.

Each line represents a function that was called (from looking at the stack calls), having the first part, that [<0>] thing, being the kernel address of the function, while the second, that do_syscall_64+... being the symbol name with the corresponding offset.

If [<0>] looks weird (like, not an actual address at all), it’s because it’s intended to be like that.

When fs/proc/base.c#proc_pid_stack (the method that gets called by the /proc implementation of the virtual filesystem) iterates over the stack frames, we can see that it hardcodes [<0>] as the address to be displayed:

/**
 * Iterates through the kernel stack frames of
 * the current task, displaying the kernel 
 * addresses of each function, as well as
 * their symbol name and offset.
 */
static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
			  struct pid *pid, struct task_struct *task)
{
	struct stack_trace trace;
	unsigned long *entries;
	int err;
	int i;


        // Allocate some memory so that we can have
        // in our execution the whole stack of the
        // process (up to a given depth).
        //
	// The first argument to the kmalloc
	// is the size of the block to be allocated.
        //
	// The second argument is allocation flag.
        //
	// GFP_KERNEL means that allocation is performed on
	// behalf of a process running in the kernel space.
	//
	// This means that the calling function is executing
	// a system call on behalf of a process.
	//
	// Using GFP_KERNEL means that kmalloc can put the
	// current process to sleep waiting for a page when
	// called in low-memory situations.
        // 
	// See https://elixir.bootlin.com/linux/v4.15/source/include/linux/gfp.h#L219
	entries = kmalloc(MAX_STACK_TRACE_DEPTH * sizeof(*entries), GFP_KERNEL);
	if (!entries)
		return -ENOMEM;


        // With the space properly allocated, we can now
        // prepare the `stack_trace` struct and pass
        // it down to the function that will get that
        // for us.
	trace.nr_entries	= 0;
	trace.max_entries	= MAX_STACK_TRACE_DEPTH;
	trace.entries		= entries;
	trace.skip		= 0;


        // Make sure that we have mutual exclusion in
        // place.
	err = lock_trace(task);
	if (!err) {
                // Capture the stack trace!
		// https://elixir.bootlin.com/linux/v4.15/source/arch/x86/kernel/stacktrace.c#L69
		save_stack_trace_tsk(task, &trace);

                // Iterate over each frame captured.
                //
                // *************************
                // HERE is where the `[<0>]` is hardcoded.
                // *************************
		for (i = 0; i < trace.nr_entries; i++) {
			seq_printf(m, "[<0>] %pBn", (void *)entries[i]);
		}
		unlock_trace(task);
	}
	kfree(entries);

	return err;
}