gdb Debugging Full Example (Tutorial): ncurses

09 Aug 2016

I'm a little frustrated with finding "gdb examples" online that show the commands but not their output. gdb is the GNU Debugger, the standard debugger on Linux. I was reminded of the lack of example output when watching the Give me 15 minutes and I'll change your view of GDB talk by Greg Law at CppCon 2015, which, thankfully, includes output! It's well worth the 15 minutes.

It also inspired me to share a full gdb debugging example, with output and every step involved, including dead ends. This isn't a particularly interesting or exotic issue, it's just a routine gdb debugging session. But it covers the basics and could serve as a tutorial of sorts, bearing in mind there's a lot more to gdb than I used here.

I'll be running the following commands as root, since I'm debugging a tool that needs root access (for now). Substitute non-root and sudo as desired. You also aren't expected to read through all this: I've enumerated each step so you can browse them and find ones of interest.

1. The Problem

The bcc collection of BPF tools had a pull request for cachetop, which uses a top-like display to show page cache statistics by process. Great! However, when I tested it, it hit a segfault:

Note that it says "Segmentation fault" and not "Segmentation fault (core dumped)". I'd like a core dump to debug this. (A core dump is a copy of process memory – the name coming from the era of magnetic core memory – and can be investigated using a debugger.)

Core dump analysis is one approach for debugging, but not the only one. I could run the program live in gdb to inspect the issue. I could also use an external tracer to grab data and stack traces on segfault events. We'll start with core dumps.

2. Fixing Core Dumps

I'll check the core dump settings:

ulimit -c shows the maximum size of core dumps created, and it's set to zero: disabling core dumps (for this process and its children).

The /proc/.../core_pattern is set to just "core", which will drop a core dump file called "core" in the current directory. That will be ok for now, but I'll show how to set this up for a global location:

You can customize that core_pattern further; eg, %h for hostname and %t for time of dump. The options are documented in the Linux kernel source, under Documentation/sysctl/kernel.txt.

To make the core_pattern permanent, and survive reboots, you can set it via "kernel.core_pattern" in /etc/sysctl.conf.

Trying again:

That's better: we have our core dump.

3. Starting GDB

Now I'll run gdb with the target program location (using shell substitution, "`", although you should specify the full path unless you're sure that will work), and the core dump file:

The last two lines are especially interesting: it tells us it's a segmentation fault in the doupdate() function from the libncursesw library. That's worth a quick web search in case it's a well-known issue. I took a quick look but didn't find a single common cause.

I already can guess what libncursesw is for, but if that were foreign to you, then being under "/lib" and ending in ".so.*" shows it's a shared library, which might have a man page, website, package description, etc.

I happen to be debugging this on Ubuntu, but the Linux distro shouldn't matter for gdb usage.

4. Back Trace

Stack back traces show how we arrived at the point of fail, and are often enough to help identify a common problem. It's usually the first command I use in a gdb session: bt (short for backtrace):

Read from bottom up, to go from parent to child. The "??" entries are where symbol translation failed. Stack walking – which produces the stack trace – can also fail. In that case you'll likely see a single valid frame, then a small number of bogus addresses. If symbols or stacks are too badly broken to make sense of the stack trace, then there are usually ways to fix it: installing debug info packages (giving gdb more symbols, and letting it do DWARF-based stack walks), or recompiling the software from source with frame pointers and debugging information (-fno-omit-frame-pointer -g). Many of the above "??" entries can be fixed by adding the python-dbg package.

This particular stack doesn't look very helpful: frames 5 to 17 (indexed on the left) are Python internals, although we can't see the Python methods (yet). Then frame 4 is the _curses library, then we're in libncursesw. Looks like wgetch()->wrefresh()->doupdate(). Just based on the names, I'd guess a window refresh. Why would that core dump?

5. Disassembly

I'll start by disassembling the function we segfaulted in, doupdate():

Output truncated. (I could also have just typed "disas" and it would have defaulted to doupdate.)

The arrow "=>" is pointing to our segfault address, which is doing a mov 0x10(%rsi),%rdi: a move from the memory pointed to in the %rsi register plus an offset of 0x10, to the %rdi register. I'll check the state of the registers next.

6. Check Registers

Printing register state using i r (short for info registers):

Well, %rsi is zero. There's our problem! Zero is unlikely a valid address, and this type of segfault is a common software bug: dereferencing an uninitialized or NULL pointer.

7. Memory Mappings

You can double check if zero is valid using i proc m (short for info proc mappings):

The first valid virtual address is 0x400000. Anything below that is invalid, and if referenced, will trigger a segmentation fault.

At this point there are several different ways to dig further. I'll start with some instruction stepping.

8. Breakpoints

Back to the disassembly:

Reading these four instructions: it looks like it's pulling something from the stack into %rax, then dereferencing %rax into %rsi, the setting %eax to zero (the xor is an optimization, instead of doing a mov of $0), and then we dereference %rsi with an offset, although we know %rsi is zero. This sequence is for walking data structures. Maybe %rax would be interesting, but it's been set to zero by the prior instruction, so we can't see it in the core dump register state.

I can set a breakpoint on doupdate+289, then single-step through each instruction to see how the registers are set and change. First, I need to launch gdb so that we're executing the program live:

Now to set the breakpoint using b (short for break):

Oops. I wanted to show this error to explain why we often start out with a breakpoint on main, at which point the symbols are likely loaded, and then setting the real breakpoint of interest. I'll go straight to doupdatefunction entry, run the problem, then set the offset breakpoint once it hits the function:

We've arrived at our breakpoint.

If you haven't done this before, the r (run) command takes arguments that will be passed to the gdb target we specified earlier on the command line (python). So this ends up running "python".

9. Stepping

I'll step one instruction (si, short for stepi) then inspect registers:

Another clue. So the NULL pointer we're dereferencing looks like it's in a symbol called "cur_term" (p/a is short for print/a, where "/a" means format as an address). Given this is ncurses, is our TERM environment set to something odd?

I tried setting that to vt100 and running the program, but it hit the same segfault.

Note that I've inspected just the first invocation of doupdate(), but it could be called multiple times, and the issue may be a later invocation. I can step through each by running c (short for continue). That will be ok if it's only called a few times, but if it's called a few thousand times I'll want a different approach. (I'll get back to this in section 15.)

10. Reverse Stepping

gdb has a great feature called reverse stepping, which Greg Law included in his talk. Here's an example.

I'll start a python session again, to show this from the beginning:

Now I'll set a breakpoint on doupdate as before, but once it's hit, I'll enable recording, then continue the program and let it crash. Recording adds considerable overhead, so I don't want to add it on main.

At this point I can reverse-step through lines or instructions. It works by playing back register state from our recording. I'll move back in time two instructions, then print registers:

So, back to finding the "cur_term" clue. I really want to read the source code at this point, but I'll start with debug info.

11. Debug Info

This is libncursesw, and I don't have debug info installed (Ubuntu):

I'll add that:

Good, those versions match. So how does our segfault look now?

The stack trace looks a bit different: we aren't really in doupdate(), but ClrBlank(), which has been inlined in ClrUpdate(), and inlined in doupdate().

Now I really want to see source.

12. Source Code

With the debug info package installed, gdb can list the source along with the assembly:

Great! See the arrow "=>" and the line of code above it. So we're segfaulting on "if (back_color_erase)"? That doesn't seem possible. (A segfault would be due to a memory dereference, which in C would be a->b or *a, but in this case it's just "back_color_erase", which looks like it's accessing an ordinary variable and not dereferencing memory.)

At this point I double checked that I had the right debug info version, and re-ran the application to segfault it in a live gdb session. Same place.

Is there something special about back_color_erase? We're in ClrBlank(), so I'll list that source code:

Ah, that's not defined in the function, so it's a global?

13. TUI

It's worth showing how this looks in the gdb text user interface (TUI), which I haven't used that much but was inspired after seeing Greg's talk.

You can launch it using --tui:

# gdb --tui `which python` /var/cores/core.python.30520
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │             [ No Source Available ]                                       │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
None No process In:                                                L??   PC: ?? 
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
---Type  to continue, or q  to quit---

It's complaining about no Python source. I could fix that, but we're crashing in libncursesw. Hitting enter lets it finish loading, at which point it loads the libncursesw debug info source code:

   │1124                                                                       │
   │1125    static NCURSES_INLINE NCURSES_CH_T                                 │
   │1126    ClrBlank(NCURSES_SP_DCLx WINDOW *win)                              │
   │1127    {                                                                  │
   │1128        NCURSES_CH_T blank = blankchar;                                │
  >│1129        if (back_color_erase)                                          │
   │1130            AddAttr(blank, (AttrOf(BCE_BKGD(SP_PARM, win)) & BCE_ATTRS)│
   │1131        return blank;                                                  │
   │1132    }                                                                  │
   │1133                                                                       │
   │1134    /*                                                                 │
   │1135    **      ClrUpdate()                                                │
   │1136    **                                                                 │
multi-thre Thread 0x7f0a3c5e87 In: doupdate            L1129 PC: 0x7f0a37aac40d 
warning: JITed object file architecture unknown is not compatible with target ar
chitecture i386:x86-64.
---Type <return> to continue, or q <return> to quit---
Core was generated by `python ./'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  ClrBlank (win=0x1993060)
    at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129


The arrow ">" shows the line of code that we crashed in. It gets even better: with the layout splitcommand we can follow the source with the disassembly in separate windows:

  >│1129        if (back_color_erase)                                          │
   │1130            AddAttr(blank, (AttrOf(BCE_BKGD(SP_PARM, win)) & BCE_ATTRS)│
   │1131        return blank;                                                  │
   │1132    }                                                                  │
   │1133                                                                       │
   │1134    /*                                                                 │
   │1135    **      ClrUpdate()                                                │
  >│0x7f0a37aac40d <doupdate+301>   mov    0x10(%rsi),%rdi                     │
   │0x7f0a37aac411 <doupdate+305>   cmpb   $0x0,0x1c(%rdi)                     │
   │0x7f0a37aac415 <doupdate+309>   jne    0x7f0a37aac6f7 <doupdate+1047>      │
   │0x7f0a37aac41b <doupdate+315>   movswl 0x4(%rcx),%ecx                      │
   │0x7f0a37aac41f <doupdate+319>   movswl 0x74(%rdx),%edi                     │
   │0x7f0a37aac423 <doupdate+323>   mov    %rax,0x40(%rsp)                     │
   │0x7f0a37aac428 <doupdate+328>   movl   $0x20,0x48(%rsp)                    │
   │0x7f0a37aac430 <doupdate+336>   movl   $0x0,0x4c(%rsp)                     │
multi-thre Thread 0x7f0a3c5e87 In: doupdate            L1129 PC: 0x7f0a37aac40d 

chitecture i386:x86-64.
Core was generated by `python ./'.
Program terminated with signal SIGSEGV, Segmentation fault.
---Type <return> to continue, or q <return> to quit---
#0  ClrBlank (win=0x1993060)
    at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
(gdb) layout split

Greg demonstrated this with reverse stepping, so you can imagine following both code and assembly execution at the same time (I'd need a video to demonstrate that here).

14. External: cscope

I still want to learn more about back_color_erase, and I could try gdb's search command, but I've found I'm quicker using an external tool: cscope. cscope is a text-based source code browser from Bell Labs in the 1980's. If you have a modern IDE that you prefer, use that instead.

Setting up cscope:

# apt-get install -y cscope
# wget
# tar xvf ncurses_6.0+20160213.orig.tar.gz
# cd ncurses-6.0-20160213
# cscope -bqR
# cscope -dq

cscope -bqR builds the lookup database. cscope -dq then launches cscope.

Searching for back_color_erase definition:

Cscope version 15.8b                                   Press the ? key for help

Find this C symbol:
Find this global definition: back_color_erase
Find functions called by this function:
Find functions calling this function:
Find this text string:
Change this text string:
Find this egrep pattern:
Find this file:
Find files #including this file:
Find assignments to this symbol:

Hitting enter:

#define non_dest_scroll_region         CUR Booleans[26]
#define can_change                     CUR Booleans[27]
#define back_color_erase               CUR Booleans[28]
#define hue_lightness_saturation       CUR Booleans[29]
#define col_addr_glitch                CUR Booleans[30]
#define cr_cancels_micro_mode          CUR Booleans[31]

Oh, a #define. (They could have at least capitalized it, as is a common style with #define's.)

Ok, so what's CUR? Looking up definitions in cscope is a breeze.

#define CUR cur_term->type.                                                     

At least that #define is capitalized!

We'd found cur_term earlier, by stepping instructions and examining registers. What is it?

#if 0 && !0
#elif 0
#define cur_term   NCURSES_PUBLIC_VAR(cur_term())

cscope read /usr/include/term.h for this. So, more macros. I had to highlight in bold the line of code I think is taking effect there. Why is there an "if 0 && !0 ... elif 0"? I don't know (I'd need to read more source). Sometimes programmers use "#if 0" around debug code they want to disable in production, however, this looks auto-generated.

Searching for NCURSES_EXPORT_VAR finds:



/* Take care of non-cygwin platforms */
#if !defined(NCURSES_IMPEXP)          
#  define NCURSES_IMPEXP /* nothing */
#if !defined(NCURSES_API)             
#  define NCURSES_API /* nothing */   
#if !defined(NCURSES_EXPORT)          
#if !defined(NCURSES_EXPORT_VAR)      

... and TERMINAL was:

typedef struct term {       /* describe an actual terminal */
    TERMTYPE    type;       /* terminal type description */
    short   Filedes;    /* file description being written to */
    TTY     Ottyb,      /* original state of the terminal */
        Nttyb;      /* current state of the terminal */
    int     _baudrate;  /* used to compute padding */
    char *      _termname;      /* used for termname() */

Gah! Now TERMINAL is capitalized. Along with the macros, this code is not that easy to follow...

Ok, who actually sets cur_term? Remember our problem is that it's set to zero, maybe because it's uninitialized or explicitly set. Browsing the code paths that set it might provide more clues, to help answer why it isn't being set, or why it is set to zero. Using the first option in cscope:

Find this C symbol: cur_term
Find this global definition:
Find functions called by this function:
Find functions calling this function:

And browsing the entries quickly finds:

    TERMINAL *oldterm;

    T((T_CALLED("set_curterm(%p)"), (void *) termp));

    oldterm = cur_term;
    if (SP_PARM)
    SP_PARM->_term = termp;
    CurTerm = termp;
    cur_term = termp;

I added the highlighting. Even the function name is wrapped in a macro. But at least we've found how cur_term is set: via set_curterm(). Maybe that isn't being called?

15. External: perf-tools/ftrace/uprobes

I'll cover using gdb for this in a moment, but I can't help trying the uprobe tool from my perf-tools collection, which uses Linux ftrace and uprobes. One advantage of using tracers is that they don't pause the target process, like gdb does (although that doesn't matter for this example). Another advantage is that I can trace a few events or a few thousand just as easily.

I should be able to trace calls to set_curterm() in libncursesw, and even print the first argument:

# /apps/perf-tools/bin/uprobe 'p:/lib/x86_64-linux-gnu/ %di'
ERROR: missing symbol "set_curterm" in /lib/x86_64-linux-gnu/

Well, that didn't work. Where is set_curterm()? There are lots of ways to find it, like gdb or objdump:

(gdb) info symbol set_curterm
set_curterm in section .text of /lib/x86_64-linux-gnu/

# objdump -tT /lib/x86_64-linux-gnu/ | grep cur_term
0000000000000000      DO *UND*  0000000000000000  NCURSES_TINFO_5.0.19991023 cur_term
# objdump -tT /lib/x86_64-linux-gnu/ | grep cur_term
0000000000228948 g    DO .bss   0000000000000008  NCURSES_TINFO_5.0.19991023 cur_term

gdb works better. Plus if I took a closer look at the source, I would have noticed it was building it for libtinfo.

Trying to trace set_curterm() in libtinfo:

That works. So set_curterm() is called, and has been called four times. The last time it was passed zero, which sounds like it could be the problem.

If you're wondering how I knew the %di register was the first argument, then it comes from the AMD64/x86_64 ABI (and the assumption that this compiled library is ABI compliant). Here's a reminder:

I'd also like to see a stack trace for arg1=0x0 invocation, but this ftrace tool doesn't support stack traces yet.

16. External: bcc/BPF

Since we're debugging a bcc tool,, it's worth noting that bcc's has capabilities like my older uprobe tool:

Yes, we're using bcc to debug bcc!

If you are new to bcc, it's worth checking it out. It provides Python and lua interfaces for the new BPF tracing features that are in the Linux 4.x series. In short, it allows lots of performance tools that were previously impossible or prohibitively expensive to run. I've posted instructions for running it on Ubuntu Xenial.

The bcc tool should have a switch for printing user stack traces, since the kernel now has BPF stack capabilities as of Linux 4.6, although at the time of writing we haven't added this switch yet.

17. More Breakpoints

I should really have used gdb breakpoints on set_curterm() to start with, but I hope that was an interesting detour through ftrace and BPF.

Back to live running mode:

Ok, at this breakpoint we can see that set_curterm() is being invoked with a termp=0x0 argument, thanks to debuginfo for that information. If I didn't have debuginfo, I could just print the registers on each breakpoint.

I'll print the stack trace so that we can see who was setting curterm to 0.

Ok, more clues...I think. We're in llvm::sys::Process::FileDescriptorHasColors(). The llvm compiler?

18. External: cscope, take 2

More source code browsing using cscope, this time in llvm. The FileDescriptorHasColors() function has:

Here's what that code used to be in an earlier version:

It became a "silly dance" involving calling set_curterm() with a null pointer.

19. Writing Memory

As an experiment and to explore a possible workaround, I'll modify memory of the running process to avoid the set_curterm() of zero.

I'll run gdb, set a breakpoint on set_curterm(), and take it to the zero invocation:

At this point I'll use the set command to overwrite memory and replace zero with the previous argument of set_curterm(), 0xbecb90, seen above, on the hope that it's still valid.

WARNING: Writing memory is not safe! gdb won't ask "are you sure?". If you get it wrong or make a typo, you will corrupt the application. Best case, your application crashes immediately, and you realize your mistake. Worst case, your application continues with silently corrupted data that is only discovered years later.

In this case, I'm experimenting on a lab machine with no production data, so I'll continue. I'll print the value of the %rdi register as hex (p/x), then set it to the previous address, print it again, then print all registers:

(Since at this point I have debug info installed, I don't need to refer to registers in this case, I could have called set on "termp", the variable name argument to set_curterm(), instead of $rdi.)

%rdi is now populated, so those registers look ok to continue.

Ok, we survived a call to set_curterm()! However, we've hit another, also with an argument of zero. Trying our write trick again:

(gdb) set $rdi=0xbecb90
(gdb) c
warning: JITed object file architecture unknown is not compatible with target architecture i386:x86-64.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff34ad411 in ClrBlank (win=0xaea060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
1129        if (back_color_erase)

Ahhh. That's what I get for writing memory. So this experiment ended in another segfault.

20. Conditional Breakpoints

In the previous section, I had to use three continues to reach the right invocation of a breakpoint. If that were hundreds of invocations, then I'd use a conditional breakpoint. Here's an example.

I'll run the program and break on set_curterm() as usual:

Now I'll turn breakpoint 1 into a conditional breakpoint, so that it only fires when the %rdi register is zero:

Neat! cond is short for conditional. So why didn't I run it right away, when I first created the "pending" breakpoint? I've found conditionals don't work on pending breakpoints, at least on this gdb version. (Either that or I'm doing it wrong.) I also used i b here (info breakpoints) to list them with information.

21. Returns

I did try another write-like hack, but this time changing the instruction path rather than the data.

WARNING: see previous warning, which also applies here.

I'll take us to the set_curterm() 0x0 breakpoint as before, and then issue a ret (short for return), which will return from the function immediately and not execute it. My hope is that by not executing it, it won't set the global curterm to 0x0.

Another crash. Again, that's what I get for messing in this way.

One more try. After browsing the code a bit more, I want to try doing a ret twice, in case the parent function is also involved. Again, this is just a hacky experiment:

The screen goes blank and pauses...then redraws:

Wow! It's working!

22. A Better Workaround

I'd been posting debugging output to github, especially since the lead BPF engineer, Alexei Starovoitov, is also well versed in llvm internals, and the root cause seemed to be a bug in llvm. While I was messing with writes and returns, he suggested adding the llvm option -fno-color-diagnostics to bcc, to avoid this problem code path. It worked! It was added to bcc as a workaround. (And we should get that llvm bug fixed.)

23. Python Context

At this point we've fixed the problem, but you might be curious to see the stack trace fully fixed.

Adding python-dbg:

Now I'll rerun gdb and view the stack trace:

No more "??"'s, but not hugely more helpful, yet.

The python debug packages have added other capabilities to gdb. Now we can look at the python backtrace:

... and Python source list:

It's identifying where in our Python code we were executing that hit the segfault. That's really nice!

The problem with the initial stack trace is that we're seeing Python internals that are executing the methods, but not the methods themselves. If you're debugging another language, it's up to its complier/runtime how it ends up executing code. If you do a web search for "language name" and "gdb" you might find it has gdb debugging extensions like Python does. If it doesn't, the bad news is you'll need to write your own. The good news is that this is even possible! Search for documentation on "adding new GDB commands in Python", as they can be written in Python.

24. And More

While it might look like I've written comprehensive tour of gdb, I really haven't: there's a lot more to gdb. The help command will list the major sections:

You can then run help on each command class. For example, here's the full listing for breakpoints:

This helps to illustrate how many capabilities gdb has, and how few I needed to use in this example.

25. Final Words

Well, that was kind of a nasty issue: an LLVM bug breaking ncurses and causing a Python program to segfault. But the commands and procedures I used to debug it were mostly routine: viewing stack traces, checking registers, setting breakpoints, stepping, and browsing source.

When I first used gdb (years ago), I really didn't like it. It felt clumsy and limited. gdb has improved a lot since then, as have my gdb skills, and I now see it as a powerful modern debugger. Feature sets vary between debuggers, but gdb may be the most powerful text-based debugger nowadays, with lldb catching up.

I hope anyone searching for gdb examples finds the full output I've shared to be useful, as well as the various caveats I discussed along the way. Maybe I'll post some more gdb sessions when I get a chance, especially for other runtimes like Java.

It's q to quit gdb.

原文 中文译文