Wednesday, July 9, 2014

Adding a system call in X86 QEMU Environment


Linux programmers are often curious about system calls and how they are implemented. As a Linux programmer, I wanted to learn how a system call is implemented by adding one to the kernel and then calling it from my own application, which completes the cycle.
The environment is i386 QEMU with a 3.8 kernel.
Here is the problem statement.
"Just add a system call which takes two integers as input and returns their sum".
Step 1: Download the Linux kernel (version 3.8) from kernel.org and extract it.
Step 2: Add the system call by modifying the kernel
Step 2.1: Add an entry to the file linux-3.8/arch/x86/syscalls/syscall_32.tbl.
#
# 32-bit system call numbers and entry vectors
#
# The format is:
# <number> <abi> <name> <entry point> <compat entry point>
#
# The abi is always "i386" for this file.
#
0       i386    restart_syscall         sys_restart_syscall
1       i386    exit                    sys_exit
2       i386    fork                    sys_fork                        stub32_fork
3       i386    read                    sys_read
.............
.............
348     i386    process_vm_writev       sys_process_vm_writev           compat_sys_process_vm_writev
349     i386    kcmp                    sys_kcmp
350     i386    finit_module            sys_finit_module
351     i386    jeyAdd                  sys_jeyAdd
jeyAdd is the newly added system call. This file lists the system call numbers and the corresponding entry points for i386, as the file name suggests.
Step 2.2: In linux-3.8/include/linux/syscalls.h, add the declaration of sys_jeyAdd():
asmlinkage long sys_jeyAdd(int a, int b);
Step 2.3: Add a new C file, say JeyCalc.c, in linux-3.8/kernel/ with the following content:
/* This file is added for testing system calls flow. */
#include <linux/syscalls.h>

SYSCALL_DEFINE2(jeyAdd, int, arg1, int, arg2)
{
        return (arg1 + arg2);
}
Step 2.4: In the kernel Makefile (linux-3.8/kernel/Makefile), add the object for the new file, JeyCalc.o:
#
# Makefile for the linux kernel.
#

obj-y     = ...
            ... 
            async.o range.o groups.o lglock.o smpboot.o JeyCalc.o
Step 2.5: Build the kernel image (make bzImage).
Step 3: Now let's write an application that uses the newly created system call.
Constraints
  1. The C library does not know about the new system call, so it provides no wrapper for it.
  2. My file system doesn't have the C libraries (I forgot to compile busybox with C libs).
Solution
The syscall() function declared in unistd.h invokes a system call by its number. In our case, the number is 351.

Here is the sample application:
#include <stdio.h>
#include <unistd.h>

int main()
{
        int a = 0;

        printf("\n Test Application for My Syscalls \n");
        a = syscall(351, 10, 20);

        printf("\n Result: %d\n", a);

        return 0;
}
If this application is compiled with -static, the C library is linked into the binary, so the file system does not need to provide it. It works fine.
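The sample application prints whatever syscall() returns, including -1 on failure. Here is a minimal sketch of a wrapper that separates the result from the error, assuming the number 351 from the table in Step 2.1. The names NR_JEYADD, jey_add and report are hypothetical, chosen only for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Assumption: 351 is the number given to jeyAdd in syscall_32.tbl above. */
#define NR_JEYADD 351

/*
 * Hypothetical helper: invokes the new system call and reports the
 * outcome separately from the value. Returns 0 on success, -1 on failure.
 */
int jey_add(int a, int b, long *sum, int *err)
{
        long ret;

        assert(sum != NULL && err != NULL);
        errno = 0;
        ret = syscall(NR_JEYADD, a, b);
        if (ret == -1 && errno != 0) {
                *err = errno;   /* e.g. ENOSYS on a kernel without the patch */
                return -1;
        }
        *sum = ret;
        *err = 0;
        return 0;
}

/* Prints either the sum or the reason the call failed. */
void report(int a, int b)
{
        long sum = 0;
        int err = 0;

        if (jey_add(a, b, &sum, &err) == 0)
                printf("%d + %d = %ld\n", a, b, sum);
        else
                printf("jeyAdd failed: %s\n", strerror(err));
}
```

Calling report(10, 20) from main() on the patched kernel prints the sum; on an unpatched kernel it prints the error text instead. One caveat of returning the sum directly: the C library treats raw return values from -4095 to -1 as error codes, so a small negative sum comes back as -1 with errno set.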


Step 4: Mount the file system image and copy the statically linked application to it.
Step 5: Boot QEMU with the modified kernel's bzImage and the file system image.
Step 6: Go to the application's directory and run it. Yeah!!! You made it.

Monday, June 30, 2014

Create a simple file system



This post is about creating a simple file system and loading it with QEMU (x86).
Requirements
  1. QEMU
  2. Kernel bzImage
  3. BusyBox
Step 1: Install QEMU on an Ubuntu machine
      sudo apt-get install qemu-system
Step 2: Download the kernel from kernel.org and compile it to get a bzImage
Step 3: Download busybox from http://git.busybox.net/busybox/?h=1_22_stable, build it, and create the file system image
      /* Create a folder for holding the root file system */
      $ mkdir myfilesys

      /* Go to the busybox folder and compile */
      $ make menuconfig             <-- In this step select "Statically linked busybox"
      $ make
      $ make install CONFIG_PREFIX=path/to/myfilesys

      /* Go to myfilesys and create the standard directories */
      $ cd path/to/myfilesys
      $ mkdir -p dev etc etc/init.d bin proc mnt tmp var var/shm
      $ chmod 755 . dev etc etc/init.d bin proc mnt tmp var var/shm

      /* dev folder settings (mknod needs root privileges) */
      $ cd dev
      $ mknod tty c 5 0
      $ mknod console c 5 1
      $ chmod 666 tty console
      $ mknod tty0 c 4 0
      $ chmod 666 tty0
      $ mknod ram0 b 1 0
      $ chmod 600 ram0
      $ mknod null c 1 3
      $ chmod 666 null

      /* etc folder settings */
      $ cd ../etc
      $ vi init.d/rcS        <-- Create a file named "rcS" and add the following commands
         #!/bin/sh
         mount -a # Mount the default file systems mentioned in /etc/fstab.
      $ chmod 744 init.d/rcS
      $ vi fstab             <-- Create the fstab file and copy the following content
             proc  /proc      proc    defaults     0      0
             none  /var/shm   shm     defaults     0      0
      $ chmod 644 fstab
      $ vi inittab           <-- Create the inittab file
           ::sysinit:/etc/init.d/rcS
           ::askfirst:/bin/sh
      $ chmod 644 inittab

      /* Go back to the directory that contains myfilesys and create an ext2 filesystem image */
      $ cd ../..
      $ dd if=/dev/zero of=my.img bs=1M count=2
      $ mkfs.ext2 -N 512 my.img

      /* Mount the image (as root) and copy the folders created earlier (myfilesys contents) */
      $ mount -t ext2 -o loop my.img /mnt
      $ cp -fr myfilesys/* /mnt
      $ umount /mnt
Step 4: Now we have the bzImage, the file system image, and QEMU installed. Issue the following command to boot them.
     $ qemu-system-i386 -kernel bzImage -hda my.img -append "root=/dev/sda"
Now QEMU loads the newly built kernel and file system. Hurrayyyyyyy !!! We made it. If you want to explore further, try creating new directories or doing some extra stuff :)

Monday, June 23, 2014

kmemleak in ubuntu


  1. kmemleak is a kernel debugging tool that collects information about memory leaks.
  2. kmemleak is the kernel counterpart of valgrind's memcheck --leak-check.
  3. Orphan objects are not freed; they are only reported via /sys/kernel/debug/kmemleak.
  4. The kernel must be compiled with CONFIG_DEBUG_KMEMLEAK.
Follow the instructions at http://mitchtech.net/compile-linux-kernel-on-ubuntu-12-04-lts-detailed/ to compile a new kernel and install it on an Ubuntu machine.
Step 1: Switch to a root shell with sudo -i
Step 2: Check that kmemleak is available using dmesg | grep kmemleak
dmesg | grep kmemleak
[    1.000175] kmemleak: Kernel memory leak detector initialized
[    1.000274] kmemleak: Automatic memory scanning thread started
Step 3: Change the permissions of /sys/kernel/debug/kmemleak. By default, it is read-only.
$ ls -l /sys/kernel/debug/kmemleak 
-r--r--r-- 1 root root 0 Jun 23 13:23 /sys/kernel/debug/kmemleak

$ chmod 777 /sys/kernel/debug/kmemleak
$ ls -l /sys/kernel/debug/kmemleak 
-rwxrwxrwx 1 root root 0 Jun 23 13:23 /sys/kernel/debug/kmemleak
Step 4: Compile the following kernel module
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>

MODULE_LICENSE("GPL");

static int __init ourinitmodule(void)
{
        int *a = NULL, *b = NULL;
        printk(KERN_ALERT "\n Welcome to sample application.... \n");
        b = kmalloc(1024, GFP_KERNEL);    /* Never freed: intentional leak for testing kmemleak */
        a = kmalloc(1024, GFP_KERNEL);
        if (!a)                           /* kmalloc can fail; do not dereference NULL */
                return -ENOMEM;
        a[0] = 10;
        kfree(a);
        return 0;
}

static void __exit ourcleanupmodule(void)
{
        printk(KERN_ALERT "\n Thanks....Exiting Application. \n");
}

module_init(ourinitmodule);
module_exit(ourcleanupmodule);
Step 5: Insert the module with insmod and unload it with rmmod
Step 6: Wait for the following message in dmesg
[  325.438226]
[  325.438226]  Welcome to sample application....
[  360.964221]
[  360.964221]  Thanks....Exiting Application.
[ 1263.301682] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak scans memory every 10 minutes by default, so wait about 10 minutes for this message. (Note: the scan period can be changed; we will discuss this later.)
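As a quick preview of the adjustable scan mentioned above: kmemleak accepts commands written to the same debugfs file. Writing scan requests an immediate scan and scan=<secs> changes the scan period. Below is a minimal C sketch, assuming debugfs is mounted at /sys/kernel/debug and the program runs as root. The helper names kmemleak_command and kmemleak_scan_now, and the 60 second period, are illustrative choices.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define KMEMLEAK_FILE "/sys/kernel/debug/kmemleak"

/*
 * Hypothetical helper: writes one command string to the given kmemleak
 * control file. Returns 0 on success, -1 on failure.
 */
int kmemleak_command(const char *path, const char *cmd)
{
        size_t len;
        ssize_t written;
        int fd;

        assert(path != NULL && cmd != NULL);
        len = strlen(cmd);
        fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;
        written = write(fd, cmd, len);
        close(fd);
        return (written == (ssize_t)len) ? 0 : -1;
}

/* Requests an immediate scan, then sets the scan period to 60 seconds. */
int kmemleak_scan_now(void)
{
        if (kmemleak_command(KMEMLEAK_FILE, "scan") != 0)
                return -1;
        return kmemleak_command(KMEMLEAK_FILE, "scan=60");
}
```

The same effect can be had from the root shell with echo scan > /sys/kernel/debug/kmemleak.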
Step 7: Print the memory leak report with cat /sys/kernel/debug/kmemleak
unreferenced object 0xe7801800 (size 1024):
  comm "insmod", pid 2700, jiffies 6359 (age 2367.608s)
  hex dump (first 32 bytes):
    00 1c 80 e7 24 0a 30 ff 24 0a 30 ff 24 0a 30 ff  ....$.0.$.0.$.0.
    24 0a 30 ff 24 0a 30 ff 24 0a 30 ff 24 0a 30 ff  $.0.$.0.$.0.$.0.
  backtrace:
    [<c15da9ec>] kmemleak_alloc+0x2c/0x60
    [<c114ae06>] kmem_cache_alloc_trace+0x96/0x130
    [<f847c028>] 0xf847c028
    [<c1003132>] do_one_initcall+0x112/0x160
    [<c10acb4a>] load_module+0x1e8a/0x2660
    [<c10ad398>] sys_init_module+0x78/0xb0
    [<c15f850d>] sysenter_do_call+0x12/0x28
    [<ffffffff>] 0xffffffff
The log reports one unreferenced object of 1024 bytes, allocated while insmod was running the module's init function.
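When a report contains many entries, the per-object sizes can be added up mechanically. The sketch below assumes the header line format shown in the report above ("unreferenced object 0x... (size N):"). The function names kmemleak_parse_size and kmemleak_total_bytes are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Extracts N from "unreferenced object 0x... (size N):".
 * Returns 1 if the line is such a header, 0 otherwise. */
int kmemleak_parse_size(const char *line, unsigned long *size)
{
        assert(line != NULL && size != NULL);
        return sscanf(line, "unreferenced object 0x%*[0-9a-fA-F] (size %lu):",
                      size) == 1;
}

/* Sums the sizes of all objects in a report stream and stores the
 * number of objects in *objects. Returns the total in bytes. */
unsigned long kmemleak_total_bytes(FILE *report, unsigned long *objects)
{
        char line[256];
        unsigned long size = 0, total = 0;

        assert(report != NULL && objects != NULL);
        *objects = 0;
        while (fgets(line, sizeof(line), report)) {
                if (kmemleak_parse_size(line, &size)) {
                        total += size;
                        (*objects)++;
                }
        }
        return total;
}
```

To use it, open /sys/kernel/debug/kmemleak with fopen() as root and pass the stream to kmemleak_total_bytes().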

Friday, June 20, 2014

Linux kernel interview questions - 1


I have listed a few questions that give a basic idea of Linux kernel programming. Kernel interviews are mostly about the work you have done in the past. For example, if you have written I2C and SPI drivers, most of the questions will come from those interfaces. If the interviewer wants to gauge your general kernel knowledge, he/she may ask general kernel-related questions. I will update these questions on a regular basis.

1. What are __init and __initdata?

       These macros mark functions (__init) or initialized data (__initdata; this does not apply to uninitialized data) as needed only during initialization. The kernel takes this as a hint that the function or data is used only during the initialization phase and frees the memory it occupies afterwards.

__init is defined as

#define __init   __section(.init.text) __cold notrace

and __section() in turn expands to

#define __section(S) __attribute__ ((__section__(#S)))
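The effect of a section attribute can be tried out in user space. This is only a sketch under assumptions, not kernel code. It assumes GCC or Clang with a GNU-compatible linker on Linux, which defines __start_<name> and __stop_<name> symbols for any section whose name is a valid C identifier. The names my_initcalls, my_initcall, first_init and second_init are made up for illustration. It collects function pointers in a dedicated section and then walks them, loosely similar to how the kernel groups init code in its own sections.

```c
#include <assert.h>
#include <stdio.h>

typedef int (*initcall_t)(void);

/* Places a pointer to fn in the "my_initcalls" section (hypothetical name). */
#define my_initcall(fn) \
        static initcall_t __initcall_##fn \
        __attribute__((used, section("my_initcalls"))) = fn

int calls_seen;

static int first_init(void)
{
        calls_seen++;
        puts("first_init");
        return 0;
}

static int second_init(void)
{
        calls_seen++;
        puts("second_init");
        return 0;
}

my_initcall(first_init);
my_initcall(second_init);

/* Section boundaries, provided by the linker. */
extern initcall_t __start_my_initcalls[];
extern initcall_t __stop_my_initcalls[];

/* Calls every function recorded in the section; returns how many ran. */
int run_initcalls(void)
{
        initcall_t *fn;
        int count = 0;

        for (fn = __start_my_initcalls; fn < __stop_my_initcalls; fn++) {
                assert(*fn != NULL);
                (*fn)();
                count++;
        }
        return count;
}
```

Unlike the kernel, this sketch never frees the section after the calls have run; it only shows how code or data ends up grouped by section name.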

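2. What are module_init() and module_exit()?

These macros tell the kernel which functions are a module's initialization and cleanup entry points. module_init() names the function to run when the module is loaded (or during boot, if the code is built in), and module_exit() names the function to run when the module is removed.
In 3.x kernels they are defined in include/linux/init.h.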


3. What are EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL()?

           If a programmer wants some symbols (functions or data) to be usable from other kernel modules, those symbols must be exported with one of these macros. As the name implies, EXPORT_SYMBOL_GPL() exports a symbol only to modules that declare a GPL-compatible license.

4. What are modprobe, insmod, rmmod and depmod?
5. What is the initcall mechanism?
6. Which function is the first to be called in the Linux kernel?





Friday, May 23, 2014

ftrace - The kernel function tracer - 2


In this post, we discuss the function and function_graph tracers of ftrace.
To find out which tracers are available, simply cat the available_tracers file in the tracing directory:
# cat available_tracers 
blk function_graph mmiotrace wakeup_rt wakeup function nop
By default, no tracer is set.
# cat current_tracer 
nop
Let's set the current tracer to function.
# echo function > current_tracer 
# cat current_tracer 
function
Enable tracing
# echo 1 > tracing_on
View the trace log
# cat trace | head -50
# tracer: function
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
        metacity-1920  [000]  5127.451220: __pollwait <-unix_poll
        metacity-1920  [000]  5127.451220: add_wait_queue <-__pollwait
        metacity-1920  [000]  5127.451220: _raw_spin_lock_irqsave <-add_wait_queue
        metacity-1920  [000]  5127.451221: _raw_spin_unlock_irqrestore <-add_wait_queue
        metacity-1920  [000]  5127.451221: fput <-do_poll.isra.5
        metacity-1920  [000]  5127.451221: poll_freewait <-do_sys_poll
        metacity-1920  [000]  5127.451221: remove_wait_queue <-poll_freewait
        metacity-1920  [000]  5127.451222: _raw_spin_lock_irqsave <-remove_wait_queue
        metacity-1920  [000]  5127.451222: _raw_spin_unlock_irqrestore <-remove_wait_queue
        metacity-1920  [000]  5127.451222: fput <-poll_freewait
        metacity-1920  [000]  5127.451223: sys_writev <-syscall_call
        metacity-1920  [000]  5127.451223: fget_light <-sys_writev
        metacity-1920  [000]  5127.451223: vfs_writev <-sys_writev
        metacity-1920  [000]  5127.451224: do_readv_writev <-vfs_writev
        metacity-1920  [000]  5127.451224: rw_copy_check_uvector <-do_readv_writev
Each line reads "callee <-caller", so the above log contains, for example, the following call sequence.
unix_poll --> __pollwait --> add_wait_queue --> _raw_spin_lock_irqsave 
                                                _raw_spin_unlock_irqrestore 
Having trouble working out the calling sequence? Set function_graph as the current tracer.
# echo function_graph > current_tracer
The function graph includes timing information.
# cat trace | head -50
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 2)   0.104 us    |            _raw_spin_lock_irqsave();
 2)   0.104 us    |            _raw_spin_unlock_irqrestore();
 2)               |            ep_poll_readyevents_proc() {
 2)               |              ep_scan_ready_list.isra.8() {
 2)               |                mutex_lock() {
 2)   0.065 us    |                  _cond_resched();
 2)   0.602 us    |                }
 2)   0.070 us    |                _raw_spin_lock_irqsave();
 2)   0.107 us    |                _raw_spin_unlock_irqrestore();
 2)   0.068 us    |                ep_read_events_proc();
 2)   0.070 us    |                _raw_spin_lock_irqsave();
 2)   0.106 us    |                _raw_spin_unlock_irqrestore();
 2)   0.068 us    |                mutex_unlock();
 2)   4.642 us    |              }
 2)   5.244 us    |            }
Now I'm interested in listing the kernel functions used by the ls command. Four commands are executed on a single line.
# echo > trace && echo 1 > tracing_on && ls && echo 0 > tracing_on
  ------------    -------------------    --    -------------------
    |                   |                              | 
    |                   |                              |
    |                   V                              V
    |              Enable Tracing               Disable tracing after "ls"  
    V                 
 Clear trace Log 
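The same clear, enable, run, disable sequence can be scripted from C. This is a minimal sketch, assuming the tracing directory is /sys/kernel/debug/tracing and the program runs as root. The helper names tracing_write and trace_command are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define TRACING_DIR "/sys/kernel/debug/tracing"

/* Hypothetical helper: overwrites <dir>/<file> with value.
 * Returns 0 on success, -1 on failure. */
int tracing_write(const char *dir, const char *file, const char *value)
{
        char path[512];
        FILE *fp;
        int n, ok;

        assert(dir != NULL && file != NULL && value != NULL);
        n = snprintf(path, sizeof(path), "%s/%s", dir, file);
        if (n < 0 || (size_t)n >= sizeof(path))
                return -1;
        fp = fopen(path, "w");   /* "w" truncates, like `echo > file` */
        if (!fp)
                return -1;
        ok = (fputs(value, fp) >= 0);
        if (fclose(fp) != 0)
                ok = 0;
        return ok ? 0 : -1;
}

/* Clears the log, enables tracing, runs the command, disables tracing.
 * Returns the status from system(), or -1 if a tracing file could not
 * be written. */
int trace_command(const char *dir, const char *command)
{
        int status;

        if (tracing_write(dir, "trace", "") != 0)        /* clear trace log */
                return -1;
        if (tracing_write(dir, "tracing_on", "1") != 0)  /* enable tracing */
                return -1;
        status = system(command);
        if (tracing_write(dir, "tracing_on", "0") != 0)  /* disable tracing */
                return -1;
        return status;
}
```

A call such as trace_command(TRACING_DIR, "ls") corresponds to the one-liner above, except that ls runs as a child of the C program instead of the shell.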
Let's see the log.
# vim trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)   0.268 us    |  __fsnotify_parent();
 1)               |  fsnotify() {
 1)   0.088 us    |    __srcu_read_lock();
 1)   0.064 us    |    __srcu_read_unlock();
 1)   1.354 us    |  }
 .....
 .....
 1)               |            mem_cgroup_charge_common() {
 1)   0.066 us    |              lookup_page_cgroup();
 ------------------------------------------
 2)    <idle>-0    =>    ls-2556                          <-- Here starts ls functions 
 ------------------------------------------

 2)               |  schedule_tail() {
 1)   0.101 us    |              __mem_cgroup_try_charge();
 2)   0.252 us    |    finish_task_switch();
 1)               |              __mem_cgroup_commit_charge() {
 2)   0.081 us    |    _cond_resched();
 .....
 .....
 1)   0.065 us    |            native_pte_clear();
 2)   0.109 us    |                native_load_sp0();
 1)   0.447 us    |            native_flush_tlb_single();
 2)   0.089 us    |                native_load_tls();
 ------------------------------------------
 2)    ls-2556     =>    <idle>-0                 <-- Context switch to idle task
 ------------------------------------------    
 2)   0.187 us    |      finish_task_switch();
 1)   0.359 us    |    fget_light();
 ....
 .... 
 ------------------------------------------
 2)    <idle>-0    =>    ls-2556               <-- Context switch to "ls"
 ------------------------------------------

 2)   0.188 us    |                finish_task_switch();
 1)               |    security_file_ioctl() {
 1)   0.066 us    |      cap_file_ioctl();
 ....
 ....
 0)               |        __copy_from_user_ll() {
 0)   0.059 us    |          __copy_from_user_ll.part.1();
 0)   0.399 us    |        }
 0)   1.050 us    |      }
It is easier to understand an implementation by following its code flow. The timing information is very useful for optimizing our code.
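To put the timing information to use, the durations can be pulled out of the function_graph output programmatically. The sketch below assumes the line layout shown in the logs above. It also assumes that a single marker character such as + or ! may precede long durations, as in the ftrace documentation. The function names parse_graph_line and slowest_entry are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parses one function_graph line. Returns 1 and fills cpu/usecs when the
 * line carries a duration, 0 for headers, separators and entry lines. */
int parse_graph_line(const char *line, int *cpu, double *usecs)
{
        char mark;

        assert(line != NULL && cpu != NULL && usecs != NULL);
        /* Plain form: " 2)   0.104 us    |  func();" */
        if (sscanf(line, " %d) %lf us", cpu, usecs) == 2)
                return 1;
        /* Marked form: " 0) ! 164785.8 us |  } ..." */
        if (sscanf(line, " %d) %c %lf us", cpu, &mark, usecs) == 3 &&
            strchr("+!#*@$", mark) != NULL)
                return 1;
        return 0;
}

/* Copies the line with the largest duration into out and returns that
 * duration in microseconds, or -1.0 if no line carried a duration. */
double slowest_entry(FILE *trace, char *out, size_t outlen)
{
        char line[512];
        double best = -1.0, usecs = 0.0;
        int cpu = 0;

        assert(trace != NULL && out != NULL && outlen > 0);
        out[0] = '\0';
        while (fgets(line, sizeof(line), trace)) {
                if (parse_graph_line(line, &cpu, &usecs) && usecs > best) {
                        best = usecs;
                        snprintf(out, outlen, "%s", line);
                }
        }
        return best;
}
```

Keep in mind that a closing brace line carries the total time of the whole nested call, so the slowest entry is often an outer function rather than a leaf.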

Thursday, May 22, 2014

Platform Device Driver - a practical approach - 1


Platform devices are inherently not discoverable, i.e. the hardware cannot say "Hey! I'm present!" to the software. For example, PCI and USB devices are self-discoverable, but I2C devices are not.
In the embedded and system-on-chip world, non-discoverable devices are, if anything, increasing in number. So the kernel still needs to provide ways to be told about the hardware that is actually present. Platform devices have long been used in this role in the kernel.
There are two important components here.
  1. Platform Driver
  2. Platform Device
Platform Driver - the set of operations performed on a device
The Linux kernel defines a set of standard operations that can be performed on a platform device.
Refer to http://lxr.free-electrons.com/source/include/linux/platform_device.h#L173
At a minimum, the probe() and remove() callbacks must be supplied; the other callbacks have to do with power management and should be provided if they are relevant.
static int sample_drv_probe(struct platform_device *pdev)
{
   /* Empty probe function: just report success. */
   return 0;
}
static int sample_drv_remove(struct platform_device *pdev)
{
   /* Empty remove function: just report success. */
   return 0;
}

static struct platform_driver sample_pldriver = {
    .probe          = sample_drv_probe,
    .remove         = sample_drv_remove,
    .driver = {
            .name  = DRIVER_NAME,
    },
};
In the above code, just make a note of DRIVER_NAME. We will discuss it a bit later.
So now a platform driver with two operations (probe and remove) is ready. This driver must be registered with the kernel.
NOTE: As you probably know, every loadable device driver is basically a kernel module. To make our code complete, we move the platform driver into a hello-world kernel module.
#include <linux/module.h>
#include <linux/kernel.h>

MODULE_LICENSE("GPL");

int ourinitmodule(void)
{
    printk(KERN_ALERT "\n Welcome to sample Platform driver.... \n");
    return 0;
}

void ourcleanupmodule(void)
{
    printk(KERN_ALERT "\n Thanks....Exiting sample Platform driver... \n");
    return;
}

module_init(ourinitmodule);
module_exit(ourcleanupmodule);
Now add the platform driver related code to the kernel module.
#include <linux/module.h>
#include <linux/kernel.h>

//for platform drivers....
#include <linux/platform_device.h>

#define DRIVER_NAME "Sample_Pldrv"

MODULE_LICENSE("GPL");

/**************/ 
static int sample_drv_probe(struct platform_device *pdev){
    return 0;   /* nothing to set up yet */
}
static int sample_drv_remove(struct platform_device *pdev){
    return 0;   /* nothing to tear down yet */
}

static struct platform_driver sample_pldriver = {
    .probe          = sample_drv_probe,
    .remove         = sample_drv_remove,
    .driver = {
            .name  = DRIVER_NAME,
    },
};
/**************/  

int ourinitmodule(void)
{
    printk(KERN_ALERT "\n Welcome to sample Platform driver.... \n");

    /* Registering with Kernel; a failure is passed on to the module loader */
    return platform_driver_register(&sample_pldriver);
}

void ourcleanupmodule(void)
{
    printk(KERN_ALERT "\n Thanks....Exiting sample Platform driver... \n");

    /* Unregistering from Kernel */
    platform_driver_unregister(&sample_pldriver);

    return;
}

module_init(ourinitmodule);
module_exit(ourcleanupmodule);
Now a driver named "Sample_Pldrv" is ready with two empty functions.
Platform Device - information about the device
The kernel learns about a device's resources, such as its IRQ number and memory regions, when a platform device is registered. We wrote the platform driver earlier to operate on this device, right? To bind the platform device to that driver, the device must be registered with the same name as the driver. In our case, "Sample_Pldrv".
/* Specifying my resources information */
static struct resource sample_resources[] = {
        {
                .start          = RESOURCE1_START_ADDRESS,
                .end            = RESOURCE1_END_ADDRESS,
                .flags          = IORESOURCE_MEM,
        },
        {
                .start          = RESOURCE2_START_ADDRESS,
                .end            = RESOURCE2_END_ADDRESS,
                .flags          = IORESOURCE_MEM,
        },
        {
                .start          = SAMPLE_DEV_IRQNUM,
                .end            = SAMPLE_DEV_IRQNUM,
                .flags          = IORESOURCE_IRQ,
        },
};

static struct platform_device sample_device = {
        .name           = DRIVER_NAME,
        .id             = -1,
        .num_resources  = ARRAY_SIZE(sample_resources),
        .resource       = sample_resources,
};
Here, two memory regions and one IRQ are described. Now do you see the role of DRIVER_NAME?
The device and the driver can live in a single module or in separate modules. It's your choice. But probe() is called only when both the device and the driver are available.
Say you loaded only the driver. When a device with the same name is registered, probe() will be called.
Refer to https://github.com/jeyaramvrp/kernel-module-programming/tree/master/sample-platform-driver for simple platform driver and device code written as separate kernel modules.
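The binding rule described above (probe() runs once a device and a driver with the same name are both registered, in either order) can be modelled in plain user-space C. This is only an illustration of the idea and not the kernel's implementation. All names (model_device, model_driver, model_device_register, model_driver_register, sample_probe) are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 8

struct model_device { const char *name; int bound; };
struct model_driver { const char *name; int (*probe)(struct model_device *dev); };

static struct model_device *devices[MAX_ENTRIES];
static struct model_driver *drivers[MAX_ENTRIES];
static int ndevices, ndrivers;

int probe_calls;

/* Simplified match rule: device and driver names must be equal. */
static int model_match(const struct model_device *dev,
                       const struct model_driver *drv)
{
        return strcmp(dev->name, drv->name) == 0;
}

/* Binds dev to drv if it is still unbound, matches, and probe succeeds. */
static void try_bind(struct model_device *dev, struct model_driver *drv)
{
        if (!dev->bound && model_match(dev, drv) && drv->probe(dev) == 0)
                dev->bound = 1;
}

/* Registers a device and offers it to every known driver. */
int model_device_register(struct model_device *dev)
{
        int i;

        assert(dev != NULL && dev->name != NULL);
        if (ndevices == MAX_ENTRIES)
                return -1;
        devices[ndevices++] = dev;
        for (i = 0; i < ndrivers; i++)
                try_bind(dev, drivers[i]);
        return 0;
}

/* Registers a driver and offers every known device to it. */
int model_driver_register(struct model_driver *drv)
{
        int i;

        assert(drv != NULL && drv->name != NULL && drv->probe != NULL);
        if (ndrivers == MAX_ENTRIES)
                return -1;
        drivers[ndrivers++] = drv;
        for (i = 0; i < ndevices; i++)
                try_bind(devices[i], drv);
        return 0;
}

/* Sample probe callback: counts calls and reports the device name. */
int sample_probe(struct model_device *dev)
{
        probe_calls++;
        printf("probe called for %s\n", dev->name);
        return 0;
}
```

Registering a driver named "Sample_Pldrv" first and a device with the same name later triggers sample_probe() at device registration, mirroring the "driver loaded first" case from the text.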

Tuesday, May 20, 2014

ftrace - The kernel function tracer - 1



1. Introduction
  • Ftrace is a tracing utility built directly into the Linux kernel.
  • It is designed to help developers and system designers find out what is going on inside the kernel.
  • The output shown below was captured on an Ubuntu 12.04 virtual machine.
2. Build the kernel with the following configuration (Ubuntu kernels already have it enabled)
CONFIG_FUNCTION_GRAPH_TRACER
CONFIG_STACK_TRACER
CONFIG_DYNAMIC_FTRACE
CONFIG_FUNCTION_TRACER
3. Mount the debugfs
# mount -t debugfs nodev /sys/kernel/debug
# mount | grep debugfs
5:none on /sys/kernel/debug type debugfs (rw)
4. Enter the root shell
$ sudo  -i 
You will be asked for your password, then be given a root shell. In that shell, you can cd to /sys/kernel/debug.
5. Sample Execution

# pwd
/sys/kernel/debug/tracing

# cat available_tracers 
blk function_graph mmiotrace wakeup_rt wakeup function nop

# echo function_graph > current_tracer

# cat current_tracer 
function_graph

# cat trace | head -20
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0) ! 164785.8 us |      } /* native_safe_halt */
 0) ! 164786.9 us |    } /* default_idle.part.4 */
 0) ! 164787.6 us |  } /* default_idle */
 0)   0.062 us    |  local_touch_nmi();
 0)               |  cpuidle_idle_call() {
 0)   0.094 us    |    cpuidle_get_driver();
 0)   0.690 us    |  }
 0)               |  default_idle() {
 0)               |    default_idle.part.4() {
 0)               |      native_safe_halt() {
 0)   ==========> |
 0)               |        do_IRQ() {
 0)               |          irq_enter() {
 0)               |            rcu_irq_enter() {
 0)   0.231 us    |              rcu_exit_nohz();
 0)   0.995 us    |            }