Executor Pool with PTHREADS in ANSI C

Question

I am writing a program in ANSI C (1972) and I have to use a fixed number of threads. Basically, I read a big file of records, like a .csv with latitude and longitude data, and I have to process them. The problem is that I cannot wait 2 weeks on a 100,000,000-line file, so I need to use threads or forking.

Basically, I read the .txt file like this:

FILE *file2 = fopen ( lat_long_file, "r" );
if (file2 != NULL)
{
    char line2 [128];

    while (fgets(line2, sizeof line2, file2) != NULL)
    {
        //fputs(line2, stdout);

        char *this_record = trimqq(line2);

        // .....
        // ..... STUFF TO DO (here i must send data to thread function like in JAVA)
        // Thread temp_thread = new Thread(new ThreadClass(arguments ....));
        // temp_thread.start(); <- this is how i would do if i was programming in JAVA
        // .....

    }
}

main_1.c (threading with pthread.h )

#include <pthread.h>
#include <stdint.h>   // intptr_t, used when passing the loop index to pthread_create
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS     10

static int current_threads = 0;


void *wait(void *t)
{
   int i;
   long tid;

   tid = (long)t;

   // sleep(1);

   system("sleep 3; date;");

   printf("Sleeping in thread\n");
   printf("Thread with id %ld  ...exiting\n",tid);

   pthread_exit(NULL);
}

int main ()
{
   int rc;
   int i;
   pthread_t threads[NUM_THREADS];
   pthread_attr_t attr;
   void *status;

   // Initialize and set thread joinable
   pthread_attr_init(&attr);
   pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

   for( i=0; i < NUM_THREADS; i++ )
   {
     // cout << "main() : creating thread, " << i << endl;
      rc = pthread_create(&threads[i], &attr, wait, (void *)(intptr_t)i );

      if (rc)
      {
        // cout << "Error:unable to create thread," << rc << endl;
         exit(-1);
      }
   }

    // free attribute and wait for the other threads
    pthread_attr_destroy(&attr);
    for( i=0; i < NUM_THREADS; i++ )
    {
        rc = pthread_join(threads[i], &status);
        if (rc)
        {
            printf("Error:unable to join %d\n",rc);
            exit(-1);
        }

        printf("Main: completed thread id : %d",i);
        printf(" exiting with status : %p\n",status);
    }

    printf("Main: program exiting.\n");

    pthread_exit(NULL);
}

The output I am getting is:

Sleeping in thread
Sleeping in thread
Thread with id 5  ...exiting
Sleeping in thread
Thread with id 0  ...exiting
Sleeping in thread
Sleeping in thread
Sleeping in thread
Thread with id 9  ...exiting
Thread with id 1  ...exiting
Sleeping in thread
Sleeping in thread
Thread with id 7  ...exiting
Thread with id 3  ...exiting
Thread with id 2  ...exiting
Thread with id 6  ...exiting
Sleeping in thread
Thread with id 4  ...exiting
Sleeping in thread
Thread with id 8  ...exiting
Main: completed thread id : 0 exiting with status : (nil)
Main: completed thread id : 1 exiting with status : (nil)
Main: completed thread id : 2 exiting with status : (nil)
Main: completed thread id : 3 exiting with status : (nil)
Main: completed thread id : 4 exiting with status : (nil)
Main: completed thread id : 5 exiting with status : (nil)
Main: completed thread id : 6 exiting with status : (nil)
Main: completed thread id : 7 exiting with status : (nil)
Main: completed thread id : 8 exiting with status : (nil)
Main: completed thread id : 9 exiting with status : (nil)
Main: program exiting.

And the execution time is 3 seconds.

If I change system("sleep 3; date;"); to system("sleep 10; date;");, the execution time becomes 10 seconds, while I expected a sleep on every call of the void *wait(void *t) function ...

main_2_fork (I also tried fork, but to no avail)

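#include  <stdio.h>
#include  <string.h>
#include  <sys/types.h>
#include  <sys/wait.h>   // wait()
#include  <stdlib.h>
#include  <time.h>       // time(), used to seed srand()
#include  <unistd.h>     // fork(), write()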

#define   MAX_COUNT  200
#define   BUF_SIZE   100

int random_number(int min_num, int max_num);

int  main(void)
{
    int numforks = 0;
    int maxf = 5;
    int status;

    char   buf[BUF_SIZE];

    pid_t PID; 

    int job = 0;
    for(job; job <= 10; job++)
    {
        // fork() = make a copy of this program from this line to the bottom
        PID = fork();

        int fork_id = random_number(1000000,9999999);

        if (PID < 0) 
        {
            // if -1 then couldn't fork ....
            fprintf(stderr, "[!] Couldn't fork!\n");
            exit(1);
        }
        if (( PID == 0 ))
        {
            // 0 = has created a child process
            exit(0);
        }
        else            
        {
            // means that PID is 1 2 3 .... 30000 44534 534634 .... whatever
            // increment the fork count
            numforks++;

            sprintf(buf, "FORK[#%d] BEGIN pid=%d num_forks=%d\n",fork_id,PID,numforks);
            write(1, buf, strlen(buf));

            // sleep(random_number(1,2));

            char str[300];
            sprintf(str,"sleep %d; ps ax | wc -l",random_number(1,4));
            puts(str);

            // OUTPUT COMMAND BEGIN
            FILE *command_execute = popen(str, "r");
            char buf[256];
            int increment = 0;
            while (fgets(buf, sizeof(buf), command_execute) != 0)
            {
                printf("LINE[%d]:%s",increment,buf);
                increment++;
                break;
            }
            pclose(command_execute);
            // OUTPUT COMMAND END   

            // block to not do extra forks
            if (numforks > maxf)
            {
                for (numforks; numforks > maxf; numforks--)
                {
                    PID = wait(&status);
                }
            }

            sprintf(buf, "FORK[#%d] END pid=%d num_forks=%d\n",fork_id,PID,numforks);
            write(1, buf, strlen(buf));
        }

        // sleep(1);
    }
}

int random_number(int min_num, int max_num)
{
    int result=0,low_num=0,hi_num=0;
    if(min_num<max_num)
    {
        low_num=min_num;
        hi_num=max_num+1; // this is done to include max_num in output.
    }
    else
    {
        low_num=max_num+1;// this is done to include max_num in output.
        hi_num=min_num;
    }
    srand(time(NULL));
    result = (rand()%(hi_num-low_num))+low_num;
    return result;
}

The output is:

FORK[#7495656] BEGIN pid=29291 num_forks=1
sleep 1; ps ax | wc -l
LINE[0]:312
FORK[#7495656] END pid=29291 num_forks=1
FORK[#9071759] BEGIN pid=29296 num_forks=2
sleep 4; ps ax | wc -l
LINE[0]:319
FORK[#9071759] END pid=29296 num_forks=2
FORK[#2236079] BEGIN pid=29330 num_forks=3
sleep 4; ps ax | wc -l

......

And the execution is not parallel; rather, it runs one job after another, even though I can see that fork() is creating child processes in ps ax | grep 'fork2.exe' ...

Here is an example of what I want: http://www.javacodegeeks.com/2013/01/java-thread-pool-example-using-executors-and-threadpoolexecutor.html

There you set, let's say, 5 as the maximum number of threads at a time.

QUESTIONS

Why is the void *wait(void *t) function not sleeping properly? Why is pthread executing them one by one rather than in parallel?
What should I do to make a thread pool with a fixed maximum number of threads in C?

Thank you very much.

Answer 1

I cannot comment yet, so I'll reply here. Your threaded example takes exactly as long as one thread (your wait() function) sleeps. That said, it would have been clearer if you had written it this way:

void *some_running_task(void *t)
{
   long tid = (long)t;   // thread index, passed by value through the void * argument

   printf("Sleeping in thread #%ld ...\n", tid);
   system("sleep 3; date;");   // blocks only this thread, not the others

   printf("Thread with #%ld ... exiting\n", tid);
   pthread_exit(NULL);
}

As @fuzxxl says, there is already a wait() function in POSIX (it waits for child processes and is declared in <sys/wait.h>), so you should not reuse that name for your own function.

All your threads start at practically the same instant, within perhaps a few tens of microseconds of each other. Since they run concurrently, they all finish about 3 seconds later. Change the sleep to 10 seconds and your program lasts 10 seconds.

What you probably want is a pool of threads that keeps a fixed number of threads busy until all the work is done: start a new thread whenever the pool is below its maximum count, for as long as there is data to process. Synchronising a thread pool is prone to deadlocks, though. You might as well have each thread process its own section of the file, unless what you want is to dedicate a thread to each single line.
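As a rough sketch of such a pool, assuming a bounded queue of text lines guarded by one mutex and two condition variables: the names pool_t, pool_start, pool_submit, pool_finish and process_record, as well as the sizes POOL_SIZE and QUEUE_CAP, are made up for this illustration. It is one possible layout, not the only way to build an executor in C.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define POOL_SIZE 5    /* assumed maximum number of worker threads */
#define QUEUE_CAP 64   /* assumed number of pending lines kept in memory */
#define LINE_LEN  128

typedef struct {
    char lines[QUEUE_CAP][LINE_LEN];   /* ring buffer of pending records */
    int head, tail, count;
    int closed;                        /* 1 once no more input will arrive */
    int nworkers;                      /* threads actually started */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
    void (*handler)(const char *);     /* run by a worker for each record */
    pthread_t workers[POOL_SIZE];
} pool_t;

static void *pool_worker(void *arg)
{
    pool_t *p = arg;
    char line[LINE_LEN];
    for (;;) {
        pthread_mutex_lock(&p->lock);
        while (p->count == 0 && !p->closed)
            pthread_cond_wait(&p->not_empty, &p->lock);
        if (p->count == 0) {           /* closed and fully drained */
            pthread_mutex_unlock(&p->lock);
            return NULL;
        }
        memcpy(line, p->lines[p->head], LINE_LEN);
        p->head = (p->head + 1) % QUEUE_CAP;
        p->count--;
        pthread_cond_signal(&p->not_full);
        pthread_mutex_unlock(&p->lock);
        p->handler(line);              /* do the work outside the lock */
    }
}

/* Starts POOL_SIZE workers; returns 0 on success, -1 otherwise. */
int pool_start(pool_t *p, void (*handler)(const char *))
{
    assert(p != NULL && handler != NULL);
    p->head = p->tail = p->count = p->closed = p->nworkers = 0;
    p->handler = handler;
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->not_empty, NULL);
    pthread_cond_init(&p->not_full, NULL);
    while (p->nworkers < POOL_SIZE) {
        if (pthread_create(&p->workers[p->nworkers], NULL, pool_worker, p) != 0)
            return -1;
        p->nworkers++;
    }
    return 0;
}

/* Queues one line; blocks while the queue is full. */
void pool_submit(pool_t *p, const char *line)
{
    pthread_mutex_lock(&p->lock);
    while (p->count == QUEUE_CAP)
        pthread_cond_wait(&p->not_full, &p->lock);
    snprintf(p->lines[p->tail], LINE_LEN, "%s", line);
    p->tail = (p->tail + 1) % QUEUE_CAP;
    p->count++;
    pthread_cond_signal(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
}

/* Signals end of input, then waits for the workers to drain the queue. */
void pool_finish(pool_t *p)
{
    int i;
    pthread_mutex_lock(&p->lock);
    p->closed = 1;
    pthread_cond_broadcast(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
    for (i = 0; i < p->nworkers; i++)
        pthread_join(p->workers[i], NULL);
    pthread_mutex_destroy(&p->lock);
    pthread_cond_destroy(&p->not_empty);
    pthread_cond_destroy(&p->not_full);
}

/* Hypothetical handler: it only counts records; real parsing would go here. */
static pthread_mutex_t done_lock = PTHREAD_MUTEX_INITIALIZER;
long records_done = 0;
void process_record(const char *line)
{
    (void)line;
    pthread_mutex_lock(&done_lock);
    records_done++;
    pthread_mutex_unlock(&done_lock);
}
```

In the reading loop from the question, this would amount to calling pool_start(&pool, process_record) once, pool_submit(&pool, line2) for every line returned by fgets, and pool_finish(&pool) after the loop. Because pool_submit blocks when the queue is full, the reader never gets far ahead of the workers, and at most POOL_SIZE records are processed at a time.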

One issue I see with parallelism here is ordering. If you care about the order of the records, note that the threads will not necessarily yield their results in the order in which you read the lines. So unless you store each processed record together with its line number (in a database, for instance), you will lose the original order.
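One way to keep the order without a database, sketched here under the assumption that the records fit in memory, is to let every thread write its result into the slot that belongs to the record's line number. The names process_keep_order, order_job_t, ORDER_THREADS and the transform function are invented for the example; transform merely stands in for the real computation.

```c
#include <assert.h>
#include <pthread.h>

#define ORDER_THREADS 4   /* assumed number of worker threads */

typedef struct {
    const double *in;   /* one input value per record, in file order */
    double *out;        /* out[i] receives the result for record i */
    long n;             /* number of records */
    int tid;            /* index of this worker, 0 .. ORDER_THREADS-1 */
} order_job_t;

/* Hypothetical per-record computation standing in for the real work. */
static double transform(double x)
{
    return 2.0 * x + 1.0;
}

static void *order_worker(void *arg)
{
    order_job_t *j = arg;
    long i;

    /* Worker tid handles records tid, tid + T, tid + 2T, ...
       No two workers touch the same slot, so no lock is needed. */
    for (i = j->tid; i < j->n; i += ORDER_THREADS)
        j->out[i] = transform(j->in[i]);
    return NULL;
}

/* Fills out[0..n-1] in input order; returns 0 on success, -1 otherwise. */
int process_keep_order(const double *in, double *out, long n)
{
    pthread_t th[ORDER_THREADS];
    order_job_t jobs[ORDER_THREADS];
    int t, started = 0, rc = 0;

    assert(in != NULL && out != NULL && n >= 0);
    for (t = 0; t < ORDER_THREADS; t++) {
        jobs[t].in = in;
        jobs[t].out = out;
        jobs[t].n = n;
        jobs[t].tid = t;
        if (pthread_create(&th[t], NULL, order_worker, &jobs[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (t = 0; t < started; t++)
        pthread_join(th[t], NULL);
    return rc;
}
```

The threads may finish in any order, but once process_keep_order returns, out[i] always corresponds to in[i], so the results can be written out sequentially afterwards.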

Another issue I see coming is the output of the processed data. It requires proper synchronisation so that one thread's output does not get mixed up with another's (if the threads are supposed to print their data, of course).
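A minimal sketch of that synchronisation, assuming each record is written with several stdio calls and must come out as one uninterrupted line: a single mutex is held for the whole record. The names emit_record, emit_concurrently, emit_job_t and EMIT_THREADS, the output format and the placeholder coordinates are all invented for the illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define EMIT_THREADS 4   /* assumed number of writer threads */

static pthread_mutex_t out_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writes one record as a unit even though it takes several stdio calls. */
void emit_record(FILE *out, long line_no, double lat, double lon)
{
    pthread_mutex_lock(&out_lock);
    fprintf(out, "%ld:", line_no);
    fprintf(out, "%.4f,", lat);
    fprintf(out, "%.4f\n", lon);
    pthread_mutex_unlock(&out_lock);
}

typedef struct {
    FILE *out;
    long first;   /* first line number this thread reports */
    long count;   /* how many records it writes */
} emit_job_t;

static void *emit_worker(void *arg)
{
    emit_job_t *j = arg;
    long i;

    for (i = 0; i < j->count; i++)   /* placeholder coordinates */
        emit_record(j->out, j->first + i, 45.0, 9.0);
    return NULL;
}

/* Runs EMIT_THREADS writers on the same stream; returns 0 on success. */
int emit_concurrently(FILE *out, long per_thread)
{
    pthread_t th[EMIT_THREADS];
    emit_job_t jobs[EMIT_THREADS];
    int t, started = 0, rc = 0;

    assert(out != NULL && per_thread >= 0);
    for (t = 0; t < EMIT_THREADS; t++) {
        jobs[t].out = out;
        jobs[t].first = t * per_thread;
        jobs[t].count = per_thread;
        if (pthread_create(&th[t], NULL, emit_worker, &jobs[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (t = 0; t < started; t++)
        pthread_join(th[t], NULL);
    return rc;
}
```

The lines still appear in whatever order the threads happen to reach emit_record, but each line stays intact. Formatting the whole record into a buffer with snprintf and writing it with a single fputs call would be an alternative that keeps the critical section shorter.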

It's a little unclear to me what you expect from parallelism here, apart from reducing the overall processing time. If you want a bunch of threads to process a bunch of lines, you will end up with something similar to, and as simple as, splitting your source data file (if that can be done at all, of course). At least that way you control the order of the data as you read each line, and you can fall back on launching several long-running single-threaded processes instead of one long-running multi-threaded application. Single-threaded applications are easier to program than multi-threaded ones.
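Splitting does not have to mean physically cutting the file. As a sketch, assuming newline-terminated records and a file small enough for long offsets (a very large file on a 32-bit system would need fseeko and off_t instead), the byte range can be divided into sections whose boundaries are moved forward to the next start of a line. The helper names align_to_line_start and split_by_lines are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the offset of the first line starting at or after `offset`,
   or -1 on a seek error. */
long align_to_line_start(FILE *f, long offset)
{
    int c;

    if (offset <= 0)
        return 0;
    /* Step back one byte: if that byte is '\n', offset is already
       the start of a line and the loop stops immediately. */
    if (fseek(f, offset - 1, SEEK_SET) != 0)
        return -1;
    while ((c = getc(f)) != EOF && c != '\n')
        ;
    return ftell(f);
}

/* Fills starts[0..nchunks]; section k covers bytes starts[k] up to
   starts[k+1] (exclusive). A section is empty when both are equal.
   Returns 0 on success, -1 on error. */
int split_by_lines(FILE *f, int nchunks, long *starts)
{
    long size;
    int k;

    assert(f != NULL && starts != NULL && nchunks > 0);
    if (fseek(f, 0, SEEK_END) != 0)
        return -1;
    size = ftell(f);
    if (size < 0)
        return -1;
    for (k = 0; k < nchunks; k++) {
        starts[k] = align_to_line_start(f, (size / nchunks) * k);
        if (starts[k] < 0)
            return -1;
    }
    starts[nchunks] = size;
    return 0;
}
```

Each section can then be handed to its own thread or its own single-threaded process, which opens the file separately, seeks to starts[k] and reads lines until it reaches starts[k+1]. No line is split between two sections, and concatenating the per-section results in order of k restores the original sequence.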

Also, is the use of C mandatory, or could you use, say, Python or Cython? The biggest advantage would be sparing you the hassle of thread synchronisation.

Anyway, there is more than one way to speed up linear data processing. For instance, UNIX sed can be used to pipe a given range of lines to a processing application. Run as many sed | <processing app> pipelines as you need. Or you might just pipe split portions of your data file into a processing application written in C or Python.

Just giving headlines.
