Read from buffer C

Question

I am trying to write a simple C program that strips the HTML from a webpage and keeps only the text. So far I have come up with the code below. It uses cURL to fetch the webpage and write its contents to a file, then reads that file into a memory buffer. How do I go through the memory buffer, remove all HTML tags, and output the text to either the terminal or a file?

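#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#define WEBPAGE_URL "http://homepages.paradise.net.nz/adrianfu/index.html"
#define DESTINATION_FILE "/home/acwest/data.txt"

/* libcurl write callback: append each received chunk to the open file. */
size_t write_data(void *ptr, size_t size, size_t nmemb, void *stream)
{
    return fwrite(ptr, size, nmemb, (FILE *)stream);
}

int main(void)
{
    int in_tag = 0;   /* not used yet: intended for the tag-stripping pass */
    char *buffer;
    char c;           /* not used yet: intended for the tag-stripping pass */
    long lSize;
    size_t result;

    FILE *file = fopen(DESTINATION_FILE, "w+");
    if (file == NULL) {
        fputs("File error\n", stderr);
        exit(1);
    }

    CURL *handle = curl_easy_init();
    if (handle == NULL) {
        fputs("cURL init error\n", stderr);
        exit(4);
    }
    curl_easy_setopt(handle, CURLOPT_URL, WEBPAGE_URL); /* Using the http protocol */
    curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, write_data);
    curl_easy_setopt(handle, CURLOPT_WRITEDATA, file);
    CURLcode rc = curl_easy_perform(handle);
    curl_easy_cleanup(handle);
    if (rc != CURLE_OK) {
        fprintf(stderr, "Download error: %s\n", curl_easy_strerror(rc));
        exit(4);
    }

    // obtain file size:
    fseek(file, 0, SEEK_END);
    lSize = ftell(file);
    rewind(file);
    if (lSize < 0) {
        fputs("File size error\n", stderr);
        exit(5);
    }

    // allocate memory to contain the whole file plus a terminating '\0':
    buffer = malloc((size_t)lSize + 1);
    if (buffer == NULL) {
        fputs("Memory error\n", stderr);
        exit(2);
    }

    // copy the file into the buffer:
    result = fread(buffer, 1, (size_t)lSize, file);
    if (result != (size_t)lSize) {
        fputs("Reading error\n", stderr);
        exit(3);
    }
    buffer[lSize] = '\0';

    (void)in_tag; (void)c;   /* silence unused-variable warnings for now */
    free(buffer);
    fclose(file);
    return 0;
}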

Answer 1

cURL only fetches the page; it will not help you parse HTML, and parsing HTML properly is a complicated task. You could read the language specification and write your own parser. Alternatively, there is an open source C++ project at http://www.mbayer.de/html2text/ and a Python script at https://github.com/aaronsw/html2text . You can also install html2text and use it from the command line, or execute it from your C code.
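
If a full parser is more than you need, a rough first pass is to walk the buffer once and drop everything between a '<' and the next '>'. The sketch below is only that: `strip_tags` is a hypothetical helper name, not part of cURL or html2text, and it assumes the buffer has room for one extra byte so the result can be NUL-terminated. It does not handle comments, the contents of script or style elements, character entities such as `&amp;`, or a '>' inside an attribute value, so treat it as a starting point rather than a real HTML-to-text converter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from any library): removes every "<...>" run
 * from buf in place and returns the length of the remaining text.
 * Assumption: buf has space for at least len + 1 bytes, because a
 * terminating '\0' is written after the kept text. */
size_t strip_tags(char *buf, size_t len)
{
    assert(buf != NULL);

    int in_tag = 0;   /* 1 while we are between '<' and '>' */
    size_t out = 0;   /* next write position for kept characters */

    for (size_t i = 0; i < len; i++) {
        char c = buf[i];
        if (c == '<') {
            in_tag = 1;               /* tag starts: stop copying */
        } else if (c == '>' && in_tag) {
            in_tag = 0;               /* tag ends: resume copying */
        } else if (!in_tag) {
            buf[out++] = c;           /* ordinary text: keep it */
        }
    }
    buf[out] = '\0';
    return out;
}
```

With a buffer read the way the question does it, you would call something like `size_t n = strip_tags(buffer, result);` and then `fwrite(buffer, 1, n, stdout);` to print the text, provided the buffer was allocated with that one extra byte.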

solution1
0 ACCEPTED 2012-02-25 11:08:46
