read and write to file assembly

I have an inputfile.txt which looks like this: 3 4 2 0 8 1 5 3

I'm trying to write inside an outputfile.txt each character of inputfile incremented by 1. So inside outputfile.txt I should see 4 5 3 1 9 2 6 4 . I have tried to write this piece of code but I have several doubts.

.section .data
  buff_size: .long 18

.section .bss
  .lcomm buff, 18

.section  .text              # declaring our .text segment
  .globl _start              # telling where program execution should start

_start:

    popl %eax       # Get the number of arguments
    popl %ebx       # Get the program name
    popl %ebx       # Get the first actual argument - file to read

    # open the file
    movl $5, %eax       # open 
    movl $0, %ecx       # read-only mode
    int $0x80        

    # read the file

    movl $0, %esi
    movl %eax, %ebx     # file_descriptor

    analyzecharacter:   #here I want to read a single character
        movl $3, %eax       
        movl $buff, %edi    
        leal (%esi,%edi,1), %ecx    
        movl $1, %edx
        int $0x80
        add $1, %esi #this point is not clear to me, what I'd like to do is to increment the index of the buffer in order to be positioned on the next cell of buffer array, I've added 1 but I think is not correct
        cmp $8, %esi  # if I've read all 8 characters then I'll exit
        je exit

    openoutputfile:
    popl %ebx       # Get the second actual argument - file to write
    movl $5, %eax       # open
    movl $2, %ecx       # read-only mode
    int $0x80       

    writeinoutputfile:
    #increment by 1 and write the character to STDOUT
    movl %eax, %ebx     # file_descriptor
    movl $4, %eax       
    leal (%esi,%edi,1), %ecx
    add $1, %ecx #increment by 1        
    movl $1, %edx   
    int $0x80
    jmp analyzecharacter        

    exit:
    movl $1, %eax       
    movl $0, %ebx       
    int $0x80   

I have 2 problems/doubts:

1- my first doubt is about this instruction: add $1, %esi . Is this the right way to move through buffer array?

2- The second doubt is: when I analyze each character, should I always go through the openoutputfile label? I think that this way I'm reopening the file and the previous content is overwritten. Indeed, if I run the program I see only a single character \00 (a garbage character, caused by the value of %esi in this instruction, I guess: leal (%esi,%edi,1), %ecx ).

I hope my problems are clear, I'm pretty new to assembly and I've spent several hours on this.

FYI: 
I'm using the GAS assembler with AT&T syntax.
Moreover I'm on Ubuntu 64 bit and Intel CPU. 

So, how would I write the code? I'm so used to Intel syntax that I can't write AT&T source from memory without bugs (and I'm too lazy to actually build and debug it), so I will try to avoid writing instructions and just describe the process, leaving you to fill in the instructions.

So let's decide you want to do it char by char, version 1 of my source:

start:
  ; verify the command line has enough parameters, if not jump to exitToOs
  ; open both input and output files at the start of the code
processingLoop:
  ; read single char
  ; if no char was read (EOF?), jmp finishProcessing
  ; process it
  ; write it
  jmp processingLoop
finishProcessing:
  ; close both input and output files
exitToOs:
  ; exit back to OS
  • now "run" it in your mind, verify all the major branch points make sense and will handle correctly for all major corner cases.
  • make sure you understand how the code will work, where it will loop, and where and why it will break out of loop.
  • make sure there's no infinite loop, or leaking of resources

After going through my checklist, there's one subtle problem with this design: it doesn't rigorously check for file system errors, such as failing to open either of the files or failing to write the character (but your source doesn't check either). Otherwise I think it should work well.
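The missing error check could look roughly like this. It is a minimal sketch, assuming 32-bit Linux and the int $0x80 interface, where a failed system call returns a negative errno value in %eax (not -1). The labels openFailed, usageError and quit, and the exit status 255, are made up for the illustration.

```asm
# Assumed build on 64-bit Ubuntu: as --32 chk.s -o chk.o && ld -m elf_i386 chk.o -o chk
.section .text
.globl _start
_start:
    popl %eax               # argc
    cmpl $2, %eax           # expect exactly: ./chk filename
    jne usageError
    popl %ebx               # argv[0], discarded
    popl %ebx               # argv[1] = path of the file to open
    movl $5, %eax           # sys_open
    xorl %ecx, %ecx         # O_RDONLY
    int $0x80
    testl %eax, %eax
    js openFailed           # sign bit set: %eax holds -errno
    movl %eax, %ebx         # success: %eax is the file descriptor
    movl $6, %eax           # sys_close
    int $0x80
    xorl %ebx, %ebx         # exit status 0
    jmp quit
openFailed:
    negl %eax               # turn -errno into errno
    movl %eax, %ebx         # and report it as the exit status
    jmp quit
usageError:
    movl $255, %ebx         # arbitrary status for "wrong arguments"
quit:
    movl $1, %eax           # sys_exit
    int $0x80
```

Run against a file that does not exist, `echo $?` should then show 2 (ENOENT); with a readable file it shows 0.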

So let's extend it in version 2 to be closer to real ASM instructions (I have not assembled or debugged this, so check it yourself before trusting it):

# 32-bit code; on 64-bit Ubuntu assemble with: as --32 prog.s -o prog.o
# and link with: ld -m elf_i386 prog.o -o prog
.section .text
.globl _start

_start:
  # verify the command line has enough parameters, if not jump to exitToOs
    popl %eax       # Get the number of arguments
    cmpl $3, %eax   # "./binary fileinput fileoutput" gives 3 here
    jne exitToOs

  # open both input and output files at the start of the code
    popl %ebx       # Get the program name (discarded)

  # open input file first
    movl $5, %eax       # open
    popl %ebx       # Get the first actual argument - file to read
    movl $0, %ecx       # read-only mode
    int $0x80
    testl %eax, %eax    # the raw syscall returns a negative errno on failure, not -1
    js exitToOs
    movl %eax, varInputHandle   # store input file handle to memory

  # open output file, make it writable, create if not exists
    movl $5, %eax       # open
    popl %ebx       # Get the second actual argument - file to write
    # the next two constants are octal (leading 0)
    movl $0101, %ecx    # O_CREAT (0100) + O_WRONLY (01); add O_TRUNC (01000) to empty an existing file
    movl $0666, %edx    # permissions for out file as rw-rw-rw-
    int $0x80
    testl %eax, %eax    # valid file handle?
    js exitToOs
    movl %eax, varOutputHandle  # store output file handle to memory

processingLoop:

  # read single char to varBuffer
    movl $3, %eax
    movl varInputHandle, %ebx
    movl $varBuffer, %ecx
    movl $1, %edx
    int $0x80

  # if no char was read (EOF), jmp finishProcessing
    cmpl $0, %eax
    je finishProcessing  # looks like total success, finish cleanly
    jl exitToOs          # negative = read error, exit code stays 1

  # process it: only digits are changed, so spaces and the newline pass through
    movb varBuffer, %al
    cmpb $0x30, %al      # below '0'?
    jb writeIt
    cmpb $0x39, %al      # above '9'?
    ja writeIt
    incb varBuffer       # you wanted this IIRC? ('9' becomes ':', no wrap-around)

writeIt:
  # write it
    movl $4, %eax
    movl varOutputHandle, %ebx  # file_descriptor
    movl $varBuffer, %ecx  # BTW, still set from char read, so just for readability
    movl $1, %edx          # this one is still set from char read too
    int $0x80

  # done, go for the next char
    jmp processingLoop

finishProcessing:
    movl $0, varExitCode  # everything went OK, set exit code to 0

exitToOs:
  # close both input and output files, if any of them is opened
    movl varOutputHandle, %ebx  # file_descriptor
    call closeFile
    movl varInputHandle, %ebx
    call closeFile

  # exit back to OS
    movl $1, %eax
    movl varExitCode, %ebx
    int $0x80

closeFile:
    cmpl $-1, %ebx
    je closeFileDone     # file not opened, just ret
    movl $6, %eax        # sys_close
    int $0x80
    # returns 0 when OK, or a negative errno in case of error, but no handling here
closeFileDone:
    ret

.section .data
varExitCode:     .long 1   # default value for exit code is "1" (some error)
varInputHandle:  .long -1  # default = invalid handle
varOutputHandle: .long -1  # default = invalid handle
varBuffer:       .byte 0   # (single byte buffer)

Whoa, I actually wrote it fully? (Of course it still needs a run through the assembler and a debugger.)

But I mean, the comments from version 1 were already so detailed that each one required only a handful of ASM instructions, so it was not that difficult (although now I see I submitted the first answer 53 min ago, so this was about 1 h of work for me, including googling and a few other errands).

And I absolutely don't get how some human may want to use AT&T syntax, which is so ridiculously verbose. I can easily understand why the GCC is using it, for compilers this is perfectly fine.

But maybe you should check NASM, which is "human" oriented (as little syntactic sugar as possible, so you can focus on the instructions). The major problem (or advantage, in my opinion) with NASM is Intel syntax, e.g. MOV eax,ebx copies the value of ebx into eax , which is Intel's fault: they took the LD syntax from other microprocessor manufacturers, ignored the LD = load meaning, and changed it to MOV = move so as not to blatantly copy the instruction set.

Then again, I have absolutely no idea why ADD $1,%eax is the correct way in AT&T (instead of eax,1 order), and I don't even want to know, but it doesn't make any sense to me (the reversed MOV makes at least some sense due to LD origins of Intel's MOV syntax).
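To put the two operand orders side by side: GAS can switch syntax inside one file with the `.intel_syntax noprefix` and `.att_syntax prefix` directives. The sketch below is only an illustration under that assumption (32-bit Linux, int $0x80 interface); the values 41 and 1 are arbitrary.

```asm
# Assumed build on 64-bit Ubuntu: as --32 order.s -o order.o && ld -m elf_i386 order.o -o order
.section .text
.globl _start
_start:
    # AT&T: source first, destination last, $ on immediates, % on registers
    movl $41, %ebx
    addl $1, %ebx           # ebx = ebx + 1

.intel_syntax noprefix
    # Intel: destination first, no $ or % decorations
    mov eax, 1              # sys_exit; the exit status is taken from ebx
    int 0x80
.att_syntax prefix          # switch back for anything that might follow
```

Both halves assemble to ordinary x86 instructions; the program exits with status 42, which `echo $?` will show.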

OTOH I can relate to cmp $number,%reg since I started to use "yoda" formatting in C++ to avoid variable value changes by accident in if (compare: if (0 = variable) vs if (variable = 0) , both having typo = instead of wanted == .. the "yoda" one will not compile even with warnings OFF).

But ... oh.. this is my last AT&T ASM answer for this week, it annoys hell out of me. (I know this is personal preference, but all those additional $ and % annoys me just as much, as the reversed order).


Please, I spent a serious amount of time writing this. Try to spend serious time studying it and trying to understand it. If you are confused, ask in the comments, but it would be a pitiful waste of our time if you completely missed the point and did not learn anything useful from this. :) So keep on.


Final note: search hard for a debugger and find one that suits you well (probably a visual one, like the old "TD" from Borland in the DOS days, would be super nice for a newcomer). To improve quickly it is absolutely essential that you can step over the code instruction by instruction and watch how the registers and memory contents change. Really, if you were able to debug your own code, you would soon realize you are reading the second character from the wrong file handle in %ebx ... (at least I hope so).

Just to clear up 1) early: for stepping an index, add $1, %esi does the same job as inc %esi (the one difference is that inc leaves the carry flag untouched).

While you are learning assembler, I would go for the inc variant, so you don't forget it exists and get used to it. Back in the 286-586 days it was also the shorter encoding; today compilers often emit add instead, because inc updates only part of the flags (it leaves the carry flag alone), which made it slightly more expensive on some microarchitectures. You shouldn't worry about this while learning the basics: aim for "human" readability of the source, and don't attempt any performance tricks yet.

Is it the right way?

Well, you should first decide whether you want to parse it per character (or rather per byte , as a character is nowadays often a UTF-8 code point, which takes 1 to 4 bytes) OR to process it with buffers.

Your mix of the two makes it easy to introduce additional mistakes in the code.
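For comparison, a pure buffer-based variant might look like the sketch below. To stay short it makes some assumptions that are not in the question: it reads from stdin and writes to stdout instead of opening files, the 64-byte buffer size is arbitrary, only the digits '0' to '9' are incremented (so spaces and the newline pass through, and '9' becomes ':'), and a read error is treated the same as end of file. Here %esi really is an index into the buffer, stepped by one per byte.

```asm
# Assumed build on 64-bit Ubuntu: as --32 inc.s -o inc.o && ld -m elf_i386 inc.o -o inc
.section .bss
    .lcomm buff, 64             # buffer for one chunk of input

.section .text
.globl _start
_start:
readChunk:
    movl $3, %eax               # sys_read
    movl $0, %ebx               # fd 0 = stdin
    movl $buff, %ecx
    movl $64, %edx
    int $0x80
    testl %eax, %eax
    jle done                    # 0 = EOF, negative = error: stop either way
    movl %eax, %edx             # edx = number of bytes actually read
    xorl %esi, %esi             # esi = index of the current byte
nextByte:
    movb buff(%esi), %al
    cmpb $0x30, %al             # below '0': leave the byte alone
    jb skip
    cmpb $0x39, %al             # above '9': leave the byte alone
    ja skip
    incb buff(%esi)             # increment the byte in memory
skip:
    incl %esi                   # step to the next cell of the buffer
    cmpl %edx, %esi
    jb nextByte                 # loop while index < bytes read
    movl $4, %eax               # sys_write
    movl $1, %ebx               # fd 1 = stdout
    movl $buff, %ecx
    int $0x80                   # edx still holds the byte count
    jmp readChunk
done:
    movl $1, %eax               # sys_exit
    xorl %ebx, %ebx
    int $0x80
```

Run as `./inc < inputfile.txt > outputfile.txt`; for the input 3 4 2 0 8 1 5 3 the output is 4 5 3 1 9 2 6 4.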

From a quick look I see:

  • you read only a single byte per syscall, yet you store it at a new place in buffer+counter (why? Just use a single-byte buffer if you work per byte)
  • when the counter is 8, you exit (not processing the 8th read byte at all).
  • in the write part, leal (%esi,%edi,1), %ecx followed by add $1, %ecx increments the address, not the byte stored there, and %esi has already moved past the byte you just read, so you write a byte that was never filled in (that is your \00).
  • you lose your input file descriptor forever after opening the output file the first time, by popl %ebx (leaking file handles is very bad)
  • then the second char is read from the output file (reusing the file handle from the write)
  • then you popl %ebx again, but there's no third parameter on the command line, i.e. you fetch the NULL that terminates the argument list
  • indeed you reopen the output file each time, so unless it's in append mode, it will overwrite the content.

Those are probably all the major blunders, but there are so many of them that I would suggest you start over from scratch.

I will try to write a quick version of my own in a separate answer (as this one is getting a bit long), to show you how I would do it. But first please try (hard) to find all the points I highlighted above, and understand how your code works. If you fully understand what your instructions do, and why they cause the errors I described, you will have a much easier time designing your next code, and debugging it. So the more of the points you really find and fully understand, the better for you.


"BTW notes":

I never did linux asm programming (I'm now itching to do something after reading about your effort), but from some wiki about system calls I read:

All registers are preserved during the syscall.

Except return value in eax of course.

Keep this in mind, it may save you some hassle with repeating register setup before call, if you group syscalls appropriately.
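As an illustration of that grouping, here is a small sketch that copies stdin to stdout one byte at a time and sets %ecx and %edx only once, before the loop. It assumes the 32-bit int $0x80 interface, where only %eax is changed by the call; the name oneByte is made up for the example.

```asm
# Assumed build on 64-bit Ubuntu: as --32 copy.s -o copy.o && ld -m elf_i386 copy.o -o copy
.section .bss
    .lcomm oneByte, 1           # single-byte buffer

.section .text
.globl _start
_start:
    movl $oneByte, %ecx         # buffer address, set once for read and write
    movl $1, %edx               # byte count, set once for read and write
copyLoop:
    movl $3, %eax               # sys_read
    xorl %ebx, %ebx             # fd 0 = stdin
    int $0x80
    cmpl $1, %eax
    jne finished                # 0 = EOF, negative = error
    movl $4, %eax               # sys_write
    movl $1, %ebx               # fd 1 = stdout
    int $0x80                   # ecx and edx are unchanged since the setup
    jmp copyLoop
finished:
    movl $1, %eax               # sys_exit
    xorl %ebx, %ebx
    int $0x80
```

Only %eax and %ebx are reloaded inside the loop, because they are the two registers whose required values differ between the read and the write.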

 