dw_dev_training | Strip tab characters from multiple text files and replace them with spaces (or something else)

I switch between Gedit, Notepad++, and vim fairly often depending on what I'm doing and whose computer I'm on. Sometimes I end up with tab characters where I really wanted four spaces, mainly when I'm using vim, since I haven't figured out how to stop it from doing this. Gedit and Notepad++ have settings to use spaces instead of tabs, so there's no issue there.

Either I don't notice the tab characters until after I've put lots of them in the file I'm editing, or I'm editing a file from someone else whose editor uses tab characters for indentation. I know it's not a big deal to some people, but tab indentation mixed with space indentation is a huge pet peeve of mine.

Thus, a Perl script was born:

#!/usr/bin/perl -w

use File::Copy;

# This script replaces tab characters in a file with four spaces, or whatever else you want

$num_args = $#ARGV + 1;
$num_warnings = 0;
$tab_replace = "    ";  # Change this to whatever you want in place of tabs

if ( $num_args == 0 ) {
    print "usage: strip-tabs.pl file1 [file2, file3...]\n";
    exit;
}

# Have files to parse...
for my $f ( 0 .. ($num_args - 1) ) {
    # Three-argument open keeps odd characters in a file name from being
    # interpreted as part of the open mode
    if ( open(INPUTFILE, '<', $ARGV[$f]) ) {
        if ( open(OUTPUTFILE, '>', "$ARGV[$f]~") ) {
            # Start parsing the file
            while ( my $line = <INPUTFILE> ) {
                $line =~ s/\t/$tab_replace/g;
                print OUTPUTFILE $line;
            }
            # Copy over original file here
            close(OUTPUTFILE);
            close(INPUTFILE);
            if ( !move("$ARGV[$f]~", $ARGV[$f]) ) {
                print "Could not write output to file $ARGV[$f]: $!\n";
                $num_warnings += 1;
            }
        }
        else {
            close(INPUTFILE);
            print "Could not create output file for $ARGV[$f]: $!\n";
            $num_warnings += 1;
        }
    }
    else {
        print "Could not open $ARGV[$f] for reading: $!\n";
        $num_warnings += 1;
    }
}
# The trailing newline keeps die from appending "at ... line N." to the message
die "$num_warnings warnings encountered during file operation.\n" unless $num_warnings == 0;


Feel free to gank away if you find it useful!


Yes, I'm always open for suggestions! :)

OK, here are the two big thoughts I had.

$num_args = $#ARGV + 1;

An array in scalar context evaluates to the number of its elements, so this could be $num_args = @ARGV; instead.

I like to separate $#foo and scalar @foo, and use the former only in contexts where it means "index of the last entry" (for example, in a for loop iterating over the indices of an array) and the latter when I want a number of elements.

(Also, $#foo is sensitive to setting $[, but you shouldn't mess with that variable anyway.)
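To make the distinction concrete, here is a small sketch of the two forms side by side. The array contents are made-up file names used only for illustration.

```perl
use strict;
use warnings;

my @files = ('a.txt', 'b.txt', 'c.txt');   # hypothetical file names

my $count = @files;     # scalar context: number of elements (3)
my $last  = $#files;    # index of the last element (2)

# scalar() forces scalar context where a list would otherwise be expected
print "count: ", scalar(@files), "\n";

# $#files reads naturally when iterating over indices
for my $i ( 0 .. $#files ) {
    print "$i: $files[$i]\n";
}

# An empty array is false in boolean context, so a usage check
# needs no explicit count at all
my @none = ();
print "no files given\n" unless @none;
```

In the script above, that last form would let the usage check read `unless (@ARGV)` without needing $num_args at all.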

The other one is that "iterating over the files in @ARGV" is such a common use case that Perl has a shortcut for this.

If you read from the empty filehandle (as in while (<>) with nothing in between the angle brackets), you'll get a line at a time from all of the files in succession. Perl will automatically handle opening them and closing them for you. And if you didn't supply any file names, Perl will read from standard input. (This is a bit like Unix tools such as gzip or grep which will also work on standard input if there are no file name arguments.) See http://perldoc.perl.org/perlop.html#I/O-Operators for more on this. (That also mentions that you can find out which file you're currently on by examining $ARGV, which the magic will set for you appropriately on each new file.)

And if you don't assign <> to anything in the while loop, it'll automatically assign to $_ - which is the default thing that s/// operates on, which can be handy. It's also the default operand for lots of other operations.

So if you were just reading from the files, you could replace the whole "# Have files to parse..." loop with:

while (<>) {
  s/\t/$tab_replace/g;  # /g replaces every tab on the line, not just the first
}

That would just be missing the printing of the changed line and the editing behaviour.

In the one-liner I suggested, these are supplied by the -p and -i command-line switches, respectively; see http://perldoc.perl.org/perlrun.html for more information on those.

You can also turn on -i inside the program by assigning to the magic variable $^I.
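Putting those pieces together, a minimal sketch of the script rewritten around <> and $^I might look like the following. The helper name strip_tabs_in_place and the choice of "~" as the backup suffix are assumptions made for illustration, not something from the original script.

```perl
use strict;
use warnings;

# Replace every tab in the named files, editing each file in place.
# strip_tabs_in_place is a hypothetical helper name.
sub strip_tabs_in_place {
    my ($tab_replace, @files) = @_;

    # local restores both variables when the sub returns
    local $^I   = '~';      # like -i~ : keep the original as "file~"
    local @ARGV = @files;   # <> reads from whatever is in @ARGV

    while (<>) {
        s/\t/$tab_replace/g;
        print;              # with $^I set, print writes to the file being rewritten
    }
}

# Four spaces, as in the original script
strip_tabs_in_place("    ", @ARGV) if @ARGV;
```

On the command line, the same behaviour should be available with something like perl -p -i~ -e 's/\t/    /g' file1 file2. One difference from the original script: this version leaves a "file~" backup behind rather than moving the temporary file over the original. Setting $^I to an empty string asks Perl to skip the backup, on systems that support it.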

(My apologies for getting to this late.)

I haven't had the chance to revisit this yet, but I wanted to thank you for taking the time to reply. Clearly I have a lot to learn still. It seems like it would be worth my while to also learn sed.

Clearly I have a lot to learn still.

Well, there's a lot to learn, but not everyone is expected to know everything :)

Plus, There's More Than One Way To Do It (TMTOWTDI, tim-toady) in Perl.

It seems like it would be worth my while to also learn sed.

It depends, but having more tools in one's personal toolbox is often useful.

If you do learn sed, getting to know the basics of awk may also be useful. (And grep, if you don't know it already, as well as find and xargs, which are useful in connection with it, though less necessary if you have GNU grep.)
