Linguistics Miscellany

Friday, July 4, 2014

Blog is Moving Location

While I have enjoyed using the Blogspot/Blogger platform, I have for a long time wanted a platform that would give me better control of my content and not make me worry about storage. The best option for me is a GitHub-based blog, so I have switched to using Octopress. You can now find me at http://garfieldnate.github.io. Hope to see you there!

Saturday, January 11, 2014

List Assignment in Scalar Context

(Cross-posted on blogs.perl.org)

This week I received some special help on SO in understanding how the goatse operator works. I was very thankful for everyone's help. These two articles were also very helpful and I recommend reading them.
Part of my confusion over the goatse operator was not knowing the difference between the list and scalar assignment operators, both of which are written '='. Further confusing is the fact that either one can be used in scalar or list context, so you can have list assignment in scalar context or scalar assignment in list context.
The type of assignment is determined by what is being assigned to. As ikegami says, assignment to an aggregate is a list assignment, aggregate meaning an array, a hash, a parenthetical expression, or a my/our/local variable declared with parens.
The context of an assignment operator will really only matter when you are storing or checking the return value. You can store the value of an assignment operator by using another assignment operator: blah1 = blah2 = blah3, where blah1 receives the value returned by assigning blah3 to blah2. The value gets checked in other contexts too, like inside a control structure condition: if(my $line = <>), etc. Here are examples for each combination of context and assignment operator:

# scalar assignment in scalar context
$thing = ($foo = 'bar'); # assignment returns $foo as lvalue
say $thing; # bar
# scalar assignment in list context
($thing) = ($foo = 'bar'); #assignment returns ($foo), $foo is lvalue
say $thing; # bar
# list assignment in scalar context;
# assignment returns number of items in RHS of list assignment
$thing = (($foo, $bar) = qw(foo bar));
say $thing; # 2
$thing = (() = qw(foo bar));
say $thing; # 2
$thing = () = qw(foo bar);
say $thing; # 2
# list assignment in list context
# assignment returns LHS list as lvalues
($thing) = (($foo, $bar) = qw(foo bar));
say $thing; # foo
($thing) = (() = qw(foo bar));
say $thing; # nothing ($thing is undef)

That third one is of course the goatse operator. By the way for the record I totally think it looks more like a Saturn, though my wife disagrees and everyone seems to call it goatse. Anyway, though generally list assignment in scalar context is the rarest one, there are other occurrences. Ysth mentions the each operator inside of a while loop:

while(my ($key, $value) = each %hash)

The aggregate on the left makes this list assignment, and while makes it scalar context. Once the hash is out of keys, each returns () so that the assignment operator returns 0, finishing the while loop.
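To see those return values directly, here is a minimal sketch. The hash contents and variable names are made up for illustration; it just captures what the list assignment returns on each call to each, plus the usual goatse trick of counting regex matches.

```perl
use strict;
use warnings;

my %hash = (one => 1);

# List assignment in scalar context returns the number of RHS elements.
my $first  = (my ($k1, $v1) = each %hash);   # each returned a key/value pair
my $second = (my ($k2, $v2) = each %hash);   # each returned the empty list

# The same mechanism counts regex matches without storing them.
my $matches = () = 'banana' =~ /a/g;

print "first: $first, second: $second, matches: $matches\n";
```

The second value is what ends the while loop: it is 0, which is false.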
I was pretty happy to finally understand this area that I never quite realized I didn't understand (though someone might still point out I don't know what I'm talking about, as seems to be common with this subject). Today, though, I thought of one more usage of list assignment in scalar context that is probably used erroneously fairly often: quick and dirty parameter checking:

my ($input, $output) = @ARGV or die 'Usage: script <input> <output>';

I always thought that the assignment would return $output, probably by analogy with comma expression assignment to a scalar ($stuff = qw(foo bar)). However, if the user fails to provide a second parameter, the error would not be caught. This assignment will return the number of elements in @ARGV, which could be 1 instead of the required 2. So this use is only correct when unpacking @_ or @ARGV and expecting exactly one variable:

my ($input) = @ARGV or die 'Usage: script <input>';

This is probably obvious to Perl old-timers, but to me it was a revelation. And it doesn't look like I'm the only one, either. Grepping CPAN for assignment of an array to a parenthetical with 'or' after it turns up many mis-uses here.
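One way to check for exactly two parameters is to test the count explicitly before unpacking. This is only a sketch: unpack_args is a hypothetical helper I made up so the check can be exercised without touching the real @ARGV.

```perl
use strict;
use warnings;

# Hypothetical helper: dies unless exactly two arguments were supplied.
sub unpack_args {
    my @args = @_;
    @args == 2 or die "Usage: script <input> <output>\n";
    my ($input, $output) = @args;
    return ($input, $output);
}

# What the "or die" form actually tests: the number of RHS elements.
my $count = (my ($in, $out) = ('only_one'));
print "assignment returned $count\n";

my @ok   = unpack_args('in.txt', 'out.txt');
my $died = !eval { unpack_args('in.txt'); 1 };
print "two args accepted: @ok\n";
print "one arg rejected\n" if $died;
```

In a real script the same test would read @ARGV == 2 or die ..., followed by the plain list assignment.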

Tuesday, August 27, 2013

Packaging XML::LibXML with PAR Packer on Windows

PAR Packer is an excellent utility for delivering your Perl scripts as standalone executables. A standalone executable is highly desired in, for example, a corporate environment where everyone needs a program you wrote but you can't expect anyone to learn how to run Perl programs.

A recent requirement at $work was for a standalone executable. Originally, I was supposed to let my coworker work his magic (and his ActiveState PerlPacker license), but the client required an all-open-source solution. Thus I turned to PAR Packer and its pp utility.

So far, the most difficult aspect of using pp is that it doesn't detect all dependencies. It requires the user to explicitly list many required DLL's. I needed to list DLL's for two libraries: Wx and XML::LibXML.

Creating Wx apps with pp is a solved problem: wxpar, bundled with Wx::Perl::Packager, is a pp wrapper and adds all of the required Wx DLL's.

Getting it to work with XML::LibXML required some trial and error. I would create the executable, move it to another computer without Perl or C, run it from the command line (clicking the file hid certain error messages), and write down the name of the library that was missing. It turned out that three DLL's needed to be explicitly added: libxml2-2__.dll, libiconv-2__.dll and libz__.dll. On my computer these were located in C:\strawberry\c\bin. So, the final command I used to build my application was thus:

wxpar -o MyApp.exe -I lib -l C:/strawberry/c/bin/libxml2-2__.dll -l C:/strawberry/c/bin/libiconv-2__.dll -l C:/strawberry/c/bin/libz__.dll MyApp.pl

Is there a simpler way to do this? What's with all the underscores? Comments and questions welcome below.

Monday, April 15, 2013

The Extended Euclidean Algorithm in Perl

This week I learned about the extended Euclidean algorithm for finding a linear combination of two numbers that yields their GCD. For example, the GCD of 213 and 171 is 3, and -4*213 + 5*171 = 3. This algorithm is important in the RSA encryption scheme.

I had quite a difficult time getting myself to fully understand how it works. I jumped between Wikipedia, my data structures textbook (don't buy it), a YouTube video, and this excellent number theory class lecture. The lecture is the best, though I think there may be a typographical error in the recursive formula.

The basic idea uses recursion with an easy base step. We call Euclid(a,b) with a ≥ b:

The base case is when b is 0. The GCD of x and 0 is always x, and the coefficients that produce it are 1 and 0 (the second can be anything): 1*x + 0*0 = x. So the base case returns (1,0).
Any other step starts by recursively calling Euclid(b, a mod b). We know that the GCD of a and b is the same as the GCD of b and a mod b (lemma 12 in the lecture). This recursive call is guaranteed to eventually get to the base case of b = 0.
After finding the coefficients for producing the GCD from b and a mod b, we can calculate the ones for producing the GCD from a and b, because a mod b can be put in terms of a and b (see the code comments for the formulas).

To really help myself understand the whole thing, I wrote a Perl script to illustrate it. I put in lots of comments as I worked my way through it.

use strict;
use warnings;
use 5.010;
#start with a >= b
my @nums = sort {$b <=> $a} @ARGV;

gcd(@nums);

#input: two numbers (a,b), a >= b >= 0
#output: the coefficients which yield their GCD
sub gcd {
 my ($a, $b) = @_;
 
 #base case; the GCD of x and 0 is always x,
 #and the coefficients will always be 1 and 0 (or anything) because
 #1*x + 0*0 = x
 if($b == 0){
  say "GCD is $a";
  say "(a,b) = ($a,$b), coefficients = (1,0)";
  say "1x$a + 0x$b = $a";
  return (1, 0);
 }
 
 #otherwise, we evaluate u and v for k = ub + vr, where r is a mod b
 #gcd(b, a%b) gives the same value
 my $remainder = $a % $b;
 my ($u, $v) = gcd($b, $remainder);
 #now we can find k in terms of a and b because we know r in terms a and b
 #r = a - bq, where q = the whole part of a/b
 #k = ub + vr = ub + v(a - bq) = va + b(u-qv)
 #so the coefficient on a is v, and the coefficient on b is u-qv
 my $x = $v;
 my $q = int($a / $b);
 my $y = $u - $q*$x;
 say "(a,b) = ($a,$b), coefficients are ($x,$y)";
 say "${x}x$a + ${y}x$b = " . ($x*$a + $y*$b);
 return ($x, $y);
}
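
If you want the result as a value rather than a printed trace, the same recursion can be written more compactly. This is a sketch under my own naming (egcd, $m and $n are not from the lecture); it returns the GCD along with the two coefficients and assumes non-negative integers.

```perl
use strict;
use warnings;

# Returns ($g, $x, $y) such that $x*$m + $y*$n == $g == gcd($m, $n).
# Assumes non-negative integer arguments.
sub egcd {
    my ($m, $n) = @_;
    return ($m, 1, 0) if $n == 0;           # base case: 1*m + 0*0 = m
    my ($g, $u, $v) = egcd($n, $m % $n);    # g = u*n + v*(m mod n)
    my $q = int($m / $n);                   # m mod n = m - q*n
    return ($g, $v, $u - $q * $v);          # g = v*m + (u - q*v)*n
}

my ($g, $x, $y) = egcd(213, 171);
print "gcd = $g, coefficients = ($x, $y)\n";   # gcd = 3, coefficients = (-4, 5)
```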

Feel free to leave a comment if you think that something could be stated more clearly. I hope it helps anyone else trying to learn how the extended Euclidean algorithm works.

Sunday, April 7, 2013

Running Perl with Sublime Text 2

I've been having fun trying out Sublime Text. It's pretty, fast, and extremely extensible.

The first thing that I wanted was to be able to work well with Perl. I installed Package Control, followed by SublimeLinter, which has the perlcritic command built in. Making this useful requires a little finagling; perlcritic is by no means a quick program (being a really thorough linter for a language which is complex to parse), and the defaults for SublimeLinter cause it to run over and over again as you type. To fix this, I edited Packages/SublimeLinter/SublimeLinter.sublime-settings and changed the "sublimelinter" setting to false. Now, in order to lint the current file, I have to press ctrl+alt+l. (Update: I don't recommend this for Sublime Text 2 because of speed problems. See this issue on GitHub. ST3 should be fine, though.)

Next, I wanted to be able to run my Perl scripts. Sublime has the ctrl+b shortcut for running a build for the current file. What the build actually does is specified in either a build file or the project file. To create a new build file for perl, go to Tools->Build System -> New Build System. The build file I've seen on different sites for Perl looks like this:

{ "cmd": ["perl", "$file"], "file_regex": ".* at (.*) line ([0-9]*)", "selector": "source.perl" }

Save this as perl.sublime-build. With this, whenever you are working on a Perl file and hit ctrl+b, the command "perl your_file.pl" will be run. This, however, was not good enough for me. Most of the time I am working on tests for a Perl module, so I have to run perl -Ilib t/my_test_file.t. I also want to be able to run individual tests as well as prove using shortcuts.

To do this, we need to turn the module directory into a Sublime Text project. This is pretty simple. First, open the module directory in Sublime Text. Select Project->Save Project As, then choose the name of the project and save it in the top directory of the module. Paste the following simple contents into the project file:

{ "folders": [ { "path": "." } ] }

All this does is add the entire directory to the project. Next, we edit the Perl build file to reference the root of the project so we can add the top-level lib directory to our include path:

{ "cmd": ["perl", "-Ilib", "$file"], "working_dir": "$project_path", "file_regex": ".* at (.*) line ([0-9]*)", "selector": "source.perl" }

Great! Now we can run Perl on tests contained in module directories. This still works fine for standalone scripts, too.

Now I'd like to run my whole test suite using prove. By default, ctrl+shift+b runs a build variant with the name "Run", so we'll just make a prove variant with that name. I'd much rather give it a more descriptive name, but the Sublime shortcut requires this name. You can change the shortcut, but then you wouldn't be able to use the shortcut for other builds (other languages). It's all up to you. Here is the final build file:

{
    "cmd": ["perl", "-Ilib", "$file"],
    "working_dir": "$project_path",
    "file_regex": ".* at (.*) line ([0-9]*)",
    "selector": "source.perl",
    "variants": [
        {
            "cmd": ["prove", "-vlr", "--merge"],
            "working_dir": "$project_path",
            "name": "Run",
            "windows": {
                "cmd": ["prove.bat", "-vlr", "--merge"]
            }
        }
    ]
}

Note that I needed a Windows variant for prove since the Sublime editor doesn't work the same as cmd. You could, alternatively, add '"shell":true' to use the system's command shell so you don't need a separate command for Windows.

With this build file in place, I can now press ctrl+b to run any Perl script, with its project lib directory in @INC, and ctrl+shift+b to run prove. Voila!

Here are the final files:
project file (put a copy in your project root folders)
Perl build file (only one is needed per ST installation)

Sunday, February 3, 2013

Managing Global State: the Flip-Flop Operator

Today I was faced with another mysterious failing test while writing a test suite for some legacy code. I knew it had to be a problem with persisting state because this particular test only failed when processing a particular data set with the same object which was just used to process another set.

My first step to trying to fix this was to delete all of the values stored in the object during the processing procedure:

delete $self->{stateDatum1};
delete $self->{stateDatum2};
#etc....

Nothing changed. I reduced the problematic code into a small example for this post. First, the module to be tested:

package Demo::Bad::GlobalFlipFlop;
use strict;
use warnings;
use autodie;
use 5.010;

sub new {
 my ($class) = @_;
 my $self = {};
 bless $self, $class;
 return $self;
}

#return true if parsing succeeded, false otherwise.
sub parse {
 my ($self, $file) = @_;
 open my $file_in, '<', $file;
 
 my $started = 0;
 while( <$file_in> ){
  
  #flip-flop 
  next unless /^=startHere/i .. 0;    # start processing
  $started = 1;
  #continue doing something with file contents...
  # say 'hello:)' if(/hello/);
  # say 'goodbye:(' if(/goodbye/);
 }
 if(not $started){
  say "File not processed; missing '=startHere' line.";
  return;
 }
 close $file_in;
 return 1;
}

1;

The main idea here is that we are processing some file and returning a boolean representing its validity. The only requirement of validity of the file is that a certain start token is found within it; everything before the start sequence is ignored. Here are valid and invalid example files:

#good_file.txt
=startHere
hello
goodbye

#bad_file.txt- doesn't contain a start sequence
hello
goodbye

Now, the test file:

use strict;
use warnings;
use autodie;
use Test::More tests => 2;
use File::Slurp;
use Demo::Bad::GlobalFlipFlop;

my $good_name = 'good_file.txt';
my $bad_file = 'bad_file.txt';

my $demo = Demo::Bad::GlobalFlipFlop->new();

ok( $demo->parse($good_name) );
ok( not $demo->parse($bad_file) );

The output of running this file:

>perl test.pl
1..2
hello:)
goodbye:(
ok 1
hello:)
goodbye:(
not ok 2
# Failed test at test.pl line 61.
# Looks like you failed 1 test of 2.

Why did it fail the second test, which involves checking that an invalid file is considered invalid?
The bug is in the line which matches the start token:

next unless /^=startHere/i .. 0;    # start processing

The regex, flip-flop operator and 0 were clearly some sort of idiom that I was unfamiliar with. I had only ever used .. with numbers in list context, such as 1..10, where it is the range operator and produces the numbers 1 through 10. In scalar context it is the flip-flop operator. How does it work? Let's check perlop:

Each ".." operator maintains its own boolean state, even across calls to a subroutine that contains it. It is false as long as its left operand is false. Once the left operand is true, the range operator stays true until the right operand is true, AFTER which the range operator becomes false again.

The mysterious line thus worked like this:

Skip lines of the input file until the left side, a match for the start token, is true
Don't skip lines again until the right side, 0, is evaluated as true (which never happens).
The state of this flip-flop operator is stored between subsequent calls to the subroutine. It's a hidden global variable!

In list context (such as 1..10), .. is just a range and keeps no state between uses. Not so here! I replaced the offending code with some that keeps state for me:

my $started = 0;
while(<$file_in>){
    if(/^=startHere/i){
        $started = 1;
    }
    next unless $started;
    #continue processing...
}

With this, everything works as expected:

>perl test.pl
1..2
ok 1
File not processed; missing '=startHere' line.
ok 2

Note that this bug only presented itself to me because I changed the legacy standalone script to be its own module, creating the possibility of storing state between subroutine calls.
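
The persistence is easy to reproduce without any files on disk. The sketch below is not the legacy code; lines_after_start is a hypothetical helper that reads from an in-memory filehandle and returns the lines it did not skip. The second call is given text with no start token, yet nothing is skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines let through by the flip-flop.
sub lines_after_start {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Can't open in-memory file: $!";
    my @seen;
    while (my $line = <$fh>) {
        chomp $line;
        # Same idiom as the buggy line; its state outlives this call.
        next unless $line =~ /^=startHere/i .. 0;
        push @seen, $line;
    }
    close $fh;
    return @seen;
}

my @first  = lines_after_start("=startHere\nhello\n");
my @second = lines_after_start("hello\ngoodbye\n");   # no start token

print "first call kept:  @first\n";
print "second call kept: @second\n";   # both lines leak through
```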

Sunday, January 27, 2013

When not to use Perl's Implicit close; Suffering from Buffering

This post is a quick note on a bug I had difficulty tracking down.

One nice feature of Perl, introduced long before my time, is that of implicit closing. Perl closes filehandles for you when you forget (maybe on purpose). So the following is not a resource leak as a standalone script:

open my $file, '>:encoding(UTF-8)', '/path/to/new/file'
    or die "couldn't open file: $!";
print $file 'Hello!';

When the script finishes, Perl will close $file for you, so you can be nice and lazy. The caveat to this is that the variable $. isn't reset as it would be with a normal close (see docs here). $. holds the current line number from the last file read. So if you were processing a file line-by-line and found an error, you might print an error like 'bad value foo on line XYZ' using the $. variable for XYZ. I raised a question about this on StackOverflow.

Today I found another case where not explicitly closing a filehandle means trouble. I was working on testing a modulino-style script with flexible outputs. You can call a method to set the handle that this script prints to. In my test script, I was setting the handle to be some filehandle and then checking the contents of the file against a string. The problem? The file was always empty at run time, but contained what I expected it to when I manually inspected it. Here's some example broken code:

#ImplicitClose.pm
package Demo::Bad::ImplicitClose;
use strict;
use warnings;

sub new {
 my ($class) = @_;
 my $self = {};
 bless $self, $class;
 return $self;
}

sub output_fh {
    my ( $self, $fh ) = @_;
    if ($fh) {
        if ( ref($fh) eq 'GLOB' ) {
            $self->{output_fh} = $fh;
        }
        else {
            open my $fh2, '>', $fh or die "Couldn't open $fh";
            $self->{output_fh} = $fh2;
        }
    }
    $self->{output_fh};
}

sub some_long_method {
 my ($self, $text) = @_;
 print { $self->{output_fh} } $text;
}

1;

#test.pl
use strict;
use warnings;
use autodie;
use Test::More tests => 1;
use File::Slurp;
use Demo::Bad::ImplicitClose;

my $file_name = 'file1.txt';

#make sure we pass the test from outputting something *this* run
unlink $file_name if -e $file_name;

my $print = 'some junk';
my $demo = Demo::Bad::ImplicitClose->new();
$demo->output_fh($file_name);
$demo->some_long_method($print);

my $contents = read_file($file_name);
is($contents, $print);

If you run test.pl, you'll see that its one and only test fails:

>perl -I[folder where you put the Demo directory] test.pl
1..1
not ok 1
# Failed test at test.pl line 68.
# got: ''
# expected: 'some junk'
# Looks like you failed 1 test of 1.

Then, when you inspect the contents of file1.txt, you have:
some junk

What happened here? I was suffering from buffering. Because neither test.pl nor ImplicitClose.pm closed the file, it was still open when I was trying to read it. Nothing had been written to it yet because the amount printed was so small that it had to wait in the buffer either until there was more to write or until the file was closed, which would flush the buffer. Implicit close wouldn't be performed until the filehandle's reference count reached 0, and the $demo object still had a reference to it. So the test would have worked fine if I had assigned undef to $demo, or just closed the filehandle.
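
The effect can be seen in isolation with a few lines. This is a sketch, not part of the module above: it writes to a temporary file created with File::Temp and checks the size on disk before and after an explicit flush (closing the handle would have the same effect).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Handle;

# Create an empty temporary file and remember its path.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

open my $out, '>', $path or die "Couldn't open $path: $!";
print {$out} 'some junk';

# The text is still sitting in Perl's output buffer.
my $size_buffered = (stat $path)[7];

# Flushing (or closing) pushes it out to the file.
$out->flush;
my $size_flushed = (stat $path)[7];

close $out;
print "before flush: $size_buffered bytes, after flush: $size_flushed bytes\n";
```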

Watch those implicit closes.