So "cleaning up bad MS-HTML" did not include "unlink $crapfile". Your patience and tolerance are astounding :-)<br><br><div class="gmail_quote">On Sun, Mar 30, 2008 at 6:19 PM, Mike Harrison <<a href="mailto:meuon@geeklabs.com">meuon@geeklabs.com</a>> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Jim<br>
<div class="Ih2E3d">> I need to update about 43k files and sed just won't cut it for this<br>
> task. What I need to do is replace 2 lines with 4 new ones, and the<br>
> lines contain URLs (backslashes, brackets, etc.). What I would like<br>
> to do is put the new text in a file and pass it and the search text to<br>
> some program that will modify all the files. Any ideas on what's<br>
> available to do that?<br>
<br>
</div>I haven't done as much of this as I used to, when I was fixing mail queues and such<br>
at an ISP, but I always ended up doing it in Perl,<br>
often with a switch for processing just 10 files and writing the changed files<br>
to /tmp so I could manually verify them before bulk-changing hundreds of<br>
thousands (or more) of files. I'm not as good with find/sed/awk, but one of<br>
the reasons I did things like this in Perl is that it worked well<br>
when there were lots of files in a single directory, and shell scripting<br>
couldn't handle the long lists of files well.<br>
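<br>
A minimal sketch of that trial-run idea, not the original script: the helper name trial_run, its arguments, and the paths in the usage note are all hypothetical. It reads the directory with readdir instead of a shell glob, processes only the first few .html files, and writes the results to a separate directory so the originals stay untouched.<br>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: apply $fix (a code ref taking and returning the file
# text) to at most $limit .html files in $src_dir, writing the results into
# $out_dir so the originals stay untouched. Returns the names processed.
sub trial_run {
    my ($src_dir, $out_dir, $limit, $fix) = @_;

    # readdir avoids handing the shell one enormous argument list
    opendir(my $dh, $src_dir) or die "Cannot open $src_dir: $!";
    my @names = sort grep { !/^\./ && /\.html$/ } readdir($dh);
    closedir($dh);

    # keep only the first $limit files for the trial
    splice(@names, $limit) if @names > $limit;

    for my $name (@names) {
        my $in_path = File::Spec->catfile($src_dir, $name);
        open(my $in, '<', $in_path) or die "Cannot read $in_path: $!";
        my $text = do { local $/; <$in> };   # slurp the whole file
        close($in);

        my $out_path = File::Spec->catfile($out_dir, $name);
        open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";
        print {$out} $fix->($text);
        close($out);
    }
    return @names;
}
```

Something like trial_run('/path/to/pages', '/tmp/trial', 10, sub { my ($t) = @_; $t =~ s/\r//g; return $t }) would then leave ten cleaned copies in /tmp/trial to compare against the originals before running the change for real.<br>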
<br>
I also often found it easier to write and debug complex regexes in Perl<br>
as several steps. Regexes are incredibly powerful, and it is<br>
really easy to make them do things you didn't intend because of exceptions you didn't anticipate.<br>
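<br>
As a hedged illustration of working in steps, applied to the original question (swapping a two-line block containing URLs for four new lines): the helper name replace_block is hypothetical, and this sketch assumes the old block must match exactly, character for character. quotemeta makes backslashes, brackets, and dots in the search text match literally instead of acting as regex syntax.<br>

```perl
use strict;
use warnings;

# Hypothetical helper: replace every exact occurrence of the multi-line block
# $old with $new inside $text. All three are plain strings, not patterns.
sub replace_block {
    my ($text, $old, $new) = @_;

    # Step 1: quote the search text so . ? [ ] \ and friends match literally
    my $literal = quotemeta($old);

    # Step 2: compile once; this pattern can be checked on its own while debugging
    my $pattern = qr/$literal/;

    # Step 3: substitute; $new is interpolated as a plain string, so any
    # backslashes or dollar signs inside it are inserted unchanged
    (my $result = $text) =~ s/$pattern/$new/g;
    return $result;
}
```

The search and replacement text could each be read from a file into $old and $new, and the function applied to the slurped contents of every file to be changed.<br>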
<br>
I don't have my old Perl scripts from those days,<br>
but they all had something like what is below (which cleans up bad MS-HTML).<br>
(Note: the character encoding in the regexes didn't cut and paste well into e-mail.)<br>
-------------------------------------------------------------------------------------------<br>
opendir(INC, $dd) or die "Cannot open $dd: $!" ;<br>
print "Opening: $dd\n" ;<br>
@incfiles = readdir(INC) ;<br>
closedir INC ;<br>
foreach (sort @incfiles) {<br>
next if /^\./ ; # skip dotfiles, including . and ..<br>
if (/\.html$/) { # only names ending in .html<br>
$file = $_ ;<br>
fixheader($file) ;<br>
#sleep 1 ; # let the server breathe. Optional.<br>
}<br>
}<br>
<br>
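sub fixheader {<br>
my ($file) = @_ ; # Perl takes arguments from @_, not from a named parameter list<br>
$page = '' ;<br>
$body = 'F' ;<br>
open(IN,"$dd/$file") or die "Cannot open $dd/$file: $!" ;<br>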
while(<IN>) {<br>
if(/\<body/) { $body = "T" ; } ; #don't process headers..<br>
if($body eq "T") {<br>
$page .= $_ ;<br>
} ;<br>
} ; # end while IN<br>
close IN ;<br>
$page =~ s/\r//g ; #deletes cr's (the literal ^M did not survive e-mail)<br>
$page =~ s/\&\#13;/[\[P\]\]/g ; #turns encoded CR's into <P><br>
$page =~ s/\x95/\[[li]]/g ; # NOTE: \x95 stands in for the "Magic Char 95" lost in e-mail, assuming 95 is hex (the Windows-1252 bullet). Turns bullets into list items<br>
$page =~ s/\n//g ; # deletes lf's<br>
#lots more of these..<br>
open(OUT,">$dd/$file.new") or die "Cannot write $dd/$file.new: $!" ;<br>
print OUT $page ;<br>
close OUT ;<br>
<div><div></div><div class="Wj3C7c">} ;<br>
<br>
<br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br>James P. Kinney III <br>