So &quot;cleaning up bad MS-HTML&quot; did not include &quot;unlink $crapfile&quot;. Your patience and tolerance are astounding :-)<br><br><div class="gmail_quote">On Sun, Mar 30, 2008 at 6:19 PM, Mike Harrison &lt;<a href="mailto:meuon@geeklabs.com">meuon@geeklabs.com</a>&gt; wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Jim<br>

<div class="Ih2E3d">&gt; I need to update about 43k files and sed just won&#39;t cut it for this<br>

&gt; task. &nbsp; What I need to do is replace 2 lines with 4 new ones, and the<br>

&gt; lines contain URLs (backslashes, brackets, etc.). &nbsp;What I would like<br>

&gt; to do is put the new text in a file and pass it and the search text to<br>

&gt; some program that will modify all the files. &nbsp; Any ideas on what&#39;s<br>

&gt; available to do that?<br>

<br>

</div>I don&#39;t do as much of this as I did when I was fixing mailQs and such<br>

at an ISP, but I always ended up doing it in Perl,<br>

often with a switch to process just 10 files and write the changed files<br>

to /tmp so I could verify them by hand before bulk-changing hundreds of<br>

thousands (or more) of files. I&#39;m not as good with find/sed/awk, but one of<br>

the reasons I did things like this in Perl is that it worked well<br>

when there were lots of files in a single directory, where shell scripts<br>

couldn&#39;t handle the long lists of files well.<br>

<br>
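A minimal sketch of that kind of trial-run switch, assuming the files are .html pages in a single directory. The names trial_run, $limit and $edit are made up for illustration and are not from the original scripts; the results go to a scratch directory so the originals stay untouched.<br>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: apply $edit to at most $limit .html files in $src,
# writing the results into $out_dir and leaving the originals untouched.
sub trial_run {
    my ($src, $out_dir, $limit, $edit) = @_;

    # readdir copes with very large directories, where a shell glob may not.
    opendir(my $dh, $src) or die "Cannot open $src: $!";
    my @names = sort grep { /\.html$/ && !/^\./ } readdir($dh);
    closedir($dh);

    my @written;
    for my $name (@names) {
        last if @written >= $limit;

        my $in_path = File::Spec->catfile($src, $name);
        open(my $in, '<', $in_path) or die "Cannot read $in_path: $!";
        my $text = do { local $/; <$in> } // '';   # slurp the whole file
        close($in);

        my $out_path = File::Spec->catfile($out_dir, $name);
        open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";
        print {$out} $edit->($text);
        close($out) or die "Cannot close $out_path: $!";

        push @written, $out_path;
    }
    return @written;
}

# Example: process only 10 files, then inspect the copies by hand.
if (@ARGV) {
    my $scratch = tempdir('trial-XXXXXX', TMPDIR => 1);
    my $strip_cr = sub { my $t = shift; $t =~ s/\r//g; return $t };
    print "$_\n" for trial_run($ARGV[0], $scratch, 10, $strip_cr);
}
```

Once the copies look right, the same edit can be run with a larger limit and a real output location.<br>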

I also often found it easier to write and debug complex regexes in Perl<br>

as several steps. Regexes are incredibly powerful,<br>

and it is really easy to make them do things you didn&#39;t intend when the input has exceptions.<br>

<br>
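One common surprise with text like URLs is that dots, brackets and question marks are regex metacharacters. A small sketch, assuming the search text and the new text are held in plain strings (for example read from two files); replace_literal and slurp are hypothetical names, and the example.com URLs are placeholders.<br>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every literal occurrence of $old with $new.
# \Q...\E quotes the regex metacharacters in $old, so dots, brackets and
# question marks in URLs match only themselves. $new is inserted as-is.
sub replace_literal {
    my ($text, $old, $new) = @_;
    my $count = ($text =~ s/\Q$old\E/$new/g);
    return ($text, $count || 0);
}

# Read a whole file into one string so multi-line matches are possible.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot read $path: $!";
    local $/;
    my $text = <$fh>;
    close($fh);
    return defined $text ? $text : '';
}

# Usage sketch: perl replace.pl old.txt new.txt page.html
if (@ARGV == 3) {
    my ($old_file, $new_file, $target) = @ARGV;
    my ($text, $count) =
        replace_literal(slurp($target), slurp($old_file), slurp($new_file));
    print $text;                                  # review before overwriting
    warn "$count replacement(s) in $target\n";
} else {
    # Two old lines containing URLs become four new lines.
    my $old  = "see http://old.example.com/docs[1].html?x=1\n"
             . "and http://old.example.com/faq.html\n";
    my $new  = "line one\nline two\nline three\nline four\n";
    my ($text, $count) = replace_literal("top\n${old}bottom\n", $old, $new);
    print $text;
}
```

Without \Q...\E, the pattern faq.html would also match faqXhtml, and [1] would be read as a character class.<br>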

I don&#39;t have my old Perl scripts from those days,<br>

<br>

but they all had something like what is below (which cleans up bad MS-HTML).<br>

(Note: the character encoding in the regexes didn&#39;t cut and paste well into e-mail.)<br>

-------------------------------------------------------------------------------------------<br>

print &quot;Opening: $dd\n&quot; ;<br>

opendir(INC, $dd) or die &quot;Cannot open $dd: $!&quot; ;<br>

@incfiles = readdir(INC) ;<br>

closedir INC ;<br>

foreach (sort @incfiles) {<br>

 &nbsp; next if /^\./ ; &nbsp;# skip dotfiles, including . and ..<br>

 &nbsp; if (/\.html$/) { &nbsp;# only names ending in .html<br>

 &nbsp; &nbsp; &nbsp; $file = $_ ;<br>

 &nbsp; &nbsp; &nbsp; fixheader($file) ;<br>

 &nbsp; &nbsp; &nbsp; #sleep 1 ; &nbsp;# let the server breathe. Optional.<br>

 &nbsp; }<br>

}<br>

<br>

sub fixheader {<br>

 &nbsp;my ($file) = @_ ;<br>

 &nbsp;my $page = &#39;&#39; ;<br>

 &nbsp;my $body = &#39;F&#39; ;<br>

 &nbsp;open(IN,&quot;$dd/$file&quot;) or die &quot;Cannot read $dd/$file: $!&quot; ;<br>

 &nbsp;while(&lt;IN&gt;) {<br>

 &nbsp; &nbsp; if(/&lt;body/) { $body = &quot;T&quot; ; } &nbsp;# don&#39;t process headers<br>

 &nbsp; &nbsp; if($body eq &quot;T&quot;) {<br>

 &nbsp; &nbsp; &nbsp; $page .= $_ ;<br>

 &nbsp; &nbsp; }<br>

 &nbsp;} # end while IN<br>

 &nbsp;close IN ;<br>

 &nbsp;$page =~ s/\r//g ; &nbsp; &nbsp; &nbsp; # deletes CRs (the literal ^M was lost in e-mail)<br>

 &nbsp;$page =~ s/&amp;#13;/[[P]]/g ; # turns encoded CRs into [[P]] paragraph markers<br>

 &nbsp;$page =~ s/\x95/[[li]]/g ; # assuming &quot;Magic Char 95&quot; is byte 0x95, the Windows-1252 bullet; turns into bullets/list items<br>

 &nbsp;$page =~ s/\n//g ; &nbsp; # deletes LFs<br>

 &nbsp;#lots more of these..<br>

 &nbsp;open(OUT,&quot;&gt;$dd/$file.new&quot;) or die &quot;Cannot write $dd/$file.new: $!&quot; ;<br>

 &nbsp;print OUT $page ;<br>

 &nbsp;close OUT ;<br>

<div><div></div><div class="Wj3C7c">} ;<br>

<br>

<br>

_______________________________________________<br>

Ale mailing list<br>

<a href="mailto:Ale@ale.org">Ale@ale.org</a><br>

<a href="http://mail.ale.org/mailman/listinfo/ale" target="_blank">http://mail.ale.org/mailman/listinfo/ale</a><br>

</div></div></blockquote></div><br><br clear="all"><br>-- <br>James P. Kinney III <br>