Replace / Substitute large chunks of text

See Regular Expressions for more information about Perl's pattern matching abilities.

Background

This originally came about because I needed to replace a chunk of HTML with another, smaller chunk. Instead of hardcoding a menu into each page, we could use PHP includes to pull the menu in from a separate file. This gives one place to edit the menus for future changes, instead of editing each individual HTML page.

So, basically I wanted to find this piece of code:

<!-- left column begins -->
        <div id="leftcol">
                <ul>
                        <li><a href="../life/services.html">Campus Services</a></li>
                        <li><a href="../main/calendar.html">Graduate Calendar</a></li>
                        <li><a href="../gradcatalog/">Catalog</a></li>
                        <li><a href="../funding/">Costs/Funding</a></li>
                        <li><a href="../about/diversity.html">Diversity Initiatives</a></li>
                        <li><a href="../procedures/forms.html">Forms</a></li>
                        <li><a href="../contact/staff.html">Graduate School Staff</a></li>
                        <li><a href="../contact/gpd.html">Graduate Program Directors/Coordinators</a></li>
                        <li><a href="../life/">Grad Life</a></li>
                        <li><a href="../requirements/">Graduation Requirements</a></li>
                        <li><a href="http://www.umbc.edu/gsa/">GSA</a></li>
                        <li><a href="../life/faqs.html">FAQs</a></li>
                        <li><a href="http://www.umbc.edu/oir/cgi-bin/rws3.pl?FORM=UMBC_IncomingGraduateSurvey">New Student Survey</a></li>
                </ul>
        </div>
<!-- left column ends -->

And replace it with this code:

<!-- left column begins -->
        <?php include("../includes/left_menu_life.php"); ?>
<!-- left column ends -->

How to do it

Note: ideas originated from http://www.noctilucent.org/blog/archives/2003/12/replacing_large.html

  1. Make a file called generateRegEx.pl
    • #! /usr/bin/perl -w
      use strict;
      print "s%\n";
      while (<>) {
      	# escape any regex meta-chars
      	s/([].[\\^#|\$%*+?(){}])/\\$1/g;
      	# match trailing whitespace (incl. newlines) on non-empty lines
      	s/(.)$/$1\\s+/;
      	# match any internal whitespace
      	s/(\S)[ \t]+/$1\\s+/g;
      	print $_;
      }
      print <<EOT;
      %PUT
      REPLACEMENT
      TEXT
      HERE
      %six
      EOT
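    • This code creates a regular expression for you
    • It uses the substitution operator s/// (except the /’s are replaced by %’s, so the slashes in the HTML need no escaping)
    • The trailing six is three modifiers: s lets . match newlines, i ignores case, and x ignores literal whitespace in the pattern
    • Insert your replacement text where the CAPITAL LETTERS are above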
  2. Have a file ready that contains the HTML code to be replaced
    • called find.html in this example
  3. Use generateRegEx.pl to create a script called substitution.pl
    • $ perl generateRegEx.pl < find.html > substitution.pl
  4. (Optional) Modify substitution.pl to allow for wild-card character matching via (.*) or (.*?)
    • s%
      <!--\s+left\s+column\s+begins\s+-->(.*)
      <!--\s+left\s+column\s+ends\s+-->
      \s+%
                  <!-- left column begins -->
                      <?php include("../includes/left_menu_life.php"); ?>
                  <!-- left column ends -->
       
      %six
    • this allows you to find text that has a set pattern at the beginning and/or end, but not in the middle
    • (.*) is greedy and runs to the last possible end marker in the file; (.*?) stops at the first one
    • this was helpful to me because my text had the same beginning and end, but the HTML in the middle, which was a menu, differed from file to file.
  5. Test the script on one of the input files
    • $ perl -p -0777 substitution.pl < file01.html | less
    • the above command lets you preview what will happen; it does not actually change any file
  6. Run the script on all of your input files
    • $ perl -p -0777 -i.bak substitution.pl file*.html
    • the above command also conveniently backs up all modified files to *.bak
 
general/unix/perl/replace_large_chunks_of_text.txt · Last modified: 02.05.2008 17:32 by 130.85.181.194
 