Craig Gibbons' Lifeblog lifeblog://


C# – Removing HTML tags from a string using Regular Expressions

Regular expressions are largely misunderstood and shunned by the majority of developers, probably because they have their roots in the Unix world and Perl, and to most Microsoft developers all that *nix stuff is to be kept well away from. It will fry your monitor and crash your hard drive, and it might even make your coffee taste bad. Besides, who wants to use something some open-source tree-hugging hippy developed 20 years ago for the command line?

The other notable barrier to entry is that VBScript only got regular expression support in version 5.0. JScript had it earlier, so it's a little confusing that Microsoft didn't include it in VBScript from the outset. After all, if they wanted to create a language to replace JScript, they should have implemented all the features of JScript and more.

Truthfully, regular expressions are our friends. They are useful for a whole array of text operations, all of which can be accomplished by parsing text the old fashioned way, but which are altogether more elegant with a regular expression. The syntax is a little difficult to grasp in the beginning but, once understood, it opens a treasure trove of possibilities to the learned, grinning developer.

Today, I was faced with a simple problem: I had to write something to remove all the HTML tags from a string. I did this the old fashioned way some time ago and it worked well, but regular expressions are clearly the "right" way to go about solving this problem. I got out my old favourite regular expressions guide and had a quick look for an expression to do the job. Not finding one, I just wrote a basic one which, granted, may not account for every possibility, but which will do the job for the vast majority of everyday scenarios. The expression goes as follows:

// [^>]+ stops at the first '>', so each tag is matched on its own; a greedy .* would run on to the last '>' in the input.
Regex regex = new Regex("</?[^>]+>");
htmlString = regex.Replace(htmlString, string.Empty);

The beauty of this approach is that the desired result can be achieved in just two lines. Oh yeah, .NET Rocks!
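
For anyone curious how the two approaches compare side by side, here is a minimal sketch. The names HtmlStripper, StripWithRegex and StripByScanning are made up for illustration. The pattern uses a negated character class, [^>]+, so that each match stops at the first '>'. The scanning method is only a guess at what "the old fashioned way" might look like, not necessarily the code referred to above. Neither version handles comments, scripts, or a '>' inside an attribute value.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class HtmlStripper
{
    // Matches one opening or closing tag; [^>]+ cannot run past the first '>'.
    private static readonly Regex TagPattern = new Regex("</?[^>]+>");

    // Regex version: every matched tag is replaced with an empty string.
    public static string StripWithRegex(string html)
    {
        return TagPattern.Replace(html, string.Empty);
    }

    // Hand-rolled version: copy only the characters that fall outside '<' ... '>'.
    public static string StripByScanning(string html)
    {
        var result = new StringBuilder(html.Length);
        bool insideTag = false;

        foreach (char c in html)
        {
            if (c == '<')
            {
                insideTag = true;          // a tag starts here, stop copying
            }
            else if (c == '>' && insideTag)
            {
                insideTag = false;         // the tag ends here, resume copying
            }
            else if (!insideTag)
            {
                result.Append(c);          // ordinary text, keep it
            }
        }

        return result.ToString();
    }
}
```

Given the input "<p>Hello <b>world</b></p>", both methods return "Hello world". They differ on malformed input: a stray '<' with no closing '>' is left alone by the regex version, while the scanning version discards everything after it.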
