sed-Strip HTML tags (or XML tags)

Status
Not open for further replies.

kellnerp

So far I have this:
Code:
sed -r 's/(<[^>\n]*>)//g'
It does pretty well with tags that sit entirely on one line, but tags such as <img src=blah blah blah that extend over more than one line are not being caught.

Likewise, for blocks such as <style type=text/css> ... </style>, I want to remove not just the tags but also the text between them, and those are not being caught either. Again, <style></style> tag pairs run over multiple lines in the general case.

Is there a way to accomplish this in sed? I can do it in awk already.
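
For reference, a minimal sketch of one way this might be done in sed, assuming GNU sed and input small enough to hold in memory. The one-line expression misses multi-line tags because sed applies it to one line at a time, so the sketch first appends every line to the pattern space and only then runs the substitutions. The function name strip_tags is hypothetical, and the <style> rule assumes lowercase tags and no "<" character inside the style block (sed has no non-greedy match).

```shell
# Sketch only: strip_tags is a made-up name; assumes GNU sed.
strip_tags() {
  # :a ... ba  loops, appending the next line (N) until the last line ($)
  #            is reached, so the whole input sits in the pattern space.
  # 1st s///   drops <style ...> ... </style> including the text between,
  #            assuming the enclosed CSS contains no "<".
  # 2nd s///   drops any remaining tag; [^>] also matches embedded newlines.
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
      -e 's/<style[^>]*>[^<]*<\/style>//g' \
      -e 's/<[^>]*>//g'
}

# Usage: strip_tags < page.html > page.txt
```

This is not a real HTML parser: a ">" inside an attribute value, comments, or <script> blocks would need extra rules.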

 
kellnerp,

Quite a few years ago, I wrote a crude SGML parser in Perl. Is there any reason you are not using Perl to do this?

Perl's split function searches a string for an arbitrary sequence, and it returns everything in the string up to the sequence and everything in the string after the sequence. If you are messing with text, this is an awesome tool.

JHG
 