sed-Strip HTML tags (or XML tags)
(OP)
So far I have this:
CODE
sed -r 's/(<[^>\n]*>)//g'
It does pretty well with tags that sit entirely on one line, but tags like <img src=blah blah blah that extend over more than one line are not being caught.
Likewise, blocks like <style type=text/css> ... </style>, where I want to remove not just the tags but also the text between them, are not being caught. Again, <style></style> tag pairs run over multiple lines in the general case.
Is there a way to accomplish this in sed? I can do it in awk already.
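For what it's worth, here is a rough sketch of how the multi-line case might be handled in sed itself. It assumes GNU sed, the function name strip_html is made up for illustration, and it only recognizes lowercase <style> tags. The first sed deletes whole lines from the line containing <style ...> through the line containing </style>, so any other text sharing those lines is lost too. The second sed pulls the rest of the input into the pattern space, so that [^>]* can run across newlines.

```shell
#!/bin/sh
# strip_html: hypothetical helper name. Reads HTML on stdin, writes text on stdout.
# Assumes GNU sed; other seds may treat newlines in [^>] differently.
strip_html() {
  # Pass 1: drop <style>...</style> blocks.
  #   - the s/// removes a pair that opens and closes on the same line
  #   - the /start/,/end/ range deletes every line of a multi-line block
  # Pass 2: append all remaining lines to the pattern space (N),
  #   then strip tags; without \n in the bracket, [^>] also matches newlines.
  sed -e 's/<style[^>]*>.*<\/style>//g' \
      -e '/<style/,/<\/style>/d' |
  sed -e ':a' -e '$!N' -e '$!ba' \
      -e 's/<[^>]*>//g'
}

# Example call (file names are placeholders):
#   strip_html < page.html > page.txt
```

Holding the whole file in the pattern space is fine for ordinary pages but may be slow for very large input. A ">" inside an attribute value will still end the match early.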
TOP
CSWP, BSSE
www.engtran.com www.niswug.org
www.linkedin.com/in/engineeringtransport
Phenom IIx6 1100T = 8GB = FX1400 = XP64SP2 = SW2009SP3
"Node news is good news."
RE: sed-Strip HTML tags (or XML tags)
Quite a few years ago, I wrote a crude SGML parser in Perl. Is there any reason you are not using Perl to do this?
Perl's split function searches a string for an arbitrary sequence and returns everything in the string up to the sequence and everything in the string after it. If you are messing with text, this is an awesome tool.
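The cut-at-a-delimiter idea described above can be tried without leaving the shell. The sketch below is not Perl and not the parser mentioned in this reply. It is a plain POSIX sh analogue built on parameter expansion, and the names split_once, strip_tags_split, BEFORE and AFTER are invented for illustration. Because a shell string can hold newlines, a tag that wraps across lines is handled the same way as one on a single line.

```shell
#!/bin/sh
# split_once STR SEP  (hypothetical helper)
# Sets BEFORE to the text ahead of the first SEP in STR and AFTER to the
# text following it. Returns 1 (BEFORE=STR, AFTER empty) if SEP is absent.
split_once() {
  str=$1 sep=$2
  case $str in
    *"$sep"*)
      BEFORE=${str%%"$sep"*}   # strip the longest suffix starting at SEP
      AFTER=${str#*"$sep"}     # strip the shortest prefix ending at SEP
      return 0 ;;
    *)
      BEFORE=$str AFTER=
      return 1 ;;
  esac
}

# strip_tags_split STR  (hypothetical helper)
# Repeatedly cuts at "<", keeps what came before it, then cuts at ">"
# and continues with what comes after. Prints the remaining text.
strip_tags_split() {
  rest=$1 out=
  while split_once "$rest" '<'; do
    out=$out$BEFORE
    # An unterminated tag: drop the remainder and stop.
    split_once "$AFTER" '>' || { rest=; break; }
    rest=$AFTER
  done
  printf '%s\n' "$out$rest"
}
```

A whole file could be passed in with something like strip_tags_split "$(cat page.html)", where page.html is a placeholder name. It does not remove the text between <style> and </style>.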