How to convert HTML text into plain text using C#?


I have a requirement to display plain text extracted from HTML content. How can I do this in C#?

For example, here is my HTML:

<p>Some text here</p>
<div>Some more <strong>text</strong></div>

I would like to get the following plain-text output:

Some text here. Some more text

Is there any sample code or a tutorial for this?


Asked by: jon at 6/6/2018 11:49:35 AM
C# HTML to plain text






1 Answer
Answered by: vikas_jk

You can create a helper class for converting HTML content into plain text and put the following C# method in it:

    using System.Text;
    using System.Text.RegularExpressions;

    public static string HTMLToText(string HTMLCode)
    {
        // Collapse new lines, tabs and repeated spaces into single spaces,
        // since they are not visible in rendered HTML
        HTMLCode = Regex.Replace(HTMLCode, "\\s+", " ");

        // Remove HEAD tag (\b keeps <header> from matching)
        HTMLCode = Regex.Replace(HTMLCode, "<head\\b.*?</head>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove any JavaScript
        HTMLCode = Regex.Replace(HTMLCode, "<script\\b.*?</script>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Start a new line at every line break (<br>) and paragraph (<p>),
        // with or without attributes
        HTMLCode = Regex.Replace(HTMLCode, "<(br|p)\\b", "\n<$1",
            RegexOptions.IgnoreCase);

        // Remove all remaining HTML tags
        HTMLCode = Regex.Replace(HTMLCode, "<[^>]*>", "");

        // Replace special characters like &, <, >, " etc.
        // This runs after tag removal so that a decoded < or > is not
        // mistaken for a tag and stripped.
        StringBuilder sbHTML = new StringBuilder(HTMLCode);
        // Note: There are many more special characters, these are just the
        // most common. You can add new ones to these arrays if needed, but
        // keep "&amp;" last so that "&amp;lt;" becomes "&lt;" and not "<".
        string[] OldWords = { "&nbsp;", "&quot;", "&lt;", "&gt;", "&reg;",
            "&copy;", "&bull;", "&trade;", "&#39;", "&amp;" };
        string[] NewWords = { " ", "\"", "<", ">", "®", "©", "•", "™", "'", "&" };
        for (int i = 0; i < OldWords.Length; i++)
        {
            sbHTML.Replace(OldWords[i], NewWords[i]);
        }

        // Finally, trim leading/trailing whitespace and return plain text
        return sbHTML.ToString().Trim();
    }

The above method takes the HTML content, removes all the HTML tags and scripts, and returns the result as plain text. You can check the online sample here: http://rextester.com/AKMG13869

Go to the above link and run it. For the HTML in the question, the output is as follows (the two blocks are joined by a space; no period is added):

Some text here Some more text

done.
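
If you would rather not maintain the list of special characters by hand, here is a minimal sketch of the same idea that lets the framework decode the entities. It assumes System.Net.WebUtility is available in your target framework; the class name HtmlTextSketch and the method name StripTagsAndDecode are hypothetical and only used for illustration. It is still a regex-based approximation, not a full HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Hypothetical helper: strips tags first, then decodes the entities
    // that remain, so decoded < and > are never treated as tags.
    public static string StripTagsAndDecode(string html)
    {
        // Drop <head>, <script> and <style> blocks together with their contents
        string text = Regex.Replace(html, @"<(head|script|style)\b.*?</\1>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace every remaining tag with a space so that adjacent blocks
        // do not run together (this also splits words around inline tags)
        text = Regex.Replace(text, "<[^>]*>", " ");

        // Decode named and numeric entities (&amp;, &lt;, &#39;, ...)
        text = WebUtility.HtmlDecode(text);

        // Collapse whitespace; \s also matches the U+00A0 produced by &nbsp;
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<p>Some text here</p>\n<div>Some more <strong>text</strong></div>";
        Console.WriteLine(StripTagsAndDecode(html)); // Some text here Some more text
    }
}
```

Replacing each tag with a space is a design choice: it keeps "<p>a</p><p>b</p>" from becoming "ab", at the cost of splitting a word that has an inline tag in the middle of it.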

Answered at: 6/7/2018 3:16:58 PM
Comment: Perfect answer, that's what I was looking for, thank you.
By: jon at 6/18/2018 2:27:32 PM




