How to convert HTML text into plain text using C#?

I need to display HTML content as plain text. How can I do this using C#?

For example, below is my HTML code

<p>Some text here</p>
<div>Some more <strong>text</strong></div>

I would like to get the following plain-text output:

Some text here. Some more text

Any sample code or tutorial to do it?

Asked by:- jon
: 6807 At:- 6/6/2018 11:49:35 AM
C# HTML to plain text

1 Answer
Answered by:- vikas_jk

You can create a helper class and use the C# method below to convert your HTML content into plain text:

 using System.Text;
 using System.Text.RegularExpressions;

 public static class HtmlHelper // helper class; the name is arbitrary
 {
     public static string HTMLToText(string HTMLCode)
     {
         // Remove new lines since they are not visible in HTML
         HTMLCode = HTMLCode.Replace("\n", " ");

         // Remove tab spaces
         HTMLCode = HTMLCode.Replace("\t", " ");

         // Collapse multiple white spaces into a single space
         HTMLCode = Regex.Replace(HTMLCode, "\\s+", " ");

         // Remove HEAD tag together with its content
         HTMLCode = Regex.Replace(HTMLCode, "<head\\b.*?</head>", ""
                             , RegexOptions.IgnoreCase | RegexOptions.Singleline);

         // Remove any JavaScript
         HTMLCode = Regex.Replace(HTMLCode, "<script\\b.*?</script>", ""
                             , RegexOptions.IgnoreCase | RegexOptions.Singleline);

         // Put a new line before line breaks (<br>) and before
         // paragraph tags that carry attributes (<p ...>)
         HTMLCode = HTMLCode.Replace("<br>", "\n<br>");
         HTMLCode = HTMLCode.Replace("<br ", "\n<br ");
         HTMLCode = HTMLCode.Replace("<p ", "\n<p ");

         // Remove all HTML tags before decoding entities, so that escaped
         // text such as "&lt;b&gt;" is not mistaken for a real tag
         HTMLCode = Regex.Replace(HTMLCode, "<[^>]*>", "");

         // Replace special characters like &, <, >, " etc.
         StringBuilder sbHTML = new StringBuilder(HTMLCode);
         // Note: There are many more special characters, these are just the
         // most common. You can add new characters to these arrays if needed.
         // "&amp;" stays last so that "&amp;lt;" becomes "&lt;" and not "<".
         string[] OldWords = {"&nbsp;", "&quot;", "&lt;", "&gt;", "&reg;",
                              "&copy;", "&bull;", "&trade;", "&#39;", "&amp;"};
         string[] NewWords = { " ", "\"", "<", ">", "®", "©", "•", "™", "'", "&" };
         for (int i = 0; i < OldWords.Length; i++)
             sbHTML.Replace(OldWords[i], NewWords[i]);

         // Finally, return the plain text
         return sbHTML.ToString();
     }
 }

The above method takes the HTML content, removes all the HTML tags and script code, and returns the result as plain text. You can check the online sample here

Go to the above link and run it with the HTML from the question; you should see the output as

Some text here Some more text
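
Since the entity list in the method above only covers the most common characters, here is a minimal sketch of a variant that hands entity decoding to the framework's System.Net.WebUtility.HtmlDecode. The class name HtmlTextSketch and the method name ToPlainText are made up for illustration and are not part of the answer above. It also differs from that method in that it replaces each tag with a space and does not keep line breaks for <br>, so treat it as one possible approach and not a drop-in replacement.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, shown only as an illustration.
public static class HtmlTextSketch
{
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Drop <head>, <script> and <style> blocks together with their content.
        string text = Regex.Replace(html, @"<(head|script|style)\b.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace every remaining tag with a space so adjacent words do not merge.
        text = Regex.Replace(text, "<[^>]*>", " ");

        // Decode entities only after the tags are gone, so escaped markup
        // such as "&lt;b&gt;" survives as literal text.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of white space and trim both ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<p>Some text here</p>\n<div>Some more <strong>text</strong></div>";
        Console.WriteLine(ToPlainText(html)); // prints: Some text here Some more text
    }
}
```

Replacing tags with a space keeps words from neighbouring elements apart, at the cost of an extra space when a tag sits directly before punctuation. For malformed or deeply nested markup, regex-based stripping can give surprising results, and a real HTML parser is usually the safer choice.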


At:- 6/7/2018 3:16:58 PM
Perfect answer, that's what I was looking for, thank you.
By : jon - at :- 6/18/2018 2:27:32 PM
