Best way to split a string into words

Hello folks,

To understand this process, you first need to know these terms:

  • word chars are letters, numbers and the underscore
  • non-word chars are special chars such as punctuation
  • white-space chars are the blank space, the carriage return and the new line.
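
To make the three groups concrete, here is a small sketch that reports which group a single character falls into, using the regular expression classes \w and \s. The class name CharClassDemo and the helper Classify are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharClassDemo
{
    // Hypothetical helper: names the group a single character belongs to.
    // \W technically matches white-space as well, so white-space is
    // checked first to keep the three groups from the list separate.
    public static string Classify(char c)
    {
        string s = c.ToString();
        if (Regex.IsMatch(s, @"\s")) return "white-space";
        if (Regex.IsMatch(s, @"\w")) return "word";
        return "non-word";
    }

    static void Main()
    {
        char[] samples = { 'a', '7', '_', '%', ',', ' ', '\n' };
        foreach (char c in samples)
        {
            // Print the code point instead of the char so that
            // white-space samples stay readable in the output.
            Console.WriteLine("U+{0:X4} -> {1}", (int)c, Classify(c));
        }
    }
}
```

Here 'a', '7' and '_' come out as word chars, '%' and ',' as non-word chars, and the space and new line as white-space.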

This method splits a string into words in 3 different ways:

  • Splitting into words and keeping all non-word chars. Example:
    • We’re 100% done, now loading…
      We’re
      100%
      done,
      now
      loading…

    This one splits on the white-space class \s.

  • Splitting into words and removing only the non-word chars you choose. Example:
    • We’re 100% done, now loading…
      We’re
      100%
      done
      now
      loading

    This requires a bit more coding. The chars to remove are set by the argument DontIncludeThese; in the example above they are the comma and the ellipsis.

  • Splitting into words and excluding all non-word chars. Example:
    • We’re 100% done, now loading…
      We
      re
      100
      done
      now
      loading

    This is fairly easy: just split on the non-word char regular expression \W. Note that the apostrophe is a non-word char too, so “We’re” becomes two words.
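
As a rough sketch, the three behaviours come down to three split patterns. The helper SplitOn and the class SplitDemo are names invented for this illustration, and the custom chars in the second call (comma and ellipsis) are just the ones from the example text.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SplitDemo
{
    // Hypothetical helper: splits on a pattern and drops the empty
    // strings Regex.Split returns when the text starts or ends with
    // a separator.
    public static string[] SplitOn(string text, string pattern)
    {
        return Regex.Split(text, pattern)
                    .Where(word => word.Length > 0)
                    .ToArray();
    }

    static void Main()
    {
        string text = "We’re 100% done, now loading…";

        // 1. Keep all non-word chars: split on white-space only.
        //    Words: We’re | 100% | done, | now | loading…
        Console.WriteLine(string.Join(" | ", SplitOn(text, @"\s+")));

        // 2. Remove chosen chars: split on white-space, comma and ellipsis.
        //    Words: We’re | 100% | done | now | loading
        Console.WriteLine(string.Join(" | ", SplitOn(text, @"[,…\s]+")));

        // 3. Remove all non-word chars: split on \W.
        //    Words: We | re | 100 | done | now | loading
        Console.WriteLine(string.Join(" | ", SplitOn(text, @"\W+")));
    }
}
```

Without the empty-string filter, the second and third calls would each end with an empty entry, because the text finishes with a separator (the ellipsis).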

See the difference? Now, I will show you how to accomplish that.

This method uses optional arguments, so it requires C# 4.0 (.NET Framework 4.0) or later. Please visit the Microsoft Developer Network to learn more about the C# 4.0 Language Specification and optional arguments.
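
If optional arguments are new to you, here is a minimal sketch of how an optional argument combines with a params array. The method Greet and the class OptionalArgsDemo are hypothetical and exist only for this example.

```csharp
using System;

static class OptionalArgsDemo
{
    // Hypothetical method: 'shout' is optional (defaults to false) and
    // 'suffixes' accepts any number of trailing chars, including none.
    public static string Greet(string name, bool shout = false,
        params char[] suffixes)
    {
        string result = shout ? name.ToUpperInvariant() : name;
        return result + new string(suffixes);
    }

    static void Main()
    {
        Console.WriteLine(Greet("folks"));                  // folks
        Console.WriteLine(Greet("folks", true));            // FOLKS
        Console.WriteLine(Greet("folks", true, '!', '!'));  // FOLKS!!
    }
}
```

When calling positionally, the optional argument has to be supplied before any of the params values, which is why the last call passes true explicitly.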

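using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Extension methods must be declared in a static class; the class name is arbitrary.
public static class StringExtensions
{
    /// <summary>
    /// Splits this string and returns an array of words.
    /// </summary>
    /// <param name="source">The string to split.</param>
    /// <param name="WordsOnly">If true, split on every non-word char and ignore DontIncludeThese;
    /// if false, split on white-space plus the chars in DontIncludeThese.</param>
    /// <param name="DontIncludeThese">List of custom chars to remove from results</param>
    public static string[] ToWords(this string source, bool WordsOnly = true,
        params char[] DontIncludeThese)
    {
        if (string.IsNullOrWhiteSpace(source))
            return new string[] { };

        string pattern;
        if (WordsOnly) // only letters, numbers and underscores
        {
            pattern = @"\W+";
        }
        else // split by white-space plus the custom list of chars
        {
            var symbols = new StringBuilder();
            foreach (char c in DontIncludeThese ?? new char[0])
            {
                // Escape chars that have a special meaning inside [...]
                if (@"\[]^-".IndexOf(c) >= 0)
                    symbols.Append('\\');
                symbols.Append(c);
            }
            pattern = "[" + symbols + @"\s]+";
        }

        // Regex.Split returns empty strings when the text starts or ends
        // with a separator (for example a trailing "…"), so drop them.
        return Regex.Split(source, pattern)
                    .Where(word => word.Length > 0)
                    .ToArray();
    }
}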

 
