String.Split throws OutOfMemoryException


Recently I’ve been doing a fair bit of memory profiling and fixing memory issues at work. One interesting cause of an OutOfMemoryException I encountered came from using String.Split on several large csv files concurrently. Of course, the file size and number of lines needed to trigger the OutOfMemoryException depend on how much memory your computer has. It’s relatively easy to reproduce: first, you need to create a large csv file.

Here’s some code to do it.

const int fiftyMillionsLines = 50000000;
using (var fileStream = File.CreateText(@"C:\temp\large.csv"))
{
    for (int i = 1; i <= fiftyMillionsLines; i++)
    {
        fileStream.WriteLine("line {0}", i);
    }
    // No explicit Close() needed: the using block disposes the writer.
}

I have 16GB of physical memory on my machine, and 50 million lines was the breaking point for me. You can adjust that number to create a csv large enough to break String.Split on your machine. Next, here’s a simple NUnit test that uses the file we generated to show the exception being thrown.

[Test]
public void StringSplit_ForLargeCsv_WillThrowOutOfMemoryException()
{
    var fileContents = File.ReadAllText(@"C:\temp\large.csv");
    // Split has to materialise every line in a single array up front.
    Assert.Throws<OutOfMemoryException>(() => fileContents.Split(new[] {"\r\n", "\n"}, StringSplitOptions.RemoveEmptyEntries));
}

Obviously, some form of processing is required for each line you split, and it seems somewhat inefficient to build the whole collection of split lines upfront. It would be more pragmatic to yield each line and process the lines one at a time. A StringReader fits that purpose. Here’s a LineSplitter class I created to replace the use of String.Split.

public class LineSplitter : ILineSplitter
{
    protected readonly string[] RowDelimiter = new[] { "\r\n", "\n" };

    public IEnumerable<StringLine> SplitYield(string input)
    {
        using (var stringReader = new StringReader(input))
        {
            int rowIndex = 0;
            string rowString;
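            // ReadLine returns null at the end of input and "" for a blank line,
            // so loop until null and skip blanks instead of stopping at the first one.
            while ((rowString = stringReader.ReadLine()) != null)
            {
                if (rowString.Length == 0) continue;
                rowIndex++;
                yield return new StringLine(rowString, rowIndex);
            }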
            stringReader.Close();
        }
    }
}

public class StringLine
{
    internal StringLine(string line, int rowIndex)
    {
        Value = line;
        RowIndex = rowIndex;
    }

    public string Value { get; private set; }

    public int RowIndex { get; private set; }
}

Using the same generated csv file, we can now run a test against LineSplitter.

[Test]
public void UsingLineSplitter_ForLargeCsv_WillNotThrowOutOfMemoryException()
{
    var fileContents = File.ReadAllText(@"C:\temp\large.csv");

    var lineSplitter = new LineSplitter();
    Assert.DoesNotThrow(()=>lineSplitter.SplitYield(fileContents)
        .Select(stringLine => stringLine.Value).ToList());
}

That test passes, and the entire contents of the large csv can be loaded into memory without an OutOfMemoryException.
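Note that the test above still reads the whole file with File.ReadAllText and collects every line with ToList. If each line can be processed on its own, a possible further step is to stream the lines straight from disk. Here’s a minimal sketch of that idea. It is not part of EdLib: StreamingLineReader and ReadNonEmptyLines are hypothetical names I made up for illustration, the small temp file just stands in for the large csv, and the only framework call doing the lazy reading is File.ReadLines (available since .NET 4).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class StreamingLineReader
{
    // Hypothetical helper (not from EdLib): lazily yields the non-empty
    // lines of a file without holding the whole file in a single string.
    public static IEnumerable<string> ReadNonEmptyLines(string path)
    {
        // File.ReadLines reads lazily, one line per iteration.
        foreach (var line in File.ReadLines(path))
        {
            if (line.Length > 0)
            {
                yield return line;
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Small temporary file standing in for the large csv.
        var path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "line 1", "", "line 2", "line 3" });
        try
        {
            // Lines are counted as they stream past; nothing is collected into a list.
            int count = StreamingLineReader.ReadNonEmptyLines(path).Count();
            Console.WriteLine(count); // prints 3
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Whether this is an option depends on the processing: anything that needs all lines at once (sorting, for example) will still have to hold them in memory.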

LineSplitter code can be found @ EdLib.
