Ever tried to build an import tool that needs to process thousands of CSV files at once? I have, and I learned the hard way that simply starting a thousand file operations simultaneously is a recipe for disaster.

Let’s talk about how to handle this real-world problem in C#. I’ll show you some practical ways to process huge batches of files without bringing your system to its knees. We’ll look at how to use tools like SemaphoreSlim and Task.WhenAll, plus some smart ways to handle async file operations.

What Goes Wrong When Processing Thousands of Files

Let’s visualize the different approaches to file processing and their impacts:

[Diagram: Different File Processing Approaches: Sequential vs. Parallel vs. Throttled with SemaphoreSlim]

When you’re building a tool to import thousands of CSV files, you might start with something like this:

public async Task ProcessAllFilesNaively(List<string> filePaths)
{
    var tasks = new List<Task>();

    foreach (var path in filePaths)
    {
        tasks.Add(ProcessFileAsync(path));
    }

    await Task.WhenAll(tasks);
}

private async Task ProcessFileAsync(string path)
{
    using var reader = new StreamReader(path);
    var content = await reader.ReadToEndAsync();
    // Process content...
}

This looks simple enough, but it falls apart quickly when you throw 5,000 files at it:

Your OS will scream at you: Operating systems limit how many files you can have open at once. Hit that limit and you get crashes.

Your memory usage explodes: With each file read loading the entire content into memory, RAM gets eaten up fast.

Your disk gets hammered: Physical disks (even SSDs) struggle when hit with too many random read operations at once.

The thread pool gets swamped: Too many I/O operations can actually block other parts of your application from getting work done.

Throttling with SemaphoreSlim: The Traffic Cop for Your Tasks

The secret to handling loads of files is to limit how many you process at once. That's where SemaphoreSlim comes in. Think of it as a traffic cop for your tasks:

public async Task ProcessFilesWithSemaphore(List<string> filePaths)
{
    // Process about 2 files per CPU core at once
    int maxConcurrency = Environment.ProcessorCount * 2;
    using var semaphore = new SemaphoreSlim(maxConcurrency);
    var tasks = new List<Task>();

    foreach (var path in filePaths)
    {
        await semaphore.WaitAsync(); // Wait for an open slot

        tasks.Add(Task.Run(async () =>
        {
            try
            {
                await ProcessFileAsync(path);
            }
            finally
            {
                semaphore.Release(); // Always release when done!
            }
        }));
    }

    await Task.WhenAll(tasks);
}

This approach limits the number of concurrent file operations to a manageable level (here, twice the number of processor cores). The SemaphoreSlim acts as a gatekeeper, ensuring we don’t overwhelm system resources.

Here’s how the SemaphoreSlim controls the flow of tasks:

[Diagram: SemaphoreSlim as a Traffic Controller: Managing Concurrent File Operations with Worker Threads]

Better Ways to Read Files

Just controlling how many files we process isn’t enough. We also need to be smart about how we read each file:

private async Task ProcessFileAsync(string path)
{
    // Set up a proper async stream
    using var fileStream = new FileStream(
        path,
        FileMode.Open,
        FileAccess.Read,
        FileShare.Read,
        bufferSize: 4096,
        useAsync: true); // This flag makes a huge difference!

    using var reader = new StreamReader(fileStream);

    // Process one line at a time instead of the whole file
    string line;
    while ((line = await reader.ReadLineAsync()) != null)
    {
        await ProcessCsvLineAsync(line);
    }
}

Here’s what makes this approach better:

  1. The useAsync: true flag tells .NET to use truly non-blocking I/O operations
  2. Reading line by line avoids loading entire files into memory at once
  3. The 4096-byte buffer size works well for most scenarios (it matches common disk page sizes)
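
If you're on a recent .NET version (7 or later), File.ReadLinesAsync gives you the same line-by-line streaming with less ceremony. Here's a minimal sketch, reusing the ProcessCsvLineAsync helper from above:

private async Task ProcessFileWithReadLinesAsync(string path)
{
    // File.ReadLinesAsync streams lines lazily as an IAsyncEnumerable<string>,
    // so the whole file never sits in memory at once
    await foreach (var line in File.ReadLinesAsync(path))
    {
        if (string.IsNullOrWhiteSpace(line)) continue;
        await ProcessCsvLineAsync(line);
    }
}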

A Real CSV Batch Processor That Won’t Fall Over

Let’s build something that actually works in the real world, a CSV import system that can handle thousands of files:

public class CsvBatchProcessor
{
    private readonly int _maxConcurrency;
    private readonly IDataImportService _importService;

    public CsvBatchProcessor(IDataImportService importService, int? maxConcurrency = null)
    {
        _importService = importService;
        // Default to 2 times CPU count, but let caller override
        _maxConcurrency = maxConcurrency ?? Environment.ProcessorCount * 2;
    }

    public async Task<BatchImportResult> ImportFilesAsync(IEnumerable<string> filePaths)
    {
        using var semaphore = new SemaphoreSlim(_maxConcurrency);
        var tasks = new List<Task<FileImportResult>>();
        var result = new BatchImportResult();

        foreach (var path in filePaths)
        {
            await semaphore.WaitAsync();

            var task = ProcessSingleFileAsync(path, semaphore)
                .ContinueWith(t =>
                {
                    if (t.IsFaulted)
                    {
                        // Capture errors so one bad file doesn't sink everything
                        return new FileImportResult
                        {
                            FileName = Path.GetFileName(path),
                            Success = false,
                            Error = t.Exception?.InnerException?.Message
                        };
                    }
                    return t.Result;
                });

            tasks.Add(task);
        }

        var results = await Task.WhenAll(tasks);

        // Summarize the results
        result.SuccessCount = results.Count(r => r.Success);
        result.FailureCount = results.Count(r => !r.Success);
        result.FileResults = results.ToList();

        return result;
    }

    private async Task<FileImportResult> ProcessSingleFileAsync(string path, SemaphoreSlim semaphore)
    {
        try
        {
            using var fileStream = new FileStream(
                path,
                FileMode.Open,
                FileAccess.Read,
                FileShare.Read,
                bufferSize: 4096,
                useAsync: true);

            using var reader = new StreamReader(fileStream);

            var records = new List<CsvRecord>();
            string line;
            int lineNumber = 0;

            // Skip the header
            await reader.ReadLineAsync();
            lineNumber++;

            // Go through each line
            while ((line = await reader.ReadLineAsync()) != null)
            {
                lineNumber++;
                if (string.IsNullOrWhiteSpace(line)) continue;

                try
                {
                    var record = ParseCsvLine(line);
                    records.Add(record);
                }
                catch (Exception ex)
                {
                    // Track exactly which line caused problems
                    return new FileImportResult
                    {
                        FileName = Path.GetFileName(path),
                        Success = false,
                        Error = $"Error parsing line {lineNumber}: {ex.Message}"
                    };
                }
            }

            // Save all records from this file in one go
            await _importService.SaveRecordsAsync(records);

            return new FileImportResult
            {
                FileName = Path.GetFileName(path),
                Success = true,
                RecordsProcessed = records.Count
            };
        }
        catch (Exception ex)
        {
            return new FileImportResult
            {
                FileName = Path.GetFileName(path),
                Success = false,
                Error = ex.Message
            };
        }
        finally
        {
            // ALWAYS release the semaphore, even if an exception occurs
            semaphore.Release();
        }
    }

    private CsvRecord ParseCsvLine(string line)
    {
        // You'd want a real CSV parser in production!
        var parts = line.Split(',');
        return new CsvRecord
        {
            Id = int.Parse(parts[0]),
            Name = parts[1],
            Value = decimal.Parse(parts[2])
        };
    }
}

// Simple classes to track our import results
public class FileImportResult
{
    public string FileName { get; set; }
    public bool Success { get; set; }
    public string Error { get; set; }
    public int RecordsProcessed { get; set; }
}

public class BatchImportResult
{
    public int SuccessCount { get; set; }
    public int FailureCount { get; set; }
    public List<FileImportResult> FileResults { get; set; }
}

public class CsvRecord
{
    public int Id { get; set; }
    public string Name { get; set; }
    public decimal Value { get; set; }
}

public interface IDataImportService
{
    Task SaveRecordsAsync(List<CsvRecord> records);
}

This implementation includes:

  1. A configurable concurrency limit
  2. Proper error handling for both file-level and record-level errors
  3. A result tracking system to report success and failure counts
  4. Efficient file reading using async I/O
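
To see how the pieces fit together, here's a minimal usage sketch. The folder path and the DataImportService implementation are placeholders you'd swap for your own:

// Hypothetical wiring: enumerate the CSVs in a folder and run the batch import
var filePaths = Directory.EnumerateFiles(@"C:\imports", "*.csv");

IDataImportService importService = new DataImportService(); // your own IDataImportService implementation
var processor = new CsvBatchProcessor(importService, maxConcurrency: 16);

var result = await processor.ImportFilesAsync(filePaths);

Console.WriteLine($"Imported {result.SuccessCount} files, {result.FailureCount} failed.");
foreach (var failure in result.FileResults.Where(r => !r.Success))
{
    Console.WriteLine($"  {failure.FileName}: {failure.Error}");
}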

Here’s a visualization of the complete CSV batch processing workflow:

[Flowchart: End-to-End CSV Batch Processing: From File Enumeration to Results Collection with Throttled Concurrency]

How Different Approaches Stack Up

Let’s see how the different ways of handling multiple files compare:

Approach | Pros | Cons
--- | --- | ---
One at a time | Dead simple code, uses minimal resources | Painfully slow with lots of files
All at once | Fastest in theory | Can crash your app with “Too many open files” or memory errors
Controlled parallelism (SemaphoreSlim) | Good speed without crashing | Takes a bit more code to implement right

Here’s a visualization of how each strategy impacts your system resources:

[Chart: Resource Impact Analysis: How Different File Processing Strategies Affect System Resources]

Tweaking for Better Performance

Here are some simple adjustments that can make a big difference:

Match your hardware: Different storage types can handle different loads. SSDs can usually take more parallel operations:

// SSDs can handle more parallel reads
int maxConcurrency = isSsd ? Environment.ProcessorCount * 8 : Environment.ProcessorCount * 2;

Buffer size matters: For big files, bigger buffers often help:

// Bigger buffer for reading larger files
const int largerBuffer = 8192; // or 16384 for really big files
using var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read,
    FileShare.Read, bufferSize: largerBuffer, useAsync: true);

Batch database work: Never make one database call per file; batch them up instead:

// Read all the files first
var allRecords = await Task.WhenAll(filePaths.Select(async path =>
{
    // Get records from each file
    return await ExtractRecordsFromFileAsync(path);
}));

// Then one big database operation
await _dataService.SaveBatchAsync(allRecords.SelectMany(r => r).ToList());

Getting Fancy: Smart Throttling Based on System Load

For serious production systems, here's something cool: a file processor that watches system resources and adjusts itself:

public class AdaptiveFileProcessor
{
    private SemaphoreSlim _semaphore;
    private int _currentConcurrency;
    private readonly int _minConcurrency = 4;
    private readonly int _maxConcurrency = Environment.ProcessorCount * 4;

    public AdaptiveFileProcessor()
    {
        _currentConcurrency = Environment.ProcessorCount * 2;
        _semaphore = new SemaphoreSlim(_currentConcurrency);
    }

    public async Task ProcessFilesAsync(List<string> filePaths)
    {
        // Start a background task that watches system resources
        var monitorTask = Task.Run(MonitorSystemResourcesAsync);

        var tasks = new List<Task>();
        foreach (var path in filePaths)
        {
            // Capture the semaphore we wait on so we always release that same
            // instance, even if AdjustConcurrency swaps in a new one mid-run
            var semaphore = _semaphore;
            await semaphore.WaitAsync();

            tasks.Add(Task.Run(async () =>
            {
                try
                {
                    await ProcessFileAsync(path);
                }
                finally
                {
                    semaphore.Release();
                }
            }));
        }

        await Task.WhenAll(tasks);

        // In real code, you'd cancel the monitor when done
    }

    private async Task MonitorSystemResourcesAsync()
    {
        while (true)
        {
            await Task.Delay(5000); // Check every 5 seconds

            // You'd use real performance counters here
            double cpuUsage = GetCurrentCpuUsage();
            double memoryPressure = GetMemoryPressure();

            if (cpuUsage > 85 || memoryPressure > 85)
            {
                // System stressed? Slow down!
                AdjustConcurrency(Math.Max(_currentConcurrency / 2, _minConcurrency));
            }
            else if (cpuUsage < 40 && memoryPressure < 50)
            {
                // System idle? Speed up!
                AdjustConcurrency(Math.Min(_currentConcurrency + 2, _maxConcurrency));
            }
        }
    }

    private void AdjustConcurrency(int newConcurrency)
    {
        if (newConcurrency == _currentConcurrency) return;

        if (newConcurrency > _currentConcurrency)
        {
            // We can add more slots easily
            _semaphore.Release(newConcurrency - _currentConcurrency);
        }
        else
        {
            // But reducing is tricky - we make a new semaphore
            var oldSemaphore = _semaphore;
            _semaphore = new SemaphoreSlim(newConcurrency);
            // Old semaphore gets cleaned up when all tasks release it
        }

        _currentConcurrency = newConcurrency;
        Console.WriteLine($"Adjusted concurrency to {newConcurrency}");
    }

    // You'd implement these for real
    private double GetCurrentCpuUsage() => 50;
    private double GetMemoryPressure() => 50;
}

The adaptive system works by continuously monitoring resource usage and adjusting the concurrency:

[State diagram: Adaptive Concurrency Control: Dynamic Adjustment Based on CPU and Memory Pressure]

Wrapping Up

When it comes to processing thousands of files in C#, going all-in with parallelism isn’t the answer, and neither is processing them one at a time. The sweet spot is controlled parallelism, doing enough at once to be fast, but not so many that you crash.

With tools like SemaphoreSlim, proper async file techniques, and smart batching, you can build systems that handle huge file loads without breaking a sweat.

Next time you need to import 5,000 CSVs, don't worry. Just remember to throttle your operations, use real async I/O, and be smart about resource usage. Your users (and your server) will thank you!

Frequently Asked Questions

How many files can I process concurrently in C#?

There’s no hard limit, but processing too many files simultaneously can overwhelm system resources. A good starting point is 50–100 concurrent operations, adjusted based on your hardware, file sizes, and processing needs. Use SemaphoreSlim to control concurrency and monitor system performance to find the optimal number for your scenario.

Why is my file processing application running out of memory?

You’re likely keeping too many file contents in memory at once or starting too many tasks without throttling. Implement batching to process files in smaller groups, use streams instead of loading entire files into memory, dispose of resources properly, and consider throttling with SemaphoreSlim to limit concurrent operations.
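
One simple way to do that (assuming .NET 6 or later, where Enumerable.Chunk is available) is to work through the files in fixed-size batches, so only one group is in flight at a time:

// Process files in batches of 100; only one batch's worth of tasks runs at once
foreach (var batch in filePaths.Chunk(100))
{
    var tasks = batch.Select(path => ProcessFileAsync(path));
    await Task.WhenAll(tasks);
}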

Should I use Parallel.ForEach or Task.WhenAll for processing multiple files?

Task.WhenAll with async/await is usually better for IO-bound operations like file processing because it doesn’t block threads. Parallel.ForEach is designed for CPU-bound work and uses thread pool threads. With Task.WhenAll, you also get more control over concurrency limits using SemaphoreSlim for throttling.
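
If you do want a built-in option, .NET 6 added Parallel.ForEachAsync, which awaits async bodies and has a degree-of-parallelism limit baked in. A rough equivalent of the throttled approach above might look like this:

// Parallel.ForEachAsync (.NET 6+) awaits the async body and caps concurrency for you
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 2 };

await Parallel.ForEachAsync(filePaths, options, async (path, cancellationToken) =>
{
    await ProcessFileAsync(path);
});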

How can I make my file reads faster in C#?

Use async methods (ReadAllTextAsync, ReadAllLinesAsync), enable FileOptions.Asynchronous when creating FileStreams, process files concurrently but with throttling, consider memory-mapped files for large files, use buffered reads, and if possible, read from SSDs rather than HDDs. Also check your virus scanner settings, as they can significantly slow down file operations.
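
As a concrete example, the FileStream constructor overload that takes FileOptions lets you combine Asynchronous with SequentialScan as a hint that you'll read the file front to back. A quick sketch (whether it helps depends on your hardware and workload):

// FileOptions.Asynchronous enables true async I/O; SequentialScan hints at the access pattern
using var stream = new FileStream(
    path,
    FileMode.Open,
    FileAccess.Read,
    FileShare.Read,
    bufferSize: 4096,
    FileOptions.Asynchronous | FileOptions.SequentialScan);

using var reader = new StreamReader(stream);
var content = await reader.ReadToEndAsync();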

How do I handle errors when processing thousands of files?

Use a combination of try/catch blocks within individual file processing tasks and track failed files separately. Consider implementing retry logic with exponential backoff for transient errors. Avoid letting one file failure crash the entire batch by catching exceptions at the task level and logging details for later investigation.
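
A minimal retry wrapper with exponential backoff might look like this; the attempt count, the one-second base delay, and the choice to treat IOException as transient are all assumptions you'd tune for your workload:

private async Task ProcessWithRetryAsync(string path, int maxAttempts = 3)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            await ProcessFileAsync(path);
            return;
        }
        catch (IOException) when (attempt < maxAttempts)
        {
            // Back off 1s, 2s, 4s... before the next attempt;
            // the final failure is allowed to propagate to the caller
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
        }
    }
}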

How can I show progress while processing thousands of files?

Use a progress reporting pattern with IProgress<T> and Progress<T>. Update a counter each time a file completes and report the percentage based on total files. For console applications, consider a simple progress bar. For UIs, update the UI thread safely using the Progress<T> class which marshals callbacks to the creating thread.
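
Here's one way that could look, layered on the SemaphoreSlim pattern from earlier; the Interlocked counter and the percentage math are the only new pieces:

public async Task ProcessFilesWithProgressAsync(List<string> filePaths, IProgress<int> progress)
{
    using var semaphore = new SemaphoreSlim(Environment.ProcessorCount * 2);
    int completed = 0;

    var tasks = filePaths.Select(async path =>
    {
        await semaphore.WaitAsync();
        try
        {
            await ProcessFileAsync(path);
        }
        finally
        {
            semaphore.Release();
            // Interlocked keeps the counter accurate across concurrent completions
            int done = Interlocked.Increment(ref completed);
            progress.Report(done * 100 / filePaths.Count);
        }
    });

    await Task.WhenAll(tasks);
}

// Progress<T> marshals the callback to the context that created it (e.g. the UI thread)
var progress = new Progress<int>(percent => Console.WriteLine($"{percent}% complete"));
await ProcessFilesWithProgressAsync(filePaths, progress);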