
I have been searching for a way to read a very large CSV file. It's over 100 GB, and I need to know how to deal with processing it in chunks and how to make the concatenation faster.

    %%time
    import time
    import pandas as pd

    filename = "../code/csv/file.csv"
    lines_number = sum(1 for line in open(filename))
    lines_in_chunk = 100  # I don't know what chunk size is best
    counter = 0
    completed = 0
    reader = pd.read_csv(filename, chunksize=lines_in_chunk)

CPU times: user 36.3 s, sys: 30.3 s, total: 1min 6s
Wall time: 1min 7s

This part doesn't take long, but the problem is the concat:

    %%time
    df = pd.concat(reader, ignore_index=True)

This part takes too long and also uses too much memory. Is there a way to make this concat step faster and more memory-efficient?

slowmonk

1 Answer


The file is too big to handle the standard way. You can process it chunk by chunk:

    for chunk in reader:
        chunk['col1'] = chunk['col1'] ** 2  # and so on: apply your processing to each chunk
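If the result doesn't have to live in memory as a single DataFrame, a common pattern is to write each processed chunk out as you go instead of concatenating everything at the end. A minimal sketch, where the file names, the chunk size, and the `col1` transformation are just placeholders:

    import pandas as pd

    reader = pd.read_csv("file.csv", chunksize=100_000)  # larger chunks usually amortize overhead better
    first = True
    for chunk in reader:
        chunk['col1'] = chunk['col1'] ** 2               # example per-chunk processing
        # write the processed chunk immediately; append after the first write
        chunk.to_csv("out.csv", mode="w" if first else "a",
                     header=first, index=False)
        first = False

This keeps memory bounded by roughly one chunk at a time instead of the whole 100 GB.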

Or dump your CSV file into a database.
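For example, a rough sketch of loading the chunks into SQLite with pandas' `to_sql` (the database file name, table name, and chunk size below are just placeholders):

    import sqlite3
    import pandas as pd

    # hypothetical SQLite database file and table name
    con = sqlite3.connect("file.db")
    for chunk in pd.read_csv("file.csv", chunksize=100_000):
        chunk.to_sql("data", con, if_exists="append", index=False)

    # later, query only the rows/columns you actually need
    df = pd.read_sql_query("SELECT col1 FROM data LIMIT 10", con)
    con.close()

Once the data is in a database you can filter and aggregate with SQL instead of loading the whole file into pandas.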

To count the number of rows:

    num_of_rows = 0
    for chunk in reader:
        num_of_rows += len(chunk)  # count actual rows; the last chunk may be smaller than lines_in_chunk

    # workaround: count lines with wc via bash from Python
    import subprocess
    subprocess.check_output(["wc", "-l", "file.csv"])
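Note that `check_output` returns bytes that also contain the filename, so the count itself would be something like `int(subprocess.check_output(["wc", "-l", "file.csv"]).split()[0])`.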

fuwiak