
I have been searching for a way to read a very large CSV file. It's over 100 GB, and I need to know how to deal with processing it in chunks and how to make the concatenation faster.

    %%time
    import time
    import pandas as pd

    filename = "../code/csv/file.csv"
    lines_number = sum(1 for line in open(filename))
    lines_in_chunk = 100  # I don't know what chunk size is best
    counter = 0
    completed = 0
    reader = pd.read_csv(filename, chunksize=lines_in_chunk)

CPU times: user 36.3 s, sys: 30.3 s, total: 1min 6s
Wall time: 1min 7s

This part doesn't take long, but the problem is the concat:

    %%time
    df = pd.concat(reader, ignore_index=True)

This part takes too long and also uses too much memory. Is there a way to make this concat step faster and more memory-efficient?

slowmonk

1 Answer


The file is too big to handle the standard way. You can process it chunk by chunk:

    for chunk in reader:
        chunk['col1'] = chunk['col1'] ** 2  # and so on: apply your processing to each chunk
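If the result doesn't have to live in memory as a single DataFrame, a common pattern is to write each processed chunk out as you go instead of concatenating everything at the end. A minimal sketch, where the file names, the chunk size, and the `col1` transformation are just placeholders:

    import pandas as pd

    reader = pd.read_csv("file.csv", chunksize=100_000)  # larger chunks usually amortize overhead better
    first = True
    for chunk in reader:
        chunk['col1'] = chunk['col1'] ** 2               # example per-chunk processing
        # write the processed chunk immediately; append after the first write
        chunk.to_csv("out.csv", mode="w" if first else "a",
                     header=first, index=False)
        first = False

This keeps memory bounded by roughly one chunk at a time instead of the whole 100 GB.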

Or dump your CSV file into a database.
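For example, a rough sketch of loading the chunks into SQLite with pandas' `to_sql` (the database file name, table name, and chunk size below are just placeholders):

    import sqlite3
    import pandas as pd

    # hypothetical SQLite database file and table name
    con = sqlite3.connect("file.db")
    for chunk in pd.read_csv("file.csv", chunksize=100_000):
        chunk.to_sql("data", con, if_exists="append", index=False)

    # later, query only the rows/columns you actually need
    df = pd.read_sql_query("SELECT col1 FROM data LIMIT 10", con)
    con.close()

Once the data is in a database you can filter and aggregate with SQL instead of loading the whole file into pandas.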

To count the number of rows:

    num_of_rows = 0
    for chunk in reader:
        num_of_rows += len(chunk)  # count actual rows; the last chunk may be smaller than lines_in_chunk

    # workaround: count lines with wc via bash from Python
    import subprocess
    subprocess.check_output(["wc", "-l", "file.csv"])
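Note that `check_output` returns bytes that also contain the filename, so the count itself would be something like `int(subprocess.check_output(["wc", "-l", "file.csv"]).split()[0])`.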

fuwiak