
Can anyone recommend a command-line tool for converting a large CSV file into HDF5 format?

Tauno

1 Answer

  • 1st approach: write the DataFrame with df.to_hdf and append to the same key:
import numpy as np
import pandas as pd

#filename = '/tmp/test.hdf5'
filename = r'D:\test.hdf5'  # raw string so the backslash is not read as an escape

df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['C1', 'C2'])
print(df)

   C1  C2
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9

Save to HDF5

df.to_hdf(filename, 'data', mode='w', format='table')
del df  # allow df to be garbage collected

Append more data

df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['C1', 'C2'])
df2.to_hdf(filename, 'data', append=True)

print(pd.read_hdf(filename, 'data'))

  • 2nd approach: you could append to an HDFStore instead of calling df.to_hdf:
import numpy as np
import pandas as pd

#filename = '/tmp/test.hdf5'
filename = r'D:\test.hdf5'  # raw string so the backslash is not read as an escape
store = pd.HDFStore(filename)

for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['C1', 'C2'])
    store.append('data', df)

store.close()

store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()

  • 3rd approach: use the chunksize parameter of pd.read_csv and append each chunk to the HDF file, which was answered here; a sketch is shown below.
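For completeness, here is a minimal sketch of that 3rd approach, assuming the input is a file named large_input.csv and the output goes to output.h5 (both placeholder names). pd.read_csv with chunksize yields one DataFrame per chunk, and HDFStore.append writes each chunk to the same table, so the full CSV never has to fit in memory:

import pandas as pd

csv_path = 'large_input.csv'   # placeholder: path to the big CSV
hdf_path = 'output.h5'         # placeholder: path for the HDF5 output

store = pd.HDFStore(hdf_path, mode='w')
for chunk in pd.read_csv(csv_path, chunksize=100000):
    # each chunk is an ordinary DataFrame; append it to the 'data' table
    store.append('data', chunk, index=False)
store.close()

One caveat: if a later chunk contains longer strings than the first one, store.append may complain about the column width; passing min_itemsize to store.append is the usual workaround.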

Personally, I like the 1st and 2nd approaches.

Shayan Shafiq
Mario