Questions tagged [processing]

20 questions
15
votes
2 answers

What is the difference between Hadoop and NoSQL?

I have heard about many tools and frameworks that help people process their data in a big data environment. One is called Hadoop and another is the NoSQL concept. How do they differ in terms of processing? Are they complementary?
рüффп
6
votes
2 answers

Cheat Sheet of UNIX commands for Data Science

I was looking for a cheat sheet of UNIX commands, which are specifically usable for data science. I mean an introduction to very basic commands (starting from cd, ls, pwd, to some still simple but usable for data - e.g. wc, a few simple things with…
Piotr Migdal
6
votes
1 answer

How to split natural language script into segments?

I have a bunch of .txt and .srt files extracted from a MOOC website; they are the scripts of the videos. I would like to segment the scripts into parts such that each part falls into one of the following categories: MainConceptDescription->…
A.D.
4
votes
1 answer

Alignment of square nonorientable images/data

Another post where I don't know enough terminology to describe things efficiently. For the comments, please suggest some tags and keywords I can add to this post to make it better. Say I have a 2D data structure where 'orientation' doesn't matter.…
Mark
3
votes
2 answers

Storing a large dataset for processing and analysis

I am new to data engineering and wanted to know what the best way is to store more than 3000 GB of data for further processing and analysis. I am specifically looking for open source resources. I have explored many data formats for storage. The…
3
votes
1 answer

Is it possible to automate generating reproducibility documentation?

First, I think it's worth stating what I mean by replication & reproducibility: Replication of analysis A results in an exact copy of all inputs and processes that are supply and result in incidental outputs in analysis B. Reproducibility of…
blunders
3
votes
2 answers

Optimization of pandas row iteration and summation

I'm wondering if anyone can provide some input on improving the speed and calculations of a pandas result. What I am trying to obtain is a summation of IDs in one table (player table) based on each row of a second table (UUID). Functionally each…
Cody
3
votes
3 answers

OCR / Text Recognition and Recovery Problem

I am working on a research project that deals with American military casualties during WWII. Specifically, I am attempting to construct a count of casualties for each service at the county level. There are two sources of data here, each presenting…
ekrose
3
votes
1 answer

Pre-processing (center, scale, impute) among training sets (different forms) and the test set - what is a good approach?

I am currently working on a multi-class classification problem with a large training set. However, it has some specific characteristics, which led me to experiment with it, resulting in a few versions of the training set (as a result of…
2
votes
1 answer

Meltdown patch impact on data processing speeds

The patch for the Meltdown vulnerability disables speculative execution, which will impact all processing activities. The degree of impact is highly dependent on the type of processing being done. Is there hard data or experience of how machine…
GdD
2
votes
1 answer

XGBoost GPU version not outperforming CPU on small dataset despite parameter tuning – suggestions needed

I'm currently working on a Parallel and Distributed Computing project where I'm comparing the performance of both XGBoost and CatBoost when trained on CPU vs GPU. The goal is to demonstrate how GPU acceleration can improve training time, especially…
Mxneeb
1
vote
1 answer

Do I load all files at once or one at a time?

I currently have $1700+$ CSV files. Each of them is in the same format and structure, give or take a row or possibly a column at the end. Each CSV is $\approx 3.8$ MB. I need to perform a transformation on each file: Extract one data set, perform a…
1
vote
0 answers

CycleGAN vs. AutoEncoder for transforming sketches into images

I'm playing around with the use of deep learning on images and have done quite a bit of work: colorizing black and white images, for example, or fixing old damaged photos. Today I want to tackle a new problem, concerning the conversion of sketches into…
1
vote
2 answers

Running huge datasets with R

I'm trying to run some analysis on some big datasets (e.g. 400k rows vs. 400 columns) with R (e.g. using neural networks and recommendation systems). But it's taking too long to process the data (with huge matrices, e.g. 400k rows vs. 400k…
1
vote
1 answer

Gunicorn workers timeout

I'm using Flask, where I load some pre-trained machine learning models once. I'm also using Gunicorn, usually with 2 or 4 workers, to handle parallel requests. Every request contains some texts that I want to analyze. I'll explain my problem with a…
porfgian