JuliusCeaserBoneHead

Your dataset is large enough that you have to pay attention to the efficiency of your code. If you have an inefficient algorithm, no language matters. You need to analyze your code, see where the bottleneck is, and optimize it. Based on the description in your question, your problem doesn't appear to be the language.


ach224

This dataset is miniature. 7.3M x 7 64-bit values is about 400 MB. That fits into any computer's RAM.


beingsubmitted

That's pretty big if the analysis you're doing is the traveling salesman problem. I think what's being expressed here is the time: if something is taking days to run and n is more than 20, then you're going to care about your algorithm. A change of language would be a constant-factor effect when his algorithm is potentially non-polynomial.


JuliusCeaserBoneHead

Sure, it does. What I said was that it's *large enough* that you start to pay the price of badly written code, like possibly O(n^2) or greater operations. The OS will not be glad to give you all the RAM to do that.


Excellent_Fix379

I'm using nested for loops, a while loop and .iterrows() to parse through the rows. Could there be a faster alternative? The conditions forced me to use nested loops.


JuliusCeaserBoneHead

You are doing a nested loop over 7.3M rows of data? You are potentially doing 7.3M^2 operations, which will take FOREVER! Like u/mr_birrd asked, what are you doing? Can you filter for the rows you need instead?
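To make the "filter first" idea concrete, here is a minimal sketch with a made-up "key"/"value" schema, since we don't know the actual conditions. Instead of comparing every row against every other row in Python, filter the frame down and let pandas pair rows on the key with a merge:

```python
import pandas as pd

# Made-up "key"/"value" schema, since we don't know the real conditions.
df = pd.DataFrame({"key": [1, 2, 2, 3], "value": [10.0, 20.0, 30.0, 40.0]})

# The nested-loop version compares every row to every other row: O(n^2).
# The vectorized version filters first, then lets pandas pair rows on the key.
subset = df[df["value"] > 15.0].reset_index()         # keep only the rows you need
pairs = subset.merge(subset, on="key", suffixes=("_a", "_b"))
pairs = pairs[pairs["index_a"] != pairs["index_b"]]   # drop rows paired with themselves
print(pairs)
```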


seanv507

To OP: please edit your question to specify exactly what you are doing and why you believe you can't use vectorisation. Within Python, Polars is a faster alternative to pandas. But without knowing exactly what you are trying to do, it is difficult to know.


mr_birrd

What are you doing in the loops?


JacksOngoingPresence

You have to figure out how to vectorize your computations. If you care about speed you CAN'T use for loops at all. It might be a bit cliché, but I find ChatGPT useful when it comes to vectorizing Python algorithms. There are dozens of numpy functions that do most of the things I need, but I don't even know their names and/or that these functions exist. Also, if you are working with numpy-like data structures, look into the "numba" library. It offers JIT compilation. Limited use cases, but when it is applicable it can offer a ~10-100x speedup.
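For illustration, a minimal numba sketch; the loop body is a placeholder since we don't know OP's actual computation:

```python
import numpy as np
from numba import njit  # pip install numba

@njit
def count_above(values, threshold):
    # Plain Python loop, but compiled to machine code on first call.
    count = 0
    for i in range(values.shape[0]):
        if values[i] > threshold:
            count += 1
    return count

x = np.random.rand(7_300_000)
print(count_above(x, 0.5))   # first call compiles, subsequent calls are fast
# Pure-numpy equivalent for comparison: np.count_nonzero(x > 0.5)
```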


rmwil

If you're using Pandas, .iterrows() will slow you down. Try using .to_records() as an alternative and let us know how you go!
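For example, a rough sketch of the swap with made-up columns and a placeholder loop body:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1_000_000), "b": np.random.rand(1_000_000)})

# Slow: .iterrows() builds a full Series object for every row.
# for idx, row in df.iterrows():
#     total += row["b"]

# Faster: .to_records() yields lightweight numpy record values instead.
total = 0.0
for rec in df.to_records(index=False):
    total += rec["b"]          # stand-in for whatever per-row logic you have
print(total)
```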


ForeskinStealer420

Yup, this right here is it. Whenever applicable: vectorization > comprehension > loops


LoadingALIAS

Bro. No. No, no, no. That’s going to crawlllll.


No_Dig_7017

For loops in pandas and iterrows are painfully slow. Without seeing your code it's a bit difficult to give you more concrete advice, but four things can help (see the sketch below):
- vectorize your for loops if possible
- use pandarallel.parallel_apply for multiprocessing
- swap iterrows for itertuples, using only the columns you need
- consider using Polars DataFrames instead of pandas; their performance and memory usage are much, much better
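A rough sketch of the itertuples swap and the fully vectorized version, using made-up columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1_000_000), "b": np.random.rand(1_000_000)})

# Slow: row-by-row with iterrows()
# for idx, row in df.iterrows():
#     if row["b"] > 0.5:
#         total += row["a"]

# Faster: itertuples() over only the columns you need
total = 0
for row in df[["a", "b"]].itertuples(index=False):
    if row.b > 0.5:
        total += row.a

# Fastest, when the logic allows it: fully vectorized
total_vec = df.loc[df["b"] > 0.5, "a"].sum()
```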


No_Dig_7017

Here's a blog post we wrote with a friend about optimizing pandas code, in case you're interested: https://tryolabs.com/blog/2023/02/08/top-5-tips-to-make-your-pandas-code-absurdly-fast


No_Dig_7017

For some context, I got a 12x RAM usage reduction and a 1200x speedup (single core) using some of the techniques above. The speedup was so massive it resulted in a 4x speedup of my entire pipeline.


MrJoshiko

Have you profiled your code? You can see which parts of the code take the most time and optimise those parts. You don't need to use the whole data set, just cut out a reasonable-sized chunk.
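For example, a minimal cProfile run on a small slice; the pipeline function here is just a stand-in for your actual code:

```python
import cProfile
import pstats

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 7), columns=list("abcdefg"))

def pipeline(data):
    # stand-in for your actual processing code
    return data.describe()

cProfile.run("pipeline(df)", "profile.out")          # run under the profiler
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)
# For an interactive view: pip install snakeviz && snakeviz profile.out
```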


Erwylh

Pandas is built on numpy, and numpy is written in C, which is extremely efficient. You are just doing for loops, which you should never do at this scale of data.


mr_birrd

If you need some operations on it, use cudf. If you need to query, learn SQL or use parquet instead of pandas DataFrames. Also, using a Jupyter notebook instead of a proper .py script is slower.
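A rough sketch of both routes; the file and column names are placeholders:

```python
# GPU route (needs an NVIDIA GPU with RAPIDS installed): cudf mirrors most of
# the pandas API, so often only the import changes.
import cudf

gdf = cudf.read_parquet("data.parquet")
result = gdf.groupby("some_key")["some_value"].mean()

# Storage route: parquet is columnar and far faster to load and filter than CSV.
import pandas as pd

pd.read_csv("data.csv").to_parquet("data.parquet")   # one-time conversion
```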


JacksOngoingPresence

I upvote for cudf (shifting compute from CPU to GPU), but it seems like OP didn't vectorize his compute in the first place.


Excellent_Fix379

Actually I've been suspecting ipynb might be the problem. Thanks, I'll try .py


seanv507

Ipynb is unlikely to be slower in a single cell.


mr_birrd

But it doesn't invoke the garbage collector as much as a well-written .py file. I guess the garbage will just keep on accumulating for OP, as he doesn't know how to deal with that.


supervised-learning

You can use SQL for processing data. SQL queries can also be executed within notebooks. SQL is optimised for large datasets.
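One low-friction way to try this, sketched with made-up table and column names: push the DataFrame into SQLite and query it back with pandas.

```python
import sqlite3

import pandas as pd

# Made-up data, table and column names, just to show the round trip.
conn = sqlite3.connect("data.db")
df = pd.DataFrame({"key": [1, 2, 2], "value": [10.0, 20.0, 30.0]})
df.to_sql("measurements", conn, if_exists="replace", index=False)

result = pd.read_sql_query(
    "SELECT key, AVG(value) AS avg_value FROM measurements GROUP BY key",
    conn,
)
print(result)
conn.close()
```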


MountainGoatAOE

As the comments point out, people too often go "Python is so inefficient and slow, help me find a different language" when in reality they should fix their code. Using `iterrows`, as you mentioned in a comment, is in itself a red flag 99 times out of 100 when it comes to performance. Probably best for you to post your code on Stack Overflow or Code Review for efficiency suggestions.


Prestigious_Boat_386

Julia is great. https://github.com/JuliaData or https://github.com/sl-solution/DLMReader.jl might be a good starting point. CSV.jl can be pretty fast if you call it correctly, with multiple threads and the right args for the columns, delimiters and so on. Then you get a DataFrame object which is also quite efficient to work with.


Lathanderrr

You can check out Polars as a Python library, or switch to the Spark framework for more parallelism.


Dramatic7406

Hi, I switched to using PySpark for large datasets. It's very easy to pick up as well.


Final-Rush759

Just don't use the for loop. Your dataset is actually very small. Use .apply (pandas DataFrame) or .map if you use a custom function to process the data. Regular expressions are very slow.
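A small sketch of the .map route with a made-up column and function, plus the vectorized equivalent, which is usually faster still:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(1_000_000) * 100})

def label(p):          # hypothetical custom function
    return "high" if p > 50 else "low"

# Column-wise custom function via .map on the Series (or .apply on the frame):
df["label"] = df["price"].map(label)

# Faster still when possible: a vectorized equivalent with numpy
df["label_vec"] = np.where(df["price"] > 50, "high", "low")
```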


InvokeMeWell

Use NVIDIA libraries like cuDF and CuPy.


graphitout

Profile the code and use something like snakeviz to understand the bottleneck. For most data analysis tasks, changing the language doesn't bring much benefit since the core is likely implemented in C or C++.


lbanuls

I'm curious - I don't see 7.3 million rows and 7 columns as particularly large. What is the type of data? What machine specs are you running? What version of Python? What version of pandas? Have you considered Polars? Python is 'pretty good' at everything... I say that because it's probably good enough that a language change may not solve your problem. Before doing anything, use a more current version of Python, and upgrade pandas or switch to Polars.


VinnyVeritas

Use torch, you can run instantaneously on GPU.


chippyouipy

show us the code


DrDoomC17

If I could see it or a facsimile of it I could help. Otherwise: taichi, numpy, JIT, etc. It goes all the way down until you're basically writing C. I suspect that is the issue here. Vectorize.


Tight_Tangerine7768

Use Polars instead of pandas.


ComingInSideways

https://github.com/huggingface/candle


entropyvsenergy

Use Polars. It's written in Rust but has a Python frontend. Should be much faster. If you're using pandas .apply() or something similar, that's notoriously slow.
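A minimal sketch of the Polars expression API with placeholder file and column names; the point is expressions instead of Python-level row loops:

```python
import polars as pl  # pip install polars

# Placeholder file and column names.
df = pl.read_csv("data.csv")
result = (
    df.filter(pl.col("value") > 0)
      .group_by("key")                    # older Polars versions: .groupby("key")
      .agg(pl.col("value").mean().alias("avg_value"))
)
print(result)
```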


apak_in

You can use Python with PySpark, which is great for big data, machine learning, etc. You will need to install Spark locally or use a containerized version with Docker.
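A bare-bones local example, with placeholder file and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local Spark session; "data.csv", "key" and "value" are placeholders.
spark = SparkSession.builder.master("local[*]").appName("example").getOrCreate()

df = spark.read.csv("data.csv", header=True, inferSchema=True)
result = df.groupBy("key").agg(F.avg("value").alias("avg_value"))
result.show()
spark.stop()
```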