Faster indexing in Elasticsearch through a rake task in Rails - amazon-web-services

I have more than 1.5 million records in my MySQL database. If I run the rake task to index the data into AWS Elasticsearch, it takes more than 3 days to complete. Is there an alternative way to do faster indexing?

First of all, use the Bulk API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
You can also disable index refreshing and re-enable it after you have indexed the whole dataset.
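For illustration, here is a minimal sketch of both ideas using the official Python Elasticsearch client (the endpoint, index name, and fetch_rows() source are placeholders; in a Rails app you would do the equivalent with the elasticsearch-ruby gem):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://my-domain.es.amazonaws.com")  # placeholder AWS ES endpoint
INDEX = "records"                                         # placeholder index name

def fetch_rows():
    # Placeholder: replace with a generator that streams your 1.5M MySQL records.
    yield {"id": 1, "title": "example"}

def actions():
    # One bulk action per record instead of one HTTP request per record.
    for row in fetch_rows():
        yield {"_index": INDEX, "_id": row["id"], "_source": row}

# Turn off refreshing while loading, so segments aren't rebuilt after every batch.
es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": "-1"}})

helpers.bulk(es, actions(), chunk_size=5000, request_timeout=120)

# Restore refreshing and make the freshly indexed data searchable.
es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": "1s"}})
es.indices.refresh(index=INDEX)

Batching a few thousand documents per bulk request and refreshing only once at the end is usually a big improvement over indexing record by record from a rake task.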

Related

SQL Azure. Create Index recommendation and performance

I got several CREATE INDEX recommendations on the Azure SQL S3 tier.
Before going ahead, I'd like to know about possible issues when indexing a table with 10 million records.
Can we know the indexing progress or the approximate completion time?
Does indexing work in an asynchronous (or, say, lazy) manner? Or does it block queries to the table/database?
Is there anything we need to know about performance degradation during indexing? If so, can we estimate the amount of degradation?
Does it perform differently from my CREATE INDEX command?
If the database is configured with read-only geo-redundancy, I assume the index configuration itself is replicated as well. But does the indexing job run separately?
If the indexing is performed on each (replicated) database separately, the primary (S3 tier) and the replica (S1) could have different indexing progress. Is that correct?
Can we know the indexing progress or the approximate completion time?
You can get to know the amount of space that will be used, but not the index creation time. You can track the progress, though, using sys.dm_exec_requests.
Also, with SQL Server 2016 (Azure compatibility level 130) there is a new DMV called sys.dm_exec_query_profiles, which can track the status more accurately than the exec requests DMV.
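As an illustration only, both DMVs can be polled from any client while the index build runs; here is a rough Python/pyodbc sketch (the connection string and session_id are placeholders):

import pyodbc

# Placeholder connection string -- adjust server, database and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;UID=user;PWD=secret"
)
cur = conn.cursor()

# Requests currently running index-related commands, with whatever progress columns are populated.
cur.execute("""
    SELECT session_id, status, command, percent_complete, estimated_completion_time
    FROM sys.dm_exec_requests
    WHERE command LIKE '%INDEX%'
""")
for row in cur.fetchall():
    print(row)

# On compatibility level 130+, per-operator progress for a given session.
cur.execute("""
    SELECT physical_operator_name, row_count, estimate_row_count
    FROM sys.dm_exec_query_profiles
    WHERE session_id = ?
""", 55)  # placeholder session_id of the CREATE INDEX session
for row in cur.fetchall():
    print(row)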
Does indexing work in an asynchronous (or, say, lazy) manner? Or does it block queries to the table/database?
There are two ways to create an index:
1. Online
2. Offline
When you create an index online, your table will not be blocked, since SQL Server maintains a separate copy of the index and updates both indexes in parallel.
With the offline approach, you will experience blocking and the table won't be available.
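For example, the two modes are chosen through the ONLINE option of CREATE INDEX; a sketch with hypothetical table and column names, issued through pyodbc to stay consistent with the snippet above:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;UID=user;PWD=secret"
)
conn.autocommit = True
cur = conn.cursor()

# Online: the table stays available for reads and writes while the index is built.
cur.execute("CREATE INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId) WITH (ONLINE = ON)")

# Offline (the default): typically faster, but takes blocking locks on the table.
# cur.execute("CREATE INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId) WITH (ONLINE = OFF)")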
Is there anything we need to know about performance degradation during indexing? If so, can we estimate the amount of degradation?
You will experience additional I/O load and increased memory use. This can't be accurately estimated.
Does it perform differently from my CREATE INDEX command?
CREATE INDEX is a separate statement altogether; I'm not sure what you mean here.
If the database is configured with read-only geo-redundancy, I assume the index configuration itself is replicated as well. But does the indexing job run separately?
If the indexing is performed on each (replicated) database separately, the primary (S3 tier) and the replica (S1) could have different indexing progress. Is that correct?
Index creation is logged, and the whole transaction log is replayed on the secondary as well, so there is no need to rebuild indexes on the secondary.

Collecting Relational Data and Adding to a Database Periodically with Python

I have a project that:
fetches data from Active Directory
fetches data from different services based on the Active Directory data
aggregates the data
about 50,000 rows have to be added to the database every 15 minutes
I'm using PostgreSQL as the database and Django as the ORM tool, but I'm not sure Django is the right tool for such a project. I have to drop and add 50,000 rows of data and I'm worried about performance.
Is there another way to do such a process?
For sure there are other ways, if that's what you're asking. But the Django ORM is quite flexible overall, and if you write your queries carefully there will be no significant overhead. 50,000 rows every 15 minutes is not really that big. I am using the Django ORM with PostgreSQL to process millions of records a day.
50k rows / 15 min is nothing to worry about.
But I'd make sure to use bulk_create to avoid 50k round trips to the database, which might be a problem depending on your database networking setup.
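A minimal sketch of that approach (the Record model, its fields, and the row format are hypothetical):

from django.db import transaction
from myapp.models import Record  # hypothetical app and model

def replace_snapshot(rows):
    """rows: an iterable of dicts aggregated from Active Directory and the other services."""
    with transaction.atomic():
        # Drop the previous snapshot and insert the new one using batched INSERTs.
        Record.objects.all().delete()
        Record.objects.bulk_create(
            (Record(name=r["name"], source=r["source"]) for r in rows),
            batch_size=1000,
        )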
You can write a custom Django management command for this purpose, then call it like:
python manage.py collectdata
Here is the documentation link
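A skeleton of such a command (the app name, command name, and imported helper are just examples; the file would live at myapp/management/commands/collectdata.py):

from django.core.management.base import BaseCommand

from myapp.loader import replace_snapshot  # e.g. the bulk_create helper sketched above (hypothetical module)

class Command(BaseCommand):
    help = "Fetch data from Active Directory and related services, then reload the table."

    def handle(self, *args, **options):
        rows = self.collect()      # fetch + aggregate, however you do it today
        replace_snapshot(rows)     # bulk insert in one transaction
        self.stdout.write(self.style.SUCCESS(f"Loaded {len(rows)} rows"))

    def collect(self):
        # Placeholder: replace with the real Active Directory / service calls.
        return [{"name": "example", "source": "ad"}]

It can then be run from cron (or any scheduler) every 15 minutes, for example:
*/15 * * * * /path/to/venv/bin/python /path/to/project/manage.py collectdata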

AWS Data Pipeline doesn't use DynamoDB's indexes

I have a data pipeline running every hour, running a HiveCopyActivity to select the past hour's data from DynamoDB into S3. The table I'm selecting from has a hash key VisitorID and range key Timestamp, around 4 million rows and is 7.5GB in size. To reduce the time taken for the job, I created a global secondary index on Timestamp but after monitoring Cloudwatch, it seems that HiveCopyActivity doesn't use the index. I've read through all the relevant AWS documentation but can't find any mention of indexes.
Is there a way to force data pipeline to use an index while filtering like this? If not, are there any alternative applications which could transfer hourly (or any other period) data from DynamoDB to S3?
The DynamoDB EMR Hive adapter currently doesn't support using indexes, unfortunately. You would need to write your own sweeper that scans the index and outputs it to S3 - you can check out https://github.com/awslabs/dynamodb-import-export-tool for some basics for implementing an import/export pipeline. The library is essentially a parallel scan framework for sweeping DDB tables.
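As a rough illustration of such a sweeper (table name, GSI name, and bucket are placeholders; note that a Scan with a filter still reads, and is billed for, every item it touches in the index):

import json
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
table = dynamodb.Table("Visits")  # placeholder table name

def export_window(start_ts, end_ts, bucket="my-export-bucket", key_prefix="visits/"):
    """Sweep the Timestamp GSI for one time window and write the rows to S3 as JSON lines."""
    items = []
    kwargs = {
        "IndexName": "Timestamp-index",  # placeholder GSI name
        "FilterExpression": Attr("Timestamp").between(start_ts, end_ts),
    }
    while True:
        resp = table.scan(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            break
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

    body = "\n".join(json.dumps(item, default=str) for item in items)
    s3.put_object(Bucket=bucket, Key=f"{key_prefix}{start_ts}.json", Body=body.encode("utf-8"))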

Migrate from SQL to MongoDB?

I am researching the possibility of migrating data from SQL Server 2012 to MongoDB. My manager specifically asked me to measure the time it takes to process billions of rows in SQL Server versus MongoDB, to decide whether or not to migrate. Any recommendations, suggestions, or places I should visit to research further?
So far I have:
installed MongoDB in my development environment
been able to connect to MongoDB and create databases and collections
The question I have now:
How do I import a database from SQL Server into MongoDB (say, migrate AdventureWorks)?
Thanks in advance!
Some best practices I learned the hard way.
Do a partial import
When planning a MongoDB cluster, you need to have an idea of how big the average document size is. In order to do that, import some 10k records of your data. This gives you an idea, in orders of magnitude, of how long the actual import will take:
t_total(c) ≈ (N_c / n) * t
where t is the time it took to import the n sample documents of collection c, and N_c is the total number of documents in c.
Repeat this for all target collections. After that, issue a
db.stats()
in the mongo shell. You will be presented with some size statistics. You now have approximations of two key factors: the time the import will take (by summing up the results of the above calculation for each collection) and the storage space you will need.
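For illustration, the timing and sizing measurement could look like this with pymongo (URI, database, and collection names are placeholders):

import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["migration_test"]

def estimate_from_sample(name, sample_docs, total_count):
    """Import a sample, then extrapolate total import time and storage for the collection."""
    start = time.monotonic()
    db[name].insert_many(sample_docs)
    elapsed = time.monotonic() - start

    estimated_import_time = elapsed / len(sample_docs) * total_count
    stats = db.command("collStats", name)
    estimated_storage = stats["storageSize"] / len(sample_docs) * total_count
    return estimated_import_time, estimated_storage

# Example: a 10k-document sample for a collection that will eventually hold 5 million.
# sample = fetch_sample_from_sql(...)   # placeholder for your SQL extraction
# print(estimate_from_sample("orders", sample, 5_000_000))

# Database-wide size statistics, the equivalent of db.stats() in the shell:
print(db.command("dbStats"))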
Create the indices on the partial import
Create the indices you are going to need. As for the time calculation, the same as above applies. But there is another consideration: indices should reside in RAM, so you need to extrapolate the actual RAM you will need once all records are migrated.
Chances are it isn't cost efficient to store all the data on one machine, since RAM gets costly after a certain point (calculations are necessary here). If that is the case, you need to shard.
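One way to do that extrapolation, continuing with the placeholder names from the sketch above: create the indexes on the sample collection and scale the reported totalIndexSize up to the full dataset.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["migration_test"]

# Hypothetical index on the sample collection.
db["orders"].create_index([("customer_id", 1), ("created_at", -1)])

sample_count = db["orders"].count_documents({})
total_count = 5_000_000  # placeholder: expected document count after full migration

stats = db.command("collStats", "orders")
estimated_index_ram = stats["totalIndexSize"] / sample_count * total_count
print(f"Estimated index size for the full collection: {estimated_index_ram / 1024**3:.1f} GiB")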
When sharding: Choose a proper shard key
It cannot be overemphasized how important it is to choose a proper shard key right from the start: shard keys cannot be changed. Invest some time with the developers to find a proper shard key.
When sharding: Pre-split chunks
The last thing you want during data migration is to have it delayed by the balancer trying to balance out the chunks. So you should pre-split your chunks and distribute them among your shards.
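A rough pymongo sketch of pre-splitting (database, collection, shard key, split points, and shard names are all hypothetical and depend on your data):

from pymongo import MongoClient

client = MongoClient("mongodb://my-mongos:27017")  # connect to a mongos router (placeholder)
admin = client.admin

# Shard the target collection on a hypothetical ranged shard key.
admin.command("enableSharding", "migration_target")
admin.command("shardCollection", "migration_target.orders", key={"customer_id": 1})

# Pre-split chunks so the initial bulk load spreads across shards instead of
# waiting for the balancer; split points depend on your key distribution.
for point in (1_000_000, 2_000_000, 3_000_000):
    admin.command("split", "migration_target.orders", middle={"customer_id": point})

# Optionally distribute the pre-split chunks explicitly (shard name is a placeholder):
# admin.command("moveChunk", "migration_target.orders",
#               find={"customer_id": 2_000_000}, to="shard0001")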
I've created a Node.js script that replicates an SQL database to MongoDB.
You can find it here.
To use...
Clone the repo:
git clone https://github.com/ashleydavis/sql-to-mongodb
Install dependencies:
cd sql-to-mongodb
npm install
Set up your config:
Edit config.js. Add your SQL connection string and details for your MongoDB database.
Run the script:
node index.js
This can take a while to complete! But when it does you will have a copy of your SQL database in MongoDB. Please let me know if there are any issues.

Zend Search Lucene HTTP 500 Internal Server Error while bulk indexing on small tables

I am just getting started with Zend Search Lucene and am testing on a GoDaddy shared Linux account. Everything is working - I can create and search Lucene Documents. The problem is when I try to index my whole table for the first time I get a HTTP 500 Internal Server Error after about 30 seconds. If I rewrite my query so that I only select 100 rows of my table to index, it works fine.
I have already increased my php memory_limit settings to 128M. The table I'm trying to index is only 3000 rows and I'm indexing a few columns from each row.
Any thoughts?
Zend_Search_Lucene does not work very well for large data-sets in my experience. For that reason I switched the search backend to Apache Lucene in a larger project.
Did you try setting your timeout to something higher than 30 seconds (the default in php.ini)? The memory threshold can also be exceeded easily with 3000 rows, depending on what you're indexing. If you're indexing everything as Text fields, and perhaps also indexing related data, you can easily gobble up that memory.
